Numerical Analysis

6月 242020
 

If you have ever run a Kolmogorov-Smirnov test for normality, you have encountered the Kolmogorov D statistic. The Kolmogorov D statistic is used to assess whether a random sample was drawn from a specified distribution. Although it is frequently used to test for normality, the statistic is "distribution free" in the sense that it compares the empirical distribution to any specified distribution. I have previously written about how to interpret the Kolmogorov D statistic and how to compute it in SAS.

If you can compute a statistic, you can use simulation to approximate its sampling distribution and to approximate its critical values. Although I am a fan of simulation, in this article I show how to compute the exact distribution and critical values for Kolmogorov D statistic. The technique used in this article is from Facchinetti (2009), "A Procedure to Find Exact Critical Values of Kolmogorov-Smirnov Test."

An overview of the Facchinetti method

Suppose you have a sample of size n and you perform a Kolmogorov-Smirnov test that compares the empirical distribution to a reference distribution. You obtain the test statistic D. To know whether you should reject the null hypothesis (that the test came from the reference distribution), you need to be able to compute the probability CDF(D) = Pr(Dn ≤ D). Facchinetti shows that you can compute the probability for any D by solving a system of linear equations. If a random sample contains n observations, you can form a certain (2n+2) x (2n+2) matrix whose entries are composed of binomial probabilities. You can use a discrete heat map to display the structure of the matrix that is used in the Facchinetti method. One example is shown below. For n=20, the matrix is 42 x 42. As Facchinetti shows in Figure 6 of her paper, the matrix has a block structure, with triangular matrices in four blocks.

The right-hand side of the linear system is also composed of binomial probabilities. The system is singular, but you can use a generalized inverse to obtain a solution. From the solution, you can compute the probability. For the details, see Facchinetti (2009).

Facchinetti includes a MATLAB program in the appendix of her paper. I converted the MATLAB program into SAS/IML and improved the efficiency by vectorizing some of the computations. You can download the SAS/IML program that computes the CDF of the Kolmogorov D statistic and all the graphs in this article. The name of the SAS/IML function is KolmogorovCDF, which requires two arguments, n (sample size) and D (value of the test statistic).

Examples of using the CDF of the Kolmorov distribution

Suppose you have a sample of size 20. You can use a statistical table to look up the critical value of the Kolmogorov D statistic. For α = 0.05, the (two-sided) critical value for n=20 is D = 0.294. If you load and call the KolmogorovCDF function, you can confirm that the CDF for the Kolmogorov D distribution is very close to 0.95 when D=0.294:

proc iml;
load  module=KolmogorovCDF;      /* load the function that implements the Facchinetti method */
/* simple example of calling the function */
n = 20;
D = 0.29408;
cdf = KolmogorovCDF(n, D);
print n D cdf;

The interpretation of this result is that if you draw a random samples of size 20 from the reference distribution, there is a 5% chance that the Kolmogorov D value will exceed 0.294. A test statistic that exceeds 0.294 implies that you should reject the null hypothesis at the 0.05 significance level.

Visualize the Kolmogorov D distribution

If you can compute the CDF for one value of D, you can compute it for a sequence of D values. In this way, you can visualize the CDF curve. The following SAS/IML statements compute the CDF for the Kolmogorov D distribution when n=20 by running the computation on a sequence of D values in the open interval (0, 1):

D = do(0.01, 0.99, 0.01);             /* sequence of D values */
CDF = j(1, ncol(D), .);
do i = 1 to ncol(D);
   CDF[i] = KolmogorovCDF(n, D[i]);   /* CDF for each D */ 
end;
title "Kolmogorov CDF, n=20";
call series(D, CDF) grid={x y};       /* create a graph of the CDF */

The graph enables you to read probabilities and quantiles for the D20 distribution. The graph shows that almost all random samples of size 20 (drawn from the reference distribution) will have a D statistic of less than 0.4. Geometrically, the D statistic represents the largest gap between the empirical distribution and the reference distribution. A large gap indicates that the sample was probably not drawn from the reference distribution.

Visualize a family of Kolmogorov D distributions

The previous graph was for n=20, but you can repeat the computation for other values of n to obtain a family of CDF curves for the Kolmogorov D statistic. (Facchinetti writes Dn to emphasize that the statistic depends on the sample size.) The computation is available in the program that accompanies this article. The results are shown below:

This graph enables you to estimate probabilities and quantiles for several Dn distributions.

Visualize a family of density curves

Because the PDF is the derivative of the CDF, you can use a forward difference approximation of the derivative to also plot the density of the distribution. The following density curves are derived from the previous CDF curves:

A table of critical values

If you can compute the CDF, you can compute quantiles by using the inverse CDF. You can use a root-finding technique to obtain a quantile.

In the SAS/IML language, the FROOT function can find univariate roots. The critical values of the Dn statistic are usually tabulated by listing the sample size down columns and the significance level (α) or confidence level (1-α) across the columns. Facchinetti (p. 352) includes a table of critical values in her paper. The following SAS/IML statements show how to recreate the table. For simplicity, I show only a subset of the table for n=20, 25, 20, and 35, and for four common values of α.

/* A critical value is the root of the function
   f(D) = KolmogorovCDF(n, D) - (1-alpha)
   for any specified n and alpha value.
*/
start KolmogorovQuantile(D) global (n, alpha);
   return  KolmogorovCDF(n, D) - (1-alpha);
finish;
 
/* create a table of critical values for the following vector of n and alpha values */
nVals = do(20, 35, 5);
alphaVals = {0.001 0.01 0.05 0.10};
CritVal = j(ncol(nVals), ncol(alphaVals), .);
do i = 1 to ncol(nVals);
   n = nVals[i];
   do j = 1 to ncol(alphaVals);
      alpha = alphaVals[j];
      CritVal[i,j] = froot("KolmogorovQuantile", {0.01, 0.99});
   end;
end;
print CritVal[r=nVals c=alphaVals format=7.5 L="Critical Values (n,alpha)"];

For each sample size, the corresponding row of the table shows the critical values of D for which CDF(D) = 1-α. Of course, a printed table is unnecessary when you have a programmatic way to compute the quantiles.

Summary

Facchinetti (2009) shows how to compute the CDF for the Kolmogorov D distribution for any value of n and D. The Kolmogorov D statistic is often used for goodness-of-fit tests. Facchinetti's method consists of forming and solving a system of linear equations. To make her work available to the SAS statistical programmer, I translated her MATLAB program into a SAS/IML program that computes the CDF of the Kolmogorov D distribution. This article demonstrates how to use the KolmogorovCDF in SAS/IML to compute and visualize the CDF and density of the Kolmogorov D statistic. It also shows how to use the computation in conjunction with a root-finding method to obtain quantiles and exact critical values of the Kolmogorov D distribution.

The post The Kolmogorov D distribution and exact critical values appeared first on The DO Loop.

6月 012020
 

In a previous article, I discussed the definition of the Kullback-Leibler (K-L) divergence between two discrete probability distributions. For completeness, this article shows how to compute the Kullback-Leibler divergence between two continuous distributions. When f and g are discrete distributions, the K-L divergence is the sum of f(x)*log(f(x)/g(x)) over all x values for which f(x) > 0. When f and g are continuous distributions, the sum becomes an integral:
KL(f,g) = ∫ f(x)*log( f(x)/g(x) ) dx
which is equivalent to
KL(f,g) = ∫ f(x)*( log(f(x)) – log(g(x)) ) dx
The integral is over the support of f, and clearly g must be positive inside the support of f for the integral to be defined.

The Kullback-Leibler divergence between normal distributions

I like to perform numerical integration in SAS by using the QUAD subroutine in the SAS/IML language. You specify the function that you want to integrate (the integrand) and the domain of integration and get back the integral on the domain. Use the special missing value .M to indicate "minus infinity" and the missing value .P to indicate "positive infinity."

As a first example, suppose the f(x) is the pdf of the normal distribution N(μ1, σ1) and g(x) is the pdf of the normal distribution N(μ2, σ2). From the definition, you can exactly calculate the Kullback-Leibler divergence between normal distributions:
KL between Normals: log(σ2/σ1) + (σ12 – σ22 + (μ1 – μ2)2) / (2*σ22)

You can define the integrand by using the PDF and LOGPDF functions in SAS. The following computation integrates the distributions on the infinite integral (-∞, ∞), although a "six sigma" interval about μ1, such as [-6, 6], would give the same result:

proc iml;
/* Example 1: K-L divergence between two normal distributions */
start KLDivNormal(x) global(mu1, mu2, sigma1, sigma2);
   return pdf("Normal", x, mu1, sigma1) #
          (logpdf("Normal", x, mu1, sigma1) - logpdf("Normal", x, mu2, sigma2));
finish;
 
/* f is PDF for N(0,1) and g is PDF for N(2,3) */ 
mu1 = 0; sigma1 = 1;
mu2 = 2; sigma2 = 3;
 
call quad(KL, "KLDivNormal", {.M .P});  /* integral on (-infinity, infinity) */
KLExact = log(sigma2/sigma1) + 
          (sigma1##2 - sigma2##2+ (mu1-mu2)##2) / (2*sigma2##2);
print KL KLExact (KL-KLExact)[L="Difference"];

The output indicates that the numerical integration agrees with the exact result to many decimal places.

The Kullback-Leibler divergence between exponential and gamma distributions

A blog post by John D. Cook computes the K-L divergence between an exponential and a gamma(a=2) distribution. This section performs the same computation in SAS.

This is a good time to acknowledge that numerical integration can be challenging. Numerical integration routines have default settings that enable you to integrate a wide variety of functions, but sometimes you need to modify the default parameters to get a satisfactory result—especially for integration on infinite intervals. I have previously written about how to use the PEAK= option for the QUAD routine to achieve convergence in certain cases by providing additional guidance to the QUAD routine about the location and scale of the integrand. Following the advice I gave in the previous article, a value of PEAK=0.1 seems to be a good choice for the K-L divergence between the exponential and gamma(a=2) distributions: it is inside the domain of integration and is close to the mode of the integrand, which is at x=0.

/* Example 2: K-L divergence between exponential and gamma(2). Example from
   https://www.johndcook.com/blog/2017/11/08/why-is-kullback-leibler-divergence-not-a-distance/
*/
start KLDivExpGamma(x) global(a);
   return pdf("Exponential", x) #
          (logpdf("Exponential", x) - logpdf("Gamma", x, a));
finish;
 
a = 2;
/* Use PEAK= option: https://blogs.sas.com/content/iml/2014/08/13/peak-option-in-quad.html */
call quad(KL2, "KLDivExpGamma", {0 .P}) peak=0.1;
print KL2;

The value of the K-L divergence agrees with the answer in John D. Cook's blog post.

Approximating the Exponential Distribution by the Gamma Distribution

Recall that the K-L divergence is a measure of the dissimilarity between two distributions. For example, the previous example indicates how well the gamma(a=2) distribution approximates the exponential distribution. As I showed for discrete distributions, you can use the K-L divergence to select the parameters for a distribution that make it the most similar to another distribution. You can do this by using optimization methods, but I will merely plot the K-L divergence as a function of the shape parameter (a) to indicate how the K-L divergence depends on the parameter.

You might recall that the gamma distribution includes the exponential distribution as a special case. When the shape parameter a=1, the gamma(1) distribution is an exponential distribution. Therefore, the K-L divergence between the exponential and the gamma(1) distribution should be zero.

The following program computes the K-L divergence for values of a in the range [0.5, 2] and plots the K-L divergence versus the parameter:

/* Plot KL(Expo, Gamma(a)) for various values of a.
   Note that gamma(a=1) is the exponential distribution. */
aa = do(0.5, 2, 0.01);
KL = j(1, ncol(aa), .);
do i = 1 to ncol(aa);
   a = aa[i];
   call quad(KL_i, "KLDivExpGamma", {0 .P}) peak=0.1;
   KL[i] = KL_i;
end;
title "Kullback-Leibler Divergence Between Exponential and Gamma{a)";
call series(aa, KL) grid={x y} label={'Gamma Shape Parameter (a)' 'K-L Divergence'};

As expected, the graph of the K-L divergence reaches a minimum value at a=1, which is the best approximation to an exponential distribution by the gamma(a) distribution. Note that the K-L divergence equals zero when a=1, which indicates that the distributions are identical when a=1.

Summary

The Kullback-Leibler divergence between two continuous probability distributions is an integral. This article shows how to use the QUAD function in SAS/IML to compute the K-L divergence between two probability distributions.

The post The Kullback–Leibler divergence between continuous probability distributions appeared first on The DO Loop.

5月 202020
 

This article shows how to perform two-dimensional bilinear interpolation in SAS by using a SAS/IML function. It is assumed that you have observed the values of a response variable on a regular grid of locations.

A previous article showed how to interpolate inside one rectangular cell. When you have a grid of cells and a point (x, y) at which to interpolate, the first step is to efficiently locate which cell contains (x, y). You can then use the values at the corners of the cell to interpolate at (x, y), as shown in the previous article. For example, the adjacent graph shows the bilinear interpolation function that is defined by 12 points on a 4 x 3 grid.

You can download the SAS program that creates the analyses and graphs in this article.

Bilinear interpolation assumptions

For two-dimensional linear interpolation of points, the following sets of numbers are the inputs for bilinear interpolation:

  1. Grid locations: Let x1 < x2 < ... < xp be the horizontal coordinates of a regular grid. Let y1 < y2 < ... < yn be the vertical coordinates.
  2. Response Values: Let Z be an n x p matrix of response values on the grid. The (j, i)th value of Z is the response value at the location (xi, yj). Notice that the columns of Z correspond to the horizontal positions for the X values and the rows of Z correspond to the vertical positions for the Y values. (This might seem "backwards" to you.) The response values data should not contain any missing values. The X, Y, and Z values (called the "fitting data") define the bilinear interpolation function on the rectangle [x1, xp] x [y1, yn].
  3. Values to score: Let (s1, t1), (s2, t2), ..., (sk, tk) be a set of k new (x, y) values at which you want to interpolate. All values must be within the range of the data: x1 ≤ si ≤ xp and y1 ≤ si ≤ yn for all (i,j) pairs. These values are called the "scoring data" or the "query data."

The goal of interpolation is to estimate the response at each location (si, tj) based on the values at the corner of the cell that contains the location. I have written a SAS/IML function called BilinearInterp, which you can freely download and use. The following statements indicate how to define and store the bilinearInterp function:

proc iml;
/* xGrd : vector of Nx points along x axis
   yGrd : vector of Ny points along y axis
   z    : Ny x Nx matrix. The value Z[i,j] is the value at xGrd[j] and yGrd[i]
   t    : k x 2 matrix of points at which to interpolate
   The function returns a k x 1 vector, which is the bilinear interpolation at each row of t. 
*/
start bilinearInterp(xgrd, ygrd, z, t);
   /* download function from 
      https://github.com/sascommunities/the-do-loop-blog/blob/master/interpolation/bilinearInterp.sas 
   */
finish;
store module=bilinearInterp;       /* store the module so it can be loaded later */
QUIT;

You only need to store the module one time. You then load the module in any SAS/IML program that needs to use it. You can learn more about storing and loading SAS/IML modules.

The structure of the fitting data

The fitting data for bilinear interpolation consists of the grid of points (Z) and the X and Y locations of each grid point. The X and Y values do not need to be evenly spaced, but they can be.

Often the fitting data are stored in "long form" as triplets of (X, Y, Z) values. If so, the first step is to read the triplets and convert them into a matrix of Z values where the columns are associated with the X values and the rows are associated with the Y values. For example, the following example data set specifies a response variable (Z) on a 4 x 3 grid. The values of the grid in the Y direction are {0, 1, 1.5, 3} and the values in the X direction are {1, 2, 5}.

As explained in the previous article, the data must be sorted by Y and then by X. The following statements read the data into a SAS/IML matrix and extract the grid points in the X and Y direction:

data Have;
input x y z;
datalines;
1 0   6 
2 0   7 
5 0   5 
1 1   9 
2 1   9 
5 1   7 
1 1.5 10 
2 1.5 9 
5 1.5 6 
1 3   12 
2 3   9 
5 3   8 
;
 
/* If the data are not already sorted, sort by Y and X */
proc sort data=Have; by Y X; run; 
 
proc iml;
use Have; read all var {x y z}; close;
xGrd = unique(x);                      /* find grid points */                     
yGrd = unique(y);
z = shape(z, ncol(yGrd), ncol(xGrd)); /* data must be sorted by y, then x */
print z[r=(char(yGrd)) c=(char(xGrd))];

Again, note that the columns of Z represent the X values and the rows represent the Y values. The xGrd and yGrd vectors contain the grid points along the horizontal and vertical dimensions of the grid, respectively.

The structure of the scoring data

The bilinearInterp function assumes that the points at which to interpolate are stored in a k x 2 matrix. Each row is an (x,y) location at which to interpolate from the fitting data. If an (x,y) location is outside of the fitting data, then the bilinearInterp function returns a missing value. The scoring locations do not need to be sorted.

The following example specifies six valid points and two invalid interpolation points:

t = {0   0,     /* not in data range */
     1   1,
     1   2,
     2.5 2,
     4   0.5,
     4   2,
     5   3,
     6   3};    /* not in data range */
 
/* LOAD the previously stored function */
load module=bilinearInterp;  
F = bilinearInterp(xGrd, yGrd, z, t);
print t[c={'x' 'y'}] F[format=Best4.];

The grid is defined on the rectangle [1,5] x [0,3]. The first and last rows of t are outside of the rectangle, so the interpolation returns missing values at those locations. The locations (1,1) and (5,3) are grid locations (corners of a cell), and the interpolation returns the correct response values. The other locations are in one of the grid cells. The interpolation is found by scaling the appropriate rectangle onto the unit square and applying the bilinear interpolation formula, as shown in the previous article.

Visualizing the bilinear interpolation function

You can use the ExpandGrid function to generate a fine grid of locations at which to interpolate. You can visualize the bilinear interpolation function by using a heat map or by using a surface plot.

The heat map visualization is shown at the top of this article. The colors indicate the interpolated values of the response variable. The markers indicate the observed values of the response variable. The reference lines show the grid that is defined by the fitting data.

The following graph visualizes the interpolation function by using a surface plot. The function is piecewise quadratic. Within each cell, it is linear on lines of constant X and constant Y. The "ridge lines" on the surface correspond to the locations where two adjacent grid cells meet.

Summary

In summary, you can perform bilinear interpolation in SAS by using a SAS/IML function. The function uses fitting data (X, Y, and Z locations on a grid) to interpolate. Inside each cell, the interpolation is quadratic and is linear on lines of constant X and constant Y. For each location point, the interpolation function first determines what cell the point is in. It then uses the corners of the cell to interpolate at the point. You can download the SAS program that performs bilinear interpolation and generates the tables and graphs in this article.

The post Bilinear interpolation in SAS appeared first on The DO Loop.

5月 182020
 

I've previously written about linear interpolation in one dimension. Bilinear interpolation is a method for two-dimensional interpolation on a rectangle. If the value of a function is known at the four corners of a rectangle, an interpolation scheme gives you a way to estimate the function at any point in the rectangle's interior. Bilinear interpolation is a weighted average of the values at the four corners of the rectangle. For an (x,y) position inside the rectangle, the weights are determined by the distance between the point and the corners. Corners that are closer to the point get more weight. The generic result is a saddle-shaped quadratic function, as shown in the contour plot to the right.

The Wikipedia article on bilinear interpolation provides a lot of formulas, but the article is needlessly complicated. The only important formula is how to interpolate on the unit square [0,1] x [0,1]. To interpolate on any other rectangle, simply map your rectangle onto the unit square and do the interpolation there. This trick works for bilinear interpolation because the weighted average depends only on the relative position of a point and the corners of the rectangle. Given a rectangle with lower-left corner (x0, y0) and upper right corner (x1, y1), you can map it into the unit square by using the transformation (x, y) → (u, v) where u = (x-x0)/(x1-x0) and v = (y-y0)/(y1-y0).

Bilinear interpolation on the unit square

Suppose that you want to interpolate on the unit square. Assume you know the values of a continuous function at the corners of the square:

  • z00 is the function value at (0,0), the lower left corner
  • z10 is the function value at (1,0), the lower right corner
  • z01 is the function value at (0,1), the upper left corner
  • z11 is the function value at (1,1), the upper right corner

If (x,y) is any point inside the unit square, the interpolation at that point is the following weighted average of the values at the four corners:
F(x,y) = z00*(1-x)*(1-y) + z10*x*(1-y) + z01*(1-x)*y + z11*x*y

Notice that the interpolant is linear with respect to the values of the corners of the square. It is quadratic with respect to the location of the interpolation point. The interpolant is linear along every horizontal line (lines of constant y) and along every vertical line (lines of constant x).

Implement bilinear interpolation on the unit square in SAS

Suppose that the function values at the corners of a unit square are z00 = 0, z10 = 4, z01 = 2, and z11 = 1. For these values, the bilinear interpolant on the unit square is shown at the top of this article. Notice that the value of the interpolant at the corners agrees with the data values. Notice also that the interpolant is linear along each edge of the square. It is not easy to see that the interpolant is linear along each horizontal and vertical line, but that is true.

In SAS, you can use the SAS/IML matrix language to define a function that performs bilinear interpolation on the unit square. The function takes two arguments:

  • z is a four-element that specifies the values of the data at the corners of the square. The values are specified in the following order: lower left, lower right, upper left, and upper right. This is the "fitting data."
  • t is a k x 2 matrix, where each row specifies a point at which to interpolate. This is the "scoring data."

In a matrix language such as SAS/IML, the function can be written in vectorized form without using any loops:

proc iml;
start bilinearInterpSquare(z, t);
   z00 = z[1]; z10 = z[2]; z01 = z[3]; z11 = z[4]; 
   x = t[,1];  y = t[,2];
   cx = 1-x;                    /* cx = complement of x */
   return( (z00*cx + z10*x)#(1-y) + 
           (z01*cx + z11*x)#y );
finish;
 
/* specify the fitting data */
z = {0 4,                       /* bottom: values at (0,0) and (1,0) */
     2 1};                      /* top: values at (0,1) and (1,1) */
print z[c={'x=0' 'x=1'} r={'y=0' 'y=1'}];
 
t = {0.5 0,                     /* specify the scoring data */
     0    0.25,
     1    0.66666666,
     0.5  0.75,
     0.75 0.5     };
F = bilinearInterpSquare(z, t); /* test the bilinear interpolation function */
print t[c={'x' 'y'} format=FRACT.] F;

For points on the edge of the square, you can check a few values in your head:

  • The point (1/2, 0) is halfway between the corners with values 0 and 4. Therefore the interpolation gives 2.
  • The point (0, 1/4) is a quarter of the way between the corners with values 0 and 2. Therefore the interpolation gives 0.5.
  • The point (1, 2/3) is two thirds of the way between the corners with values 4 and 1. Therefore the interpolation gives 2.

The other points are in the interior of the square. The interpolant at those points is a weighted average of the corner values.

Visualize the interpolant function

Let's use the bilinearInterpSquare function to evaluate a grid of points. You can use the ExpandGrid function in SAS/IML to generate a grid of points. When we look at the values of the interpolant on a grid or in a heat map, we want the X values to change for each column and the Y values to change for each row. This means that you should reverse the usual order of the arguments to the ExpandGrid function:

/* for visualization, reverse the arguments to ExpandGrid and then swap 1st and 2nd columns of t */
xGrd = do(0, 1, 0.2);
yGrd = do(0, 1, 0.2);
t = ExpandGrid(yGrd, xGrd);  /* want X changing fastest; put in 2nd col */
t = t[ ,{2 1}];              /* reverse the columns from (y,x) to (x,y) */
F = bilinearInterpSquare(z, t);
Q = shape(F, ncol(yGrd), ncol(xGrd));
print Q[r=(char(YGrd,3)) c=(char(xGrd,3)) label="Bilinear Interpolant"];

The table shows the interpolant evaluated at the coordinates of a regular 6 x 6 grid. The column headers give the coordinate of the X variable and the row headers give the coordinates of the Y variable. You can see that each column and each row is an arithmetic sequence, which shows that the interpolant is linear on vertical and horizontal lines.

You can use a finer grid and a heat map to visualize the interpolant as a surface. Notice that in the previous table, the Y values increase as you go down the rows. If you want the Y axis to point up instead of down, you can reverse the rows of the grid (and the labels) before you create a heat map:

/* increase the grid resolution and visualize by using a heat map */
xGrd = do(0, 1, 0.1);
yGrd = do(0, 1, 0.1);
t = ExpandGrid(yGrd, xGrd);  /* want X changing fastest; put in 2nd col */
t = t[ ,{2 1}];              /* reverse the columns from (y,x) to (x,y) */
F = bilinearInterpSquare(z, t);
Q = shape(F, ncol(yGrd), ncol(xGrd));
 
/* currently, the Y axis is pointing down. Flip it and the labels. */
H = Q[ nrow(Q):1, ];
reverseY = yGrd[ ,ncol(yGrd):1];
call heatmapcont(H) range={0 4} displayoutlines=0
     xValues=xGrd yValues=reverseY
     colorramp=palette("Spectral", 7)[,7:1]
     title="Interpolation at Multiple Points in the Unit Square";

The heat map agrees with the contour plot at the top of this article. The heat map also makes it easier to see that the bilinear interpolant is linear on each row and column.

In summary, you can create a short SAS/IML function that performs bilinear interpolation on the unit square. (You can download the SAS program that generates the tables and graphs in this article.) The interpolant is a quadratic saddle-shaped function inside the square. You can use the ideas in this article to create a function that performs bilinear interpolation on an arbitrary grid of fitting data. The next article shows a general function for bilinear interpolation in SAS.

The post What is bilinear interpolation? appeared first on The DO Loop.

5月 182020
 

I've previously written about linear interpolation in one dimension. Bilinear interpolation is a method for two-dimensional interpolation on a rectangle. If the value of a function is known at the four corners of a rectangle, an interpolation scheme gives you a way to estimate the function at any point in the rectangle's interior. Bilinear interpolation is a weighted average of the values at the four corners of the rectangle. For an (x,y) position inside the rectangle, the weights are determined by the distance between the point and the corners. Corners that are closer to the point get more weight. The generic result is a saddle-shaped quadratic function, as shown in the contour plot to the right.

The Wikipedia article on bilinear interpolation provides a lot of formulas, but the article is needlessly complicated. The only important formula is how to interpolate on the unit square [0,1] x [0,1]. To interpolate on any other rectangle, simply map your rectangle onto the unit square and do the interpolation there. This trick works for bilinear interpolation because the weighted average depends only on the relative position of a point and the corners of the rectangle. Given a rectangle with lower-left corner (x0, y0) and upper right corner (x1, y1), you can map it into the unit square by using the transformation (x, y) → (u, v) where u = (x-x0)/(x1-x0) and v = (y-y0)/(y1-y0).

Bilinear interpolation on the unit square

Suppose that you want to interpolate on the unit square. Assume you know the values of a continuous function at the corners of the square:

  • z00 is the function value at (0,0), the lower left corner
  • z10 is the function value at (1,0), the lower right corner
  • z01 is the function value at (0,1), the upper left corner
  • z11 is the function value at (1,1), the upper right corner

If (x,y) is any point inside the unit square, the interpolation at that point is the following weighted average of the values at the four corners:
F(x,y) = z00*(1-x)*(1-y) + z10*x*(1-y) + z01*(1-x)*y + z11*x*y

Notice that the interpolant is linear with respect to the values of the corners of the square. It is quadratic with respect to the location of the interpolation point. The interpolant is linear along every horizontal line (lines of constant y) and along every vertical line (lines of constant x).

Implement bilinear interpolation on the unit square in SAS

Suppose that the function values at the corners of a unit square are z00 = 0, z10 = 4, z01 = 2, and z11 = 1. For these values, the bilinear interpolant on the unit square is shown at the top of this article. Notice that the value of the interpolant at the corners agrees with the data values. Notice also that the interpolant is linear along each edge of the square. It is not easy to see that the interpolant is linear along each horizontal and vertical line, but that is true.

In SAS, you can use the SAS/IML matrix language to define a function that performs bilinear interpolation on the unit square. The function takes two arguments:

  • z is a four-element that specifies the values of the data at the corners of the square. The values are specified in the following order: lower left, lower right, upper left, and upper right. This is the "fitting data."
  • t is a k x 2 matrix, where each row specifies a point at which to interpolate. This is the "scoring data."

In a matrix language such as SAS/IML, the function can be written in vectorized form without using any loops:

proc iml;
start bilinearInterpSquare(z, t);
   z00 = z[1]; z10 = z[2]; z01 = z[3]; z11 = z[4]; 
   x = t[,1];  y = t[,2];
   cx = 1-x;                    /* cx = complement of x */
   return( (z00*cx + z10*x)#(1-y) + 
           (z01*cx + z11*x)#y );
finish;
 
/* specify the fitting data */
z = {0 4,                       /* bottom: values at (0,0) and (1,0) */
     2 1};                      /* top: values at (0,1) and (1,1) */
print z[c={'x=0' 'x=1'} r={'y=0' 'y=1'}];
 
t = {0.5 0,                     /* specify the scoring data */
     0    0.25,
     1    0.66666666,
     0.5  0.75,
     0.75 0.5     };
F = bilinearInterpSquare(z, t); /* test the bilinear interpolation function */
print t[c={'x' 'y'} format=FRACT.] F;

For points on the edge of the square, you can check a few values in your head:

  • The point (1/2, 0) is halfway between the corners with values 0 and 4. Therefore the interpolation gives 2.
  • The point (0, 1/4) is a quarter of the way between the corners with values 0 and 2. Therefore the interpolation gives 0.5.
  • The point (1, 2/3) is two thirds of the way between the corners with values 4 and 1. Therefore the interpolation gives 2.

The other points are in the interior of the square. The interpolant at those points is a weighted average of the corner values.

Visualize the interpolant function

Let's use the bilinearInterpSquare function to evaluate a grid of points. You can use the ExpandGrid function in SAS/IML to generate a grid of points. When we look at the values of the interpolant on a grid or in a heat map, we want the X values to change for each column and the Y values to change for each row. This means that you should reverse the usual order of the arguments to the ExpandGrid function:

/* for visualization, reverse the arguments to ExpandGrid and then swap 1st and 2nd columns of t */
xGrd = do(0, 1, 0.2);
yGrd = do(0, 1, 0.2);
t = ExpandGrid(yGrd, xGrd);  /* want X changing fastest; put in 2nd col */
t = t[ ,{2 1}];              /* reverse the columns from (y,x) to (x,y) */
F = bilinearInterpSquare(z, t);
Q = shape(F, ncol(yGrd), ncol(xGrd));
print Q[r=(char(YGrd,3)) c=(char(xGrd,3)) label="Bilinear Interpolant"];

The table shows the interpolant evaluated at the coordinates of a regular 6 x 6 grid. The column headers give the coordinate of the X variable and the row headers give the coordinates of the Y variable. You can see that each column and each row is an arithmetic sequence, which shows that the interpolant is linear on vertical and horizontal lines.

You can use a finer grid and a heat map to visualize the interpolant as a surface. Notice that in the previous table, the Y values increase as you go down the rows. If you want the Y axis to point up instead of down, you can reverse the rows of the grid (and the labels) before you create a heat map:

/* increase the grid resolution and visualize by using a heat map */
xGrd = do(0, 1, 0.1);
yGrd = do(0, 1, 0.1);
t = ExpandGrid(yGrd, xGrd);  /* want X changing fastest; put in 2nd col */
t = t[ ,{2 1}];              /* reverse the columns from (y,x) to (x,y) */
F = bilinearInterpSquare(z, t);
Q = shape(F, ncol(yGrd), ncol(xGrd));
 
/* currently, the Y axis is pointing down. Flip it and the labels. */
H = Q[ nrow(Q):1, ];
reverseY = yGrd[ ,ncol(yGrd):1];
call heatmapcont(H) range={0 4} displayoutlines=0
     xValues=xGrd yValues=reverseY
     colorramp=palette("Spectral", 7)[,7:1]
     title="Interpolation at Multiple Points in the Unit Square";

The heat map agrees with the contour plot at the top of this article. The heat map also makes it easier to see that the bilinear interpolant is linear on each row and column.

In summary, you can create a short SAS/IML function that performs bilinear interpolation on the unit square. (You can download the SAS program that generates the tables and graphs in this article.) The interpolant is a quadratic saddle-shaped function inside the square. You can use the ideas in this article to create a function that performs bilinear interpolation on an arbitrary grid of fitting data. The next article shows a general function for bilinear interpolation in SAS.

The post What is bilinear interpolation? appeared first on The DO Loop.

5月 132020
 

This article shows how to find local maxima and maxima on a regression curve, which means finding points where the slope of the curve is zero. An example appears at the right, which shows locations where the loess smoother in a scatter plot has local minima and maxima. Except for simple cases like quadratic regression, you need to use numerical techniques to locate these values.

In a previous article, I showed how to use SAS to evaluate the slope of a regression curve at specific points. The present article applies that technique by scoring the regression curve on a fine grid of point. You can use finite differences to approximate the slope of the curve at each point on the grid. You can then estimate the locations where the slope is zero.

In this article, I use the LOESS procedure to demonstrate the technique, but the method applies equally well to any one-dimensional regression curve. There are several ways to score a regression model:

  • Most parametric regression procedures in SAS (GLM, GLIMMIX, MIXED, ...) support the STORE statement. The STORE statement saves a representation of the model in a SAS item store. You can use PROC PLM to score a model from an item store.
  • Some nonparametric regression procedures in SAS do not support the STORE statement but support a SCORE statement. PROC ADAPTIVEREG and PROC LOESS are two examples. This article shows how to use the SCORE statement to find points where the regression curve has zero slope.
  • Some regression procedures in SAS do not support either the STORE or SCORE statements. For those procedures, you need to use the use the missing value trick to score the model.

The technique in this article will not detect inflection points. An inflection point is a location where the curve has zero slope but is not a local min or max. Consequently, this article is really about "how to find a point where a regression curve has a local extremum," but I will use the slightly inaccurate phrase "find points where the slope is zero."

How to find locations where the slope of the curve is zero?

For convenience, I assume the explanatory variable is named X and the response variable is named Y. The goal is to find locations where a nonparametric curve (x, f(x)) has zero slopes, where f(x) is the regression model. The general outline follows:

  1. Create a grid of points in the range of the explanatory variable. The grid does not have to be evenly spaced, but it is in this example.
  2. Score the model at the grid locations.
  3. Use finite differencing to approximate the slope of the regression curve at the grid points. If the slope changes sign between consecutive grid points, estimate the location between the grid points where the slope is exactly zero. Use linear interpolation to approximate the response at that location.
  4. Optionally, graph the original data, the regression curve, and the point along the curve where the slope is zero.

Example data

SAS distributes the ENSO data set in the SASHelp library. You can create a DATA step view that renames the explanatory and response variables to X and Y, respectively, so that it is easier to follow the logic of the program:

/* Create VIEW where x is the independent variable and y is the response */
data Have / view=Have;
set Sashelp.Enso(rename=(Month=x Pressure=y));
keep x y;
run;

Create a grid of points

After the data set is created, you can use PROC SQL to find the minimum and maximum values of the explanatory variable. You can create an evenly spaced grid of points for the range of the explanatory variable.

/* Put min and max into macro variables */
proc sql noprint;
  select min(x), max(x) into :min_x, :max_x 
  from Have;
quit;
 
/* Evaluate the model and estimate derivatives at these points */
data Grid;
dx = (&max_x - &min_x)/201;    /* choose the step size wisely */
do x = &min_x to &max_x by dx;   
   output;
end;
drop dx;
run;

Score the model at the grid locations

This is the step that will vary from procedure to procedure. You have to know how to use the procedure to score the regression model on the points in the Grid data set. The LOESS procedure supports a SCORE statement, so the call fits the model and scores the model on the Grid data set:

/* Score the model on the grid */
ods select none;    /* do not display the tables */
proc loess data=Have plots=none;
   model y = x;
   score data=Grid;  /* PROC LOESS does not support an OUT= option */
   /* Most procedures support an OUT= option to save the scored values.
      PROC LOESS displays the scored values in a table, so use ODS to
      save the table to an output data set */
   ods output ScoreResults=ScoreOut;
run;
ods select all;

If a procedure supports the STORE statement, you can use PROC PLM to score the model on the data. The SAS program that accompanies this article includes an example that uses the GAMPL procedure. The GAMPL procedure does not support the STORE or SCORE statements, but you can use the missing value trick to find zero derivatives.

Find the locations where the slope is zero

This is the mathematical portion of the computation. You can use a backward difference scheme to estimate the derivative (slope) of the curve. If (x0, y0) and (x1, y1) are two consecutive points along the curve (in the ScoreOut data set), then the slope at (x1, y1) is approximately m = (y1 - y0) / (x1 - x0). When the slope changes sign between consecutive points, it indicates that the slope changed from positive to negative (or vice versa) between the points. If the slope is continuous, it must have been exactly zero somewhere on the interval. You can use a linear approximation to find the point, t, where the slope is zero. You can then use linear interpolation to approximate the point (t, f(t)) at which the curve is a local min or max.

You can use the following SAS DATA step to process the scoring data, approximate the slope, and estimate where the slope of the curve is zero:

/* Compute slope by using finite difference formula. */
data Deriv0;
set ScoreOut;
Slope = dif(p_y) / dif(x);      /* (f(x) - f(x-dx)) / dx */
xPrev = lag(x);  yPrev = lag(p_y);  SlopePrev = lag(Slope);
if n(SlopePrev) AND sign(SlopePrev) ^= sign(Slope) then do;
   /* The slope changes sign between this obs and the previous.
      Assuming linearity on the interval, find (t, f(t))
      where slope is exactly zero */
   t0 = xPrev - SlopePrev * (x - xPrev)/(Slope - SlopePrev); 
   /* use linear interpolation to find the corresponding y value:
      f(t) ~ y0 + (y1-y0)/(x1-x0) * (t - x0)       */
   f_t0 = yPrev + (yPrev - p_y)/(x - xPrev) * (t0 - xPrev);
   if sign(SlopePrev) > 0 then _Type_ = "Max";
   else _Type_ = "Min";
   output; 
end;
keep t0 f_t0 _Type_;
label f_t0 = "f(t0)";
run;
 
proc print data=Deriv0 label;
run;

The table shows that there are seven points at which the derivative of the loess regression curve has a local min or max.

Graph the results

If you want to display the local extreme on the graph of the regression curve, you can concatenate the original data, the regression curve, and the local extreme. You can then use PROC SGPLOT to overlay the three layers. The resulting graph is shown at the top of this article.

data Combine;
merge Have                           /* data   : (x, y)     */
      ScoreOut(rename=(x=t p_y=p_t)) /* curve  : (t, p_t)   */
      Deriv0;                        /* extrema: (t0, f_t0) */
run;
 
title "Loess Smoother";
title2 "Red Markers Indicate Zero Slope for Smoother";
proc sgplot data=Combine noautolegend;
   scatter x=x y=y;
   series x=t y=p_t / lineattrs=GraphData2;
   scatter x=t0 y=f_t0 / markerattrs=(symbol=circlefilled color=red);
   yaxis grid; 
run;

Summary

In summary, if you can evaluate a regression curve on a grid of points, you can approximate the slope at each point along the curve. By looking for when the slope changes sign, you can find local minima and maxima. You can then use a simple linear estimator on the interval to estimate where the slope is exactly zero.

You can download the SAS program that performs the computations in this article.

The post Find points where a regression curve has zero slope appeared first on The DO Loop.

5月 112020
 

I recently showed how to use linear interpolation in SAS. Linear interpolation is a common way to interpolate between a set of planar points, but the interpolating function (the interpolant) is not smooth. If you want a smoother interpolant, you can use cubic spline interpolation. This article describes how to use a cubic spline interpolation in SAS.

As mentioned in my previous post, an interpolation requires two sets of numbers:

  1. Data: Let (x1, y1), (x2, y2), ..., (xn, yn) be a set of n data points. These sample data should not contain any missing values. The data must be ordered so that x1 < x2 < ... < xn.
  2. Values to score: Let {t1, t2, ..., tk} be a set of k new values for the X variable. For interpolation, all values must be within the range of the data: x1 ≤ ti ≤ xn for all i. The goal of interpolation is to produce a new Y value for each value of ti. The scoring data is also called the "query data."

The following SAS DATA steps define the data for this example. The POINTS data set contains the sample data, which are shown as blue markers on the graph to the right. The SCORE data set contains the scoring data, which are shown as red tick marks along the horizontal axis.

/* Example dats for 1-D interpolation */
data Points;  /* these points define the model */
input x y;
datalines;
0  1
1  3
4  5
5  4
7  6
8  3
10 3
;
 
data Score; /* these points are to be interpolated */
input t @@;
datalines;
2 -1 4.8 0 0.5 1 9 5.3 7.1 10.5 9
;

On the graph, the blue curve is the cubic spline interpolant. Every point that you interpolate will be on that curve. The red asterisks are the interpolated values for the values in the SCORE data set. Notice that points -1 and 10.5 are not interpolated because they are outside of the data range. The following section shows how to compute the cubic spline interpolation in SAS.

Cubic spline interpolation in SAS

A linear interpolation uses a linear function on each interval between the data points. In general, the linear segments do not meet smoothly: the resulting interpolant is continuous but not smooth. In contrast, spline interpolation uses a polynomial function on each interval and chooses the polynomials so that the interpolant is smooth where adjacent polynomials meet. For polynomials of degree k, you can match the first k – 1 derivatives at each data point.

A cubic spline is composed of piecewise cubic polynomials whose first and second derivatives match at each data point. Typically, the second derivatives at the minimum and maximum of the data are set to zero. This kind of spline is known as a "natural cubic spline" with knots placed at each data point.

I have previously shown how use the SPLINE call in SAS/IML to compute a smoothing spline. A smoothing spline is not an interpolant because it does not pass through the original data points. However, you can get interpolation by using the SMOOTH=0 option. Adding the TYPE='zero' option results in a natural cubic spline.

For more control over the interpolation, you can use the SPLINEC function ('C' for coefficients) to fit the cubic splines to the data and obtain a matrix of coefficients. You can then use that matrix in the SPLINEV function ('V' for value) to evaluate the interpolant at the locations in the scoring data.

The following SAS/IML function (CubicInterp) computes the spline coefficients from the sample data and then interpolates the scoring data. The details of the computation are provided in the comments, but you do not need to know the details in order to use the function to interpolate data:

/* Cubic interpolating spline in SAS.
   The interpolation is based on the values (x1,y1), (x2,y2), ..., (xn, yn).
   The X  values must be nonmissing and in increasing order: x1 < x2 < ... < xn
   The values of the t vector are interpolated.
*/
proc iml;
start CubicInterp(x, y, t);
   d = dif(x, 1, 1);                     /* check that x[i+1] > x[i] */
   if any(d<=0) then stop "ERROR: x values must be nonmissing and strictly increasing.";
   idx = loc(t>=min(x) && t<=max(x));    /* check for valid scoring values */
   if ncol(idx)=0 then stop "ERROR: No values of t are inside the range of x.";
 
   /* fit the cubic model to the data */
   call splinec(splPred, coeff, endSlopes, x||y) smooth=0 type="zero";
 
   p = j(nrow(t)*ncol(t), 1, .);       /* allocate output (prediction) vector */
   call sortndx(ndx, colvec(t));       /* SPLINEV wants sorted data, so get sort index for t */
   sort_t = t[ndx];                    /* sorted version of t */
   sort_pred = splinev(coeff, sort_t); /* evaluate model at (sorted) points of t */
   p[ndx] = sort_pred[,2];             /* "unsort" by using the inverse sort index */
   return( p );
finish;
 
/* example of linear interpolation in SAS */
use Points; read all var {'x' 'y'}; close;
use Score; read all var 't'; close;
 
pred = CubicInterp(x, y, t);
create PRED var {'t' 'pred'}; append; close;
QUIT;

The visualization of the interpolation is similar to the code in the previous article, so the code is not shown here. However, you can download the SAS program that performs the cubic interpolation and creates the graph at the top of this article.

Although cubic spline interpolation is slower than linear interpolation, it is still fast: The CubicInterp program takes about 0.75 seconds to fit 1000 data points and interpolate one million scoring values.

Summary

In summary, the SAS/IML language provides the computational tools for cubic spline interpolation. The CubicInterp function in this article encapsulates the functionality so that you can perform cubic spline interpolation of your data in an efficient manner.

The post Cubic spline interpolation in SAS appeared first on The DO Loop.

5月 042020
 

SAS programmers sometimes ask about ways to perform one-dimensional linear interpolation in SAS. This article shows three ways to perform linear interpolation in SAS: PROC IML (in SAS/IML software), PROC EXPAND (in SAS/ETS software), and PROC TRANSREG (in SAS/STAT software). Of these, PROC IML Is the simplest to use and has the most flexibility. This article shows how to implement an efficient 1-D linear interpolation algorithm in SAS. You can download the SAS program that creates the analyses and graphs in this article.

Linear interpolation assumptions

For one-dimensional linear interpolation of points in the plane, you need two sets of numbers:

  1. Data: Let (x1, y1), (x2, y2), ..., (xn, yn) be a set of n data points. The data should not contain any missing values. The data must be ordered so that x1 < x2 < ... < xn. These values uniquely define the linear interpolation function on [x1, xn]. I call this the "sample data" or "fitting data" because it is used to create the linear interpolation model.
  2. Values to score: Let {t1, t2, ..., tk} be a set of k new values for the X variable. For interpolation, all values must be within the range of the data: x1 ≤ ti ≤ xn for all i. The goal of interpolation is to produce a new Y value for each value of ti. The scoring data is also called the "query data."

Interpolation requires a model. For linear interpolation, the model is the unique piecewise linear function that passes through each sample point and is linear on each interval [xi, xi+1]. The model is usually undefined outside of the range of the data, although there are various (nonunique) ways to extrapolate the model beyond the range of the data. You fit the model by using the data. You then score the model on the set of new values.

The following SAS data sets define example data for linear interpolation. The POINTS data set contains fitting data that define the linear model. The SCORE data set contains the new query points at which we want to interpolate. The linear interpolation is shown to the right. The sample data are shown as blue markers. The model is shown as blue lines. The query values to score are shown as a fringe plot along the X axis. The interpolated values are shown as red markers.

The data used for this example are:

/* Example data for 1-D interpolation */
data Points;  /* these points define the model */
input x y;
datalines;
0  1
1  3
4  5
5  4
7  6
8  3
10 3
;
 
data Score; /* these points are to be interpolated */
input t @@;
datalines;
2 -1 4.8 0 0.5 1 9 5.3 7.1 10.5 9
;

For convenience, the fitting data are already sorted by the X variable, which is in the range [0, 10]. The scoring data set does not need to be sorted.

The scoring data for this example contains five special values:

  • Two scoring values (-1 and 10.5) are outside of the range of the data. An interpolation algorithm should return a missing value for these values. (Otherwise, it is extrapolation.)
  • Two scoring values (0 and 1) are duplicates of X values in the data. Ideally, this should not present a problem for the interpolation algorithm.
  • The value 9 appears twice in the scoring data.

Linear interpolation in SAS by using PROC IML

As is often the case, PROC IML enables you to implement a custom algorithm in only a few lines of code. For simplicity, suppose you have a single value, t, that you want to interpolate, based on the data (x1, y1), (x2, y2), ..., (xn, yn). The main steps for linear interpolation are:

  1. Check that the X values are nonmissing and in increasing order: x1 < x2 < ... < xn. Check that t is in the range [x1, xn]. If not, return a missing value.
  2. Find the first interval that contains t. You can use the BIN function in SAS/IML to find the first value i for which x_i <= t <= x_{i+1}.
  3. Define the left and right endpoint of the interval: xL = x_i and xR = x_{i+1}. Define the corresponding response values: yL = y_i and yR = y_{i+1}.
  4. Let f = (t - xL) / (xR - xL) be the proportion of the interval to the left of t. Then p = (1 - f)*yL + f*yR is the linear interpolation at t.

The steps are implemented in the following SAS/IML function. The function accepts a vector of scoring values, t. Notice that the program does not contain any loops over the elements of t. All statements and operations are vectorized, which is very efficient.

/* Linear interpolation based on the values (x1,y1), (x2,y2),....
   The X  values must be nonmissing and in increasing order: x1 < x2 < ... < xn
   The values of the t vector are linearly interpolated.
*/
proc iml;
start LinInterp(x, y, _t);
   d = dif(x, 1, 1);                     /* check that x[i+1] > x[i] */
   if any(d<=0) then stop "ERROR: x values must be nonmissing and strictly increasing.";
   idx = loc(_t>=min(x) && _t<=max(x));  /* check for valid scoring values */
   if ncol(idx)=0 then stop "ERROR: No values of t are inside the range of x.";
 
   p = j(nrow(_t)*ncol(_t), 1, .);     /* allocate output (prediction) vector */
   t = _t[idx];                        /* subset t values inside range(x) */
   k = bin(t, x);                      /* find interval [x_i, x_{i+1}] that contains s */
   xL = x[k];   yL = y[k];             /* find (xL, yL) and (xR, yR) */
   xR = x[k+1]; yR = y[k+1];
   f = (t - xL) / (xR - xL);           /* f = fraction of interval [xL, xR] */
   p[idx] = (1 - f)#yL + f#yR;        /* interpolate between yL and yR */
   return( p );
finish;
 
/* example of linear interpolation in SAS */
use Points; read all var {'x' 'y'}; close;
use Score; read all var 't'; close;
 
pred = LinInterp(x, y, t);
create PRED var {'t' 'pred'}; append; close;
QUIT;

Visualize a linear interpolation in SAS

The previous program writes the interpolated values to the PRED data set. You can concatenate the original data and the interpolated values to visualize the linear interpolation:

/* Visualize: concatenate data and predicted (interpolated) values */
data All;
set Points Pred;
run;
 
title "Linear Interpolation";
title2 "No Extrapolation";
proc sgplot data=All noautolegend;
   series x=x y=y;
   scatter x=x y=y / markerattrs=(symbol=CircleFilled size=12)
                    name="data" legendlabel="Data";
   scatter x=t y=Pred / markerattrs=(symbol=asterisk size=12 color=red)
                    name="interp" legendlabel="Interpolated Values";
   fringe t / lineattrs=(color=red thickness=2)
                    name="score" legendlabel="Values to Score";
   xaxis grid values=(0 to 10) valueshint label="X";
   yaxis grid label="Y" offsetmin=0.05;
   keylegend "data" "score" "interp";
run;

The graph is shown at the top of this article. A few noteworthy items:

  • The values -1 and 10.5 are not scored because they are outside the range of the data.
  • The values 0 and 1 correspond to a data point. The interpolated value is the corresponding data value.
  • The other values are interpolated onto the straight line segments that connect the data.

Performance of the IML algorithm

The IML algorithm is very fast. On my Windows PC (Pentium i7), the interpolation takes only 0.2 seconds for 1,000 data points and one million scoring values. For 10,000 data points and one million scoring values, the interpolation takes about 0.25 seconds. The SAS program that accompanies this article includes timing code.

Other SAS procedures that can perform linear interpolation

According to a SAS Usage Note, you can perform linear interpolation in SAS by using PROC EXPAND in SAS/ETS software or PROC TRANSREG in SAS/STAT software. Each has some limitations that I don't like:

  • Both procedures use the missing value trick to perform the fitting and scoring in a single call. That means you must concatenate the sample data (which is often small) and the query data (which can be large).
  • PROC EXPAND requires that the combined data be sorted by X. That can be easily accomplished by calling PROC SORT after you concatenate the sample and query data. However, PROC EXPAND does not support duplicate X values! For me, this makes PROC EXPAND unusable. It means that you cannot score the model at points that are in the original data, nor can you have repeated values in the scoring data.
  • If you use PROC TRANSREG for linear interpolation, you must know the number of sample data points, n. You must specify n – 2 on the NKNOTS= option on a SPLINE transformation. Usually, this means that you must perform an extra step (DATA step or PROC MEANS) and store n – 2 in a macro variable.
  • For scoring values outside the range of the data, PROC EXPAND returns a missing value. However, PROC TRANSREG extrapolates. If t < x1, then the extrapolated value at t is y1. Similarly, if t > xn, then the extrapolated value at t is yn.

I've included examples of using PROC EXPAND and PROC TRANSREG in the SAS program that accompanies this article. You can use these procedures for linear interpolation, but neither is as convenient as PROC IML.

With effort, you can use Base SAS routines such as the DATA step to implement a linear interpolation algorithm. An example is provided by KSharp in a SAS Support Community thread.

Summary

If you want to perform linear interpolation in SAS, the easiest and most efficient method is PROC IML. I have provided a SAS/IML function that implements linear interpolation by using two input vectors (x and y) to define the model and one input vector (t) to specify the points at which to interpolate. The function returns the interpolated values for t.

The post Linear interpolation in SAS appeared first on The DO Loop.

5月 042020
 

SAS programmers sometimes ask about ways to perform one-dimensional linear interpolation in SAS. This article shows three ways to perform linear interpolation in SAS: PROC IML (in SAS/IML software), PROC EXPAND (in SAS/ETS software), and PROC TRANSREG (in SAS/STAT software). Of these, PROC IML Is the simplest to use and has the most flexibility. This article shows how to implement an efficient 1-D linear interpolation algorithm in SAS. You can download the SAS program that creates the analyses and graphs in this article.

Linear interpolation assumptions

For one-dimensional linear interpolation of points in the plane, you need two sets of numbers:

  1. Data: Let (x1, y1), (x2, y2), ..., (xn, yn) be a set of n data points. The data should not contain any missing values. The data must be ordered so that x1 < x2 < ... < xn. These values uniquely define the linear interpolation function on [x1, xn]. I call this the "sample data" or "fitting data" because it is used to create the linear interpolation model.
  2. Values to score: Let {t1, t2, ..., tk} be a set of k new values for the X variable. For interpolation, all values must be within the range of the data: x1 ≤ ti ≤ xn for all i. The goal of interpolation is to produce a new Y value for each value of ti. The scoring data is also called the "query data."

Interpolation requires a model. For linear interpolation, the model is the unique piecewise linear function that passes through each sample point and is linear on each interval [xi, xi+1]. The model is usually undefined outside of the range of the data, although there are various (nonunique) ways to extrapolate the model beyond the range of the data. You fit the model by using the data. You then score the model on the set of new values.

The following SAS data sets define example data for linear interpolation. The POINTS data set contains fitting data that define the linear model. The SCORE data set contains the new query points at which we want to interpolate. The linear interpolation is shown to the right. The sample data are shown as blue markers. The model is shown as blue lines. The query values to score are shown as a fringe plot along the X axis. The interpolated values are shown as red markers.

The data used for this example are:

/* Example data for 1-D interpolation */
data Points;  /* these points define the model */
input x y;
datalines;
0  1
1  3
4  5
5  4
7  6
8  3
10 3
;
 
data Score; /* these points are to be interpolated */
input t @@;
datalines;
2 -1 4.8 0 0.5 1 9 5.3 7.1 10.5 9
;

For convenience, the fitting data are already sorted by the X variable, which is in the range [0, 10]. The scoring data set does not need to be sorted.

The scoring data for this example contains five special values:

  • Two scoring values (-1 and 10.5) are outside of the range of the data. An interpolation algorithm should return a missing value for these values. (Otherwise, it is extrapolation.)
  • Two scoring values (0 and 1) are duplicates of X values in the data. Ideally, this should not present a problem for the interpolation algorithm.
  • The value 9 appears twice in the scoring data.

Linear interpolation in SAS by using PROC IML

As is often the case, PROC IML enables you to implement a custom algorithm in only a few lines of code. For simplicity, suppose you have a single value, t, that you want to interpolate, based on the data (x1, y1), (x2, y2), ..., (xn, yn). The main steps for linear interpolation are:

  1. Check that the X values are nonmissing and in increasing order: x1 < x2 < ... < xn. Check that t is in the range [x1, xn]. If not, return a missing value.
  2. Find the first interval that contains t. You can use the BIN function in SAS/IML to find the first value i for which x_i <= t <= x_{i+1}.
  3. Define the left and right endpoint of the interval: xL = x_i and xR = x_{i+1}. Define the corresponding response values: yL = y_i and yR = y_{i+1}.
  4. Let f = (t - xL) / (xR - xL) be the proportion of the interval to the left of t. Then p = (1 - f)*yL + f*yR is the linear interpolation at t.

The steps are implemented in the following SAS/IML function. The function accepts a vector of scoring values, t. Notice that the program does not contain any loops over the elements of t. All statements and operations are vectorized, which is very efficient.

/* Linear interpolation based on the values (x1,y1), (x2,y2),....
   The X  values must be nonmissing and in increasing order: x1 < x2 < ... < xn
   The values of the t vector are linearly interpolated.
*/
proc iml;
start LinInterp(x, y, _t);
   d = dif(x, 1, 1);                     /* check that x[i+1] > x[i] */
   if any(d<=0) then stop "ERROR: x values must be nonmissing and strictly increasing.";
   idx = loc(_t>=min(x) && _t<=max(x));  /* check for valid scoring values */
   if ncol(idx)=0 then stop "ERROR: No values of t are inside the range of x.";
 
   p = j(nrow(_t)*ncol(_t), 1, .);     /* allocate output (prediction) vector */
   t = _t[idx];                        /* subset t values inside range(x) */
   k = bin(t, x);                      /* find interval [x_i, x_{i+1}] that contains s */
   xL = x[k];   yL = y[k];             /* find (xL, yL) and (xR, yR) */
   xR = x[k+1]; yR = y[k+1];
   f = (t - xL) / (xR - xL);           /* f = fraction of interval [xL, xR] */
   p[idx] = (1 - f)#yL + f#yR;        /* interpolate between yL and yR */
   return( p );
finish;
 
/* example of linear interpolation in SAS */
use Points; read all var {'x' 'y'}; close;
use Score; read all var 't'; close;
 
pred = LinInterp(x, y, t);
create PRED var {'t' 'pred'}; append; close;
QUIT;

Visualize a linear interpolation in SAS

The previous program writes the interpolated values to the PRED data set. You can concatenate the original data and the interpolated values to visualize the linear interpolation:

/* Visualize: concatenate data and predicted (interpolated) values */
data All;
set Points Pred;
run;
 
title "Linear Interpolation";
title2 "No Extrapolation";
proc sgplot data=All noautolegend;
   series x=x y=y;
   scatter x=x y=y / markerattrs=(symbol=CircleFilled size=12)
                    name="data" legendlabel="Data";
   scatter x=t y=Pred / markerattrs=(symbol=asterisk size=12 color=red)
                    name="interp" legendlabel="Interpolated Values";
   fringe t / lineattrs=(color=red thickness=2)
                    name="score" legendlabel="Values to Score";
   xaxis grid values=(0 to 10) valueshint label="X";
   yaxis grid label="Y" offsetmin=0.05;
   keylegend "data" "score" "interp";
run;

The graph is shown at the top of this article. A few noteworthy items:

  • The values -1 and 10.5 are not scored because they are outside the range of the data.
  • The values 0 and 1 correspond to a data point. The interpolated value is the corresponding data value.
  • The other values are interpolated onto the straight line segments that connect the data.

Performance of the IML algorithm

The IML algorithm is very fast. On my Windows PC (Pentium i7), the interpolation takes only 0.2 seconds for 1,000 data points and one million scoring values. For 10,000 data points and one million scoring values, the interpolation takes about 0.25 seconds. The SAS program that accompanies this article includes timing code.

Other SAS procedures that can perform linear interpolation

According to a SAS Usage Note, you can perform linear interpolation in SAS by using PROC EXPAND in SAS/ETS software or PROC TRANSREG in SAS/STAT software. Each has some limitations that I don't like:

  • Both procedures use the missing value trick to perform the fitting and scoring in a single call. That means you must concatenate the sample data (which is often small) and the query data (which can be large).
  • PROC EXPAND requires that the combined data be sorted by X. That can be easily accomplished by calling PROC SORT after you concatenate the sample and query data. However, PROC EXPAND does not support duplicate X values! For me, this makes PROC EXPAND unusable. It means that you cannot score the model at points that are in the original data, nor can you have repeated values in the scoring data.
  • If you use PROC TRANSREG for linear interpolation, you must know the number of sample data points, n. You must specify n – 2 on the NKNOTS= option on a SPLINE transformation. Usually, this means that you must perform an extra step (DATA step or PROC MEANS) and store n – 2 in a macro variable.
  • For scoring values outside the range of the data, PROC EXPAND returns a missing value. However, PROC TRANSREG extrapolates. If t < x1, then the extrapolated value at t is y1. Similarly, if t > xn, then the extrapolated value at t is yn.

I've included examples of using PROC EXPAND and PROC TRANSREG in the SAS program that accompanies this article. You can use these procedures for linear interpolation, but neither is as convenient as PROC IML.

With effort, you can use Base SAS routines such as the DATA step to implement a linear interpolation algorithm. An example is provided by KSharp in a SAS Support Community thread.

Summary

If you want to perform linear interpolation in SAS, the easiest and most efficient method is PROC IML. I have provided a SAS/IML function that implements linear interpolation by using two input vectors (x and y) to define the model and one input vector (t) to specify the points at which to interpolate. The function returns the interpolated values for t.

The post Linear interpolation in SAS appeared first on The DO Loop.

4月 272020
 

I've previously written about how to generate points that are uniformly distributed in the unit disk. A seemingly unrelated topic is the distribution of eigenvalues (in the complex plane) of various kinds of random matrices. However, I recently learned that these topics are somewhat related!

A mathematical result called the "Circular Law" states that the (complex) eigenvalues of a (scaled) random n x n matrix are uniformly distributed in a disk as n → ∞. For this article, a random matrix is one whose entries are independent random variates from a specified distribution that has mean 0 and unit variance. A more precise statement of the circular law is found in the link, but, loosely speaking, if An is a sequence of random matrices of increasing size, the distribution of the eigenvalues of (1/sqrt(n)) An converges to the uniform distribution in the unit disk as n → ∞.

This article uses SAS to visualize the eigenvalue of several large random matrices. The graph to the right shows the complex eigenvalues for an n x n matrix whose entries are normal random variates from N(0, 1/sqrt(n)), where n = 2000. These eigenvalues appear to be uniformly distributed, as the circular law dictates, and most are inside the unit disk.

To demonstrate the circular law, you can fill a large n x n matrix with independent variates from any probability distribution that has mean 0 and unit variance. The matrix is then scaled by dividing each entry by sqrt(n). Equivalently, you can save a step and fill the matrix with random variates from a distribution that has 0 mean and standard deviation 1/sqrt(n).

Eigenvalues of random matrices

Let me be perfectly clear: There are simple and efficient algorithms to simulate uniformly distributed variates in the unit disk. No one should use the eigenvalue of a large matrix as a way to simulate uniform variates in a disk!

The following SAS/IML program creates a random 2000 x 2000 matrix where each element is a random variate from the N(0, 1/sqrt(n)) distribution, which has mean 0 and standard deviation 1/sqrt(n). It then computes the eigenvalues (which are complex) and writes them to a data set. Because I intend to repeat these computations for other distributions, I use a macro variable (DSName) to hold the name of the variable and I use a macro (RandCall) to specify the IML statement that fills a matrix, M, with random variates:

/* Demonstrate the circular law: eigenvalues are uniform in unit disk */
title "Eigenvalues of n x n Matrix";
 
/* title, data set name, and IML statement for a distribution */
title2 "Elements ~ N(0, 1/sqrt(n))";
%let DSName = Normal;
%macro RandCall;
   call randgen(M, "Normal", 0, 1/sqrt(n));
%mend;
 
proc iml;
call randseed(12345);
n = 2000;                            /* sample size = dim of matrix */
M = j(n, n);
%RandCall;
v = eigval(M);
create &DSName from v[c={"Re" "Im"}]; append from v; close; 
quit;

To visualize the eigenvalues, you can create a scatter plot that displays the real part (Re) and the imaginary part (Im) of each eigenvalue. I want to overlay a unit circle on the graph, so the following DATA step creates a piecewise linear approximation of a circle and uses the POLYGON statement in PROC SGPLOT to overlay the eigenvalues and the circle:

data Circle;
retain ID 1  k 100;
pi = constant('pi');
do theta = 0 to 2*pi by 2*pi/k;
   px = cos(theta);   py = sin(theta);   output;
end;
keep ID px py;
run;
 
data All;
set &DSName Circle;
label Re = "Real Part of Eigenvalue"
      Im = "Imaginary Part of Eigenvalue"
run;
 
ods graphics / width=480px height=480px;
proc sgplot data=All noautolegend;
   polygon x=px y=py ID=ID;   /* the unit circle */
   scatter x=Re y=Im;         /* the complex eigenvalues */
   xaxis grid; yaxis grid;
quit;

The graph is shown at the top of this article. It certainly appears that the eigenvalues of the 2000 x 2000 matrix are uniformly distributed. However, if you look carefully, a few eigenvalues appear to be outside of the unit disk. You can compute the distance from the origin for each eigenvalue (Re2 + Im2) and compare that distance to 1 to count how many of the 2000 eigenvalues are outside the unit circle:

data InOut;
set &DSName;
InDisk = (Re**2 + Im**2 <= 1);   /* indicator variable */
run;
 
proc freq data=InOut order=freq;
   tables InDisk / nocum ;
run;

For this random matrix, 12 eigenvalues are not inside the unit disk. Thus, although the eigenvalues appear to be approximately uniformly distributed, it is only in the limit as n → ∞ that all eigenvalues are inside the closed disk.

Other probability distributions

The remarkable thing about the circular law is that it is true for ANY probability distribution that has mean 0 and standard deviation 1/sqrt(n). To demonstrate, you can graph the eigenvalues for matrices that are filled with random variates from two other probability distributions:

  • U(-sqrt(3/n), sqrt(3/n)): For a uniform distribution on [a, b], the mean is b - a and the standard deviation is (b - a) / sqrt(12).
  • A discrete distribution for the random variable X/sqrt(n), where X can take on the values -3, 0, and 1 with probabilities 1/12, 8/12, and 3/12, respectively. For a discrete distribution, the mean is μ = Σi piXi and the variance is Σi pi(Xi - μ)2, where pi = P(X=xi).

To plot the eigenvalues of the matrix filled with random variates from the uniform distribution, you can modify the macro variables in the previous program:

/* matrix of random uniform variates */
title2 "Elements ~ U(-sqrt(3/n), sqrt(3/n))";
%let DSName = Uniform;
%macro RandCall;
   call randgen(M, "Uniform", -sqrt(3/n), sqrt(3/n));
%mend;

To plot the eigenvalues of the matrix filled with random variates from the discrete distribution, modify the macro variables as follows:

/* matrix of discrete random uniform variates */
title2 "Elements ~ P{X=-3)=1/12, P(X=0)=8/12, P(X=1)=3/12";
%let DSName = Discrete;
%macro RandCall;
   call randgen(M, "Table", {0.0833333, 0.6666667, 0.25});  /* returns values 1, 2, 3 */
   vals = {-3, 0, 1} / sqrt(n);     /* substitute values from discrete distribution */
   M = shape( vals[M], n, n );      /* reshape into n x n matrix */
%mend;

The graph for each matrix is shown below. To my eye, they look similar. I had wondered whether I would be able to determine which eigenvalues came from a matrix that has only three discrete entries (-3/2000, 0, and 1/2000). It is amazing that the spectrum of a matrix that has only three unique elements is so similar to the spectrum of a matrix in which almost every element is distinct!

In summary, I used SAS to visualize the Circular Law, which is an interesting theorem about the distribution of the eigenvalues of a random matrix. The theorem states that the distribution converges to the uniform distribution on the unit disk. The elements of the matrix are independent random variates from any probability distribution that has mean 0 and standard deviation 1/sqrt(n). The SAS/IML language makes it convenient to create the random matrices and compute their eigenvalues.

The post The circular law for eigenvalues appeared first on The DO Loop.