Numerical Analysis

5月 202020
 

This article shows how to perform two-dimensional bilinear interpolation in SAS by using a SAS/IML function. It is assumed that you have observed the values of a response variable on a regular grid of locations.

A previous article showed how to interpolate inside one rectangular cell. When you have a grid of cells and a point (x, y) at which to interpolate, the first step is to efficiently locate which cell contains (x, y). You can then use the values at the corners of the cell to interpolate at (x, y), as shown in the previous article. For example, the adjacent graph shows the bilinear interpolation function that is defined by 12 points on a 4 x 3 grid.

You can download the SAS program that creates the analyses and graphs in this article.

Bilinear interpolation assumptions

For two-dimensional linear interpolation of points, the following sets of numbers are the inputs for bilinear interpolation:

  1. Grid locations: Let x1 < x2 < ... < xp be the horizontal coordinates of a regular grid. Let y1 < y2 < ... < yn be the vertical coordinates.
  2. Response Values: Let Z be an n x p matrix of response values on the grid. The (j, i)th value of Z is the response value at the location (xi, yj). Notice that the columns of Z correspond to the horizontal positions for the X values and the rows of Z correspond to the vertical positions for the Y values. (This might seem "backwards" to you.) The response values data should not contain any missing values. The X, Y, and Z values (called the "fitting data") define the bilinear interpolation function on the rectangle [x1, xp] x [y1, yn].
  3. Values to score: Let (s1, t1), (s2, t2), ..., (sk, tk) be a set of k new (x, y) values at which you want to interpolate. All values must be within the range of the data: x1 ≤ si ≤ xp and y1 ≤ si ≤ yn for all (i,j) pairs. These values are called the "scoring data" or the "query data."

The goal of interpolation is to estimate the response at each location (si, tj) based on the values at the corner of the cell that contains the location. I have written a SAS/IML function called BilinearInterp, which you can freely download and use. The following statements indicate how to define and store the bilinearInterp function:

proc iml;
/* xGrd : vector of Nx points along x axis
   yGrd : vector of Ny points along y axis
   z    : Ny x Nx matrix. The value Z[i,j] is the value at xGrd[j] and yGrd[i]
   t    : k x 2 matrix of points at which to interpolate
   The function returns a k x 1 vector, which is the bilinear interpolation at each row of t. 
*/
start bilinearInterp(xgrd, ygrd, z, t);
   /* download function from 
      https://github.com/sascommunities/the-do-loop-blog/blob/master/interpolation/bilinearInterp.sas 
   */
finish;
store module=bilinearInterp;       /* store the module so it can be loaded later */
QUIT;

You only need to store the module one time. You then load the module in any SAS/IML program that needs to use it. You can learn more about storing and loading SAS/IML modules.

The structure of the fitting data

The fitting data for bilinear interpolation consists of the grid of points (Z) and the X and Y locations of each grid point. The X and Y values do not need to be evenly spaced, but they can be.

Often the fitting data are stored in "long form" as triplets of (X, Y, Z) values. If so, the first step is to read the triplets and convert them into a matrix of Z values where the columns are associated with the X values and the rows are associated with the Y values. For example, the following example data set specifies a response variable (Z) on a 4 x 3 grid. The values of the grid in the Y direction are {0, 1, 1.5, 3} and the values in the X direction are {1, 2, 5}.

As explained in the previous article, the data must be sorted by Y and then by X. The following statements read the data into a SAS/IML matrix and extract the grid points in the X and Y direction:

data Have;
input x y z;
datalines;
1 0   6 
2 0   7 
5 0   5 
1 1   9 
2 1   9 
5 1   7 
1 1.5 10 
2 1.5 9 
5 1.5 6 
1 3   12 
2 3   9 
5 3   8 
;
 
/* If the data are not already sorted, sort by Y and X */
proc sort data=Have; by Y X; run; 
 
proc iml;
use Have; read all var {x y z}; close;
xGrd = unique(x);                      /* find grid points */                     
yGrd = unique(y);
z = shape(z, ncol(yGrd), ncol(xGrd)); /* data must be sorted by y, then x */
print z[r=(char(yGrd)) c=(char(xGrd))];

Again, note that the columns of Z represent the X values and the rows represent the Y values. The xGrd and yGrd vectors contain the grid points along the horizontal and vertical dimensions of the grid, respectively.

The structure of the scoring data

The bilinearInterp function assumes that the points at which to interpolate are stored in a k x 2 matrix. Each row is an (x,y) location at which to interpolate from the fitting data. If an (x,y) location is outside of the fitting data, then the bilinearInterp function returns a missing value. The scoring locations do not need to be sorted.

The following example specifies six valid points and two invalid interpolation points:

t = {0   0,     /* not in data range */
     1   1,
     1   2,
     2.5 2,
     4   0.5,
     4   2,
     5   3,
     6   3};    /* not in data range */
 
/* LOAD the previously stored function */
load module=bilinearInterp;  
F = bilinearInterp(xGrd, yGrd, z, t);
print t[c={'x' 'y'}] F[format=Best4.];

The grid is defined on the rectangle [1,5] x [0,3]. The first and last rows of t are outside of the rectangle, so the interpolation returns missing values at those locations. The locations (1,1) and (5,3) are grid locations (corners of a cell), and the interpolation returns the correct response values. The other locations are in one of the grid cells. The interpolation is found by scaling the appropriate rectangle onto the unit square and applying the bilinear interpolation formula, as shown in the previous article.

Visualizing the bilinear interpolation function

You can use the ExpandGrid function to generate a fine grid of locations at which to interpolate. You can visualize the bilinear interpolation function by using a heat map or by using a surface plot.

The heat map visualization is shown at the top of this article. The colors indicate the interpolated values of the response variable. The markers indicate the observed values of the response variable. The reference lines show the grid that is defined by the fitting data.

The following graph visualizes the interpolation function by using a surface plot. The function is piecewise quadratic. Within each cell, it is linear on lines of constant X and constant Y. The "ridge lines" on the surface correspond to the locations where two adjacent grid cells meet.

Summary

In summary, you can perform bilinear interpolation in SAS by using a SAS/IML function. The function uses fitting data (X, Y, and Z locations on a grid) to interpolate. Inside each cell, the interpolation is quadratic and is linear on lines of constant X and constant Y. For each location point, the interpolation function first determines what cell the point is in. It then uses the corners of the cell to interpolate at the point. You can download the SAS program that performs bilinear interpolation and generates the tables and graphs in this article.

The post Bilinear interpolation in SAS appeared first on The DO Loop.

5月 182020
 

I've previously written about linear interpolation in one dimension. Bilinear interpolation is a method for two-dimensional interpolation on a rectangle. If the value of a function is known at the four corners of a rectangle, an interpolation scheme gives you a way to estimate the function at any point in the rectangle's interior. Bilinear interpolation is a weighted average of the values at the four corners of the rectangle. For an (x,y) position inside the rectangle, the weights are determined by the distance between the point and the corners. Corners that are closer to the point get more weight. The generic result is a saddle-shaped quadratic function, as shown in the contour plot to the right.

The Wikipedia article on bilinear interpolation provides a lot of formulas, but the article is needlessly complicated. The only important formula is how to interpolate on the unit square [0,1] x [0,1]. To interpolate on any other rectangle, simply map your rectangle onto the unit square and do the interpolation there. This trick works for bilinear interpolation because the weighted average depends only on the relative position of a point and the corners of the rectangle. Given a rectangle with lower-left corner (x0, y0) and upper right corner (x1, y1), you can map it into the unit square by using the transformation (x, y) → (u, v) where u = (x-x0)/(x1-x0) and v = (y-y0)/(y1-y0).

Bilinear interpolation on the unit square

Suppose that you want to interpolate on the unit square. Assume you know the values of a continuous function at the corners of the square:

  • z00 is the function value at (0,0), the lower left corner
  • z10 is the function value at (1,0), the lower right corner
  • z01 is the function value at (0,1), the upper left corner
  • z11 is the function value at (1,1), the upper right corner

If (x,y) is any point inside the unit square, the interpolation at that point is the following weighted average of the values at the four corners:
F(x,y) = z00*(1-x)*(1-y) + z10*x*(1-y) + z01*(1-x)*y + z11*x*y

Notice that the interpolant is linear with respect to the values of the corners of the square. It is quadratic with respect to the location of the interpolation point. The interpolant is linear along every horizontal line (lines of constant y) and along every vertical line (lines of constant x).

Implement bilinear interpolation on the unit square in SAS

Suppose that the function values at the corners of a unit square are z00 = 0, z10 = 4, z01 = 2, and z11 = 1. For these values, the bilinear interpolant on the unit square is shown at the top of this article. Notice that the value of the interpolant at the corners agrees with the data values. Notice also that the interpolant is linear along each edge of the square. It is not easy to see that the interpolant is linear along each horizontal and vertical line, but that is true.

In SAS, you can use the SAS/IML matrix language to define a function that performs bilinear interpolation on the unit square. The function takes two arguments:

  • z is a four-element that specifies the values of the data at the corners of the square. The values are specified in the following order: lower left, lower right, upper left, and upper right. This is the "fitting data."
  • t is a k x 2 matrix, where each row specifies a point at which to interpolate. This is the "scoring data."

In a matrix language such as SAS/IML, the function can be written in vectorized form without using any loops:

proc iml;
start bilinearInterpSquare(z, t);
   z00 = z[1]; z10 = z[2]; z01 = z[3]; z11 = z[4]; 
   x = t[,1];  y = t[,2];
   cx = 1-x;                    /* cx = complement of x */
   return( (z00*cx + z10*x)#(1-y) + 
           (z01*cx + z11*x)#y );
finish;
 
/* specify the fitting data */
z = {0 4,                       /* bottom: values at (0,0) and (1,0) */
     2 1};                      /* top: values at (0,1) and (1,1) */
print z[c={'x=0' 'x=1'} r={'y=0' 'y=1'}];
 
t = {0.5 0,                     /* specify the scoring data */
     0    0.25,
     1    0.66666666,
     0.5  0.75,
     0.75 0.5     };
F = bilinearInterpSquare(z, t); /* test the bilinear interpolation function */
print t[c={'x' 'y'} format=FRACT.] F;

For points on the edge of the square, you can check a few values in your head:

  • The point (1/2, 0) is halfway between the corners with values 0 and 4. Therefore the interpolation gives 2.
  • The point (0, 1/4) is a quarter of the way between the corners with values 0 and 2. Therefore the interpolation gives 0.5.
  • The point (1, 2/3) is two thirds of the way between the corners with values 4 and 1. Therefore the interpolation gives 2.

The other points are in the interior of the square. The interpolant at those points is a weighted average of the corner values.

Visualize the interpolant function

Let's use the bilinearInterpSquare function to evaluate a grid of points. You can use the ExpandGrid function in SAS/IML to generate a grid of points. When we look at the values of the interpolant on a grid or in a heat map, we want the X values to change for each column and the Y values to change for each row. This means that you should reverse the usual order of the arguments to the ExpandGrid function:

/* for visualization, reverse the arguments to ExpandGrid and then swap 1st and 2nd columns of t */
xGrd = do(0, 1, 0.2);
yGrd = do(0, 1, 0.2);
t = ExpandGrid(yGrd, xGrd);  /* want X changing fastest; put in 2nd col */
t = t[ ,{2 1}];              /* reverse the columns from (y,x) to (x,y) */
F = bilinearInterpSquare(z, t);
Q = shape(F, ncol(yGrd), ncol(xGrd));
print Q[r=(char(YGrd,3)) c=(char(xGrd,3)) label="Bilinear Interpolant"];

The table shows the interpolant evaluated at the coordinates of a regular 6 x 6 grid. The column headers give the coordinate of the X variable and the row headers give the coordinates of the Y variable. You can see that each column and each row is an arithmetic sequence, which shows that the interpolant is linear on vertical and horizontal lines.

You can use a finer grid and a heat map to visualize the interpolant as a surface. Notice that in the previous table, the Y values increase as you go down the rows. If you want the Y axis to point up instead of down, you can reverse the rows of the grid (and the labels) before you create a heat map:

/* increase the grid resolution and visualize by using a heat map */
xGrd = do(0, 1, 0.1);
yGrd = do(0, 1, 0.1);
t = ExpandGrid(yGrd, xGrd);  /* want X changing fastest; put in 2nd col */
t = t[ ,{2 1}];              /* reverse the columns from (y,x) to (x,y) */
F = bilinearInterpSquare(z, t);
Q = shape(F, ncol(yGrd), ncol(xGrd));
 
/* currently, the Y axis is pointing down. Flip it and the labels. */
H = Q[ nrow(Q):1, ];
reverseY = yGrd[ ,ncol(yGrd):1];
call heatmapcont(H) range={0 4} displayoutlines=0
     xValues=xGrd yValues=reverseY
     colorramp=palette("Spectral", 7)[,7:1]
     title="Interpolation at Multiple Points in the Unit Square";

The heat map agrees with the contour plot at the top of this article. The heat map also makes it easier to see that the bilinear interpolant is linear on each row and column.

In summary, you can create a short SAS/IML function that performs bilinear interpolation on the unit square. (You can download the SAS program that generates the tables and graphs in this article.) The interpolant is a quadratic saddle-shaped function inside the square. You can use the ideas in this article to create a function that performs bilinear interpolation on an arbitrary grid of fitting data. The next article shows a general function for bilinear interpolation in SAS.

The post What is bilinear interpolation? appeared first on The DO Loop.

5月 182020
 

I've previously written about linear interpolation in one dimension. Bilinear interpolation is a method for two-dimensional interpolation on a rectangle. If the value of a function is known at the four corners of a rectangle, an interpolation scheme gives you a way to estimate the function at any point in the rectangle's interior. Bilinear interpolation is a weighted average of the values at the four corners of the rectangle. For an (x,y) position inside the rectangle, the weights are determined by the distance between the point and the corners. Corners that are closer to the point get more weight. The generic result is a saddle-shaped quadratic function, as shown in the contour plot to the right.

The Wikipedia article on bilinear interpolation provides a lot of formulas, but the article is needlessly complicated. The only important formula is how to interpolate on the unit square [0,1] x [0,1]. To interpolate on any other rectangle, simply map your rectangle onto the unit square and do the interpolation there. This trick works for bilinear interpolation because the weighted average depends only on the relative position of a point and the corners of the rectangle. Given a rectangle with lower-left corner (x0, y0) and upper right corner (x1, y1), you can map it into the unit square by using the transformation (x, y) → (u, v) where u = (x-x0)/(x1-x0) and v = (y-y0)/(y1-y0).

Bilinear interpolation on the unit square

Suppose that you want to interpolate on the unit square. Assume you know the values of a continuous function at the corners of the square:

  • z00 is the function value at (0,0), the lower left corner
  • z10 is the function value at (1,0), the lower right corner
  • z01 is the function value at (0,1), the upper left corner
  • z11 is the function value at (1,1), the upper right corner

If (x,y) is any point inside the unit square, the interpolation at that point is the following weighted average of the values at the four corners:
F(x,y) = z00*(1-x)*(1-y) + z10*x*(1-y) + z01*(1-x)*y + z11*x*y

Notice that the interpolant is linear with respect to the values of the corners of the square. It is quadratic with respect to the location of the interpolation point. The interpolant is linear along every horizontal line (lines of constant y) and along every vertical line (lines of constant x).

Implement bilinear interpolation on the unit square in SAS

Suppose that the function values at the corners of a unit square are z00 = 0, z10 = 4, z01 = 2, and z11 = 1. For these values, the bilinear interpolant on the unit square is shown at the top of this article. Notice that the value of the interpolant at the corners agrees with the data values. Notice also that the interpolant is linear along each edge of the square. It is not easy to see that the interpolant is linear along each horizontal and vertical line, but that is true.

In SAS, you can use the SAS/IML matrix language to define a function that performs bilinear interpolation on the unit square. The function takes two arguments:

  • z is a four-element that specifies the values of the data at the corners of the square. The values are specified in the following order: lower left, lower right, upper left, and upper right. This is the "fitting data."
  • t is a k x 2 matrix, where each row specifies a point at which to interpolate. This is the "scoring data."

In a matrix language such as SAS/IML, the function can be written in vectorized form without using any loops:

proc iml;
start bilinearInterpSquare(z, t);
   z00 = z[1]; z10 = z[2]; z01 = z[3]; z11 = z[4]; 
   x = t[,1];  y = t[,2];
   cx = 1-x;                    /* cx = complement of x */
   return( (z00*cx + z10*x)#(1-y) + 
           (z01*cx + z11*x)#y );
finish;
 
/* specify the fitting data */
z = {0 4,                       /* bottom: values at (0,0) and (1,0) */
     2 1};                      /* top: values at (0,1) and (1,1) */
print z[c={'x=0' 'x=1'} r={'y=0' 'y=1'}];
 
t = {0.5 0,                     /* specify the scoring data */
     0    0.25,
     1    0.66666666,
     0.5  0.75,
     0.75 0.5     };
F = bilinearInterpSquare(z, t); /* test the bilinear interpolation function */
print t[c={'x' 'y'} format=FRACT.] F;

For points on the edge of the square, you can check a few values in your head:

  • The point (1/2, 0) is halfway between the corners with values 0 and 4. Therefore the interpolation gives 2.
  • The point (0, 1/4) is a quarter of the way between the corners with values 0 and 2. Therefore the interpolation gives 0.5.
  • The point (1, 2/3) is two thirds of the way between the corners with values 4 and 1. Therefore the interpolation gives 2.

The other points are in the interior of the square. The interpolant at those points is a weighted average of the corner values.

Visualize the interpolant function

Let's use the bilinearInterpSquare function to evaluate a grid of points. You can use the ExpandGrid function in SAS/IML to generate a grid of points. When we look at the values of the interpolant on a grid or in a heat map, we want the X values to change for each column and the Y values to change for each row. This means that you should reverse the usual order of the arguments to the ExpandGrid function:

/* for visualization, reverse the arguments to ExpandGrid and then swap 1st and 2nd columns of t */
xGrd = do(0, 1, 0.2);
yGrd = do(0, 1, 0.2);
t = ExpandGrid(yGrd, xGrd);  /* want X changing fastest; put in 2nd col */
t = t[ ,{2 1}];              /* reverse the columns from (y,x) to (x,y) */
F = bilinearInterpSquare(z, t);
Q = shape(F, ncol(yGrd), ncol(xGrd));
print Q[r=(char(YGrd,3)) c=(char(xGrd,3)) label="Bilinear Interpolant"];

The table shows the interpolant evaluated at the coordinates of a regular 6 x 6 grid. The column headers give the coordinate of the X variable and the row headers give the coordinates of the Y variable. You can see that each column and each row is an arithmetic sequence, which shows that the interpolant is linear on vertical and horizontal lines.

You can use a finer grid and a heat map to visualize the interpolant as a surface. Notice that in the previous table, the Y values increase as you go down the rows. If you want the Y axis to point up instead of down, you can reverse the rows of the grid (and the labels) before you create a heat map:

/* increase the grid resolution and visualize by using a heat map */
xGrd = do(0, 1, 0.1);
yGrd = do(0, 1, 0.1);
t = ExpandGrid(yGrd, xGrd);  /* want X changing fastest; put in 2nd col */
t = t[ ,{2 1}];              /* reverse the columns from (y,x) to (x,y) */
F = bilinearInterpSquare(z, t);
Q = shape(F, ncol(yGrd), ncol(xGrd));
 
/* currently, the Y axis is pointing down. Flip it and the labels. */
H = Q[ nrow(Q):1, ];
reverseY = yGrd[ ,ncol(yGrd):1];
call heatmapcont(H) range={0 4} displayoutlines=0
     xValues=xGrd yValues=reverseY
     colorramp=palette("Spectral", 7)[,7:1]
     title="Interpolation at Multiple Points in the Unit Square";

The heat map agrees with the contour plot at the top of this article. The heat map also makes it easier to see that the bilinear interpolant is linear on each row and column.

In summary, you can create a short SAS/IML function that performs bilinear interpolation on the unit square. (You can download the SAS program that generates the tables and graphs in this article.) The interpolant is a quadratic saddle-shaped function inside the square. You can use the ideas in this article to create a function that performs bilinear interpolation on an arbitrary grid of fitting data. The next article shows a general function for bilinear interpolation in SAS.

The post What is bilinear interpolation? appeared first on The DO Loop.

5月 132020
 

This article shows how to find local maxima and maxima on a regression curve, which means finding points where the slope of the curve is zero. An example appears at the right, which shows locations where the loess smoother in a scatter plot has local minima and maxima. Except for simple cases like quadratic regression, you need to use numerical techniques to locate these values.

In a previous article, I showed how to use SAS to evaluate the slope of a regression curve at specific points. The present article applies that technique by scoring the regression curve on a fine grid of point. You can use finite differences to approximate the slope of the curve at each point on the grid. You can then estimate the locations where the slope is zero.

In this article, I use the LOESS procedure to demonstrate the technique, but the method applies equally well to any one-dimensional regression curve. There are several ways to score a regression model:

  • Most parametric regression procedures in SAS (GLM, GLIMMIX, MIXED, ...) support the STORE statement. The STORE statement saves a representation of the model in a SAS item store. You can use PROC PLM to score a model from an item store.
  • Some nonparametric regression procedures in SAS do not support the STORE statement but support a SCORE statement. PROC ADAPTIVEREG and PROC LOESS are two examples. This article shows how to use the SCORE statement to find points where the regression curve has zero slope.
  • Some regression procedures in SAS do not support either the STORE or SCORE statements. For those procedures, you need to use the use the missing value trick to score the model.

The technique in this article will not detect inflection points. An inflection point is a location where the curve has zero slope but is not a local min or max. Consequently, this article is really about "how to find a point where a regression curve has a local extremum," but I will use the slightly inaccurate phrase "find points where the slope is zero."

How to find locations where the slope of the curve is zero?

For convenience, I assume the explanatory variable is named X and the response variable is named Y. The goal is to find locations where a nonparametric curve (x, f(x)) has zero slopes, where f(x) is the regression model. The general outline follows:

  1. Create a grid of points in the range of the explanatory variable. The grid does not have to be evenly spaced, but it is in this example.
  2. Score the model at the grid locations.
  3. Use finite differencing to approximate the slope of the regression curve at the grid points. If the slope changes sign between consecutive grid points, estimate the location between the grid points where the slope is exactly zero. Use linear interpolation to approximate the response at that location.
  4. Optionally, graph the original data, the regression curve, and the point along the curve where the slope is zero.

Example data

SAS distributes the ENSO data set in the SASHelp library. You can create a DATA step view that renames the explanatory and response variables to X and Y, respectively, so that it is easier to follow the logic of the program:

/* Create VIEW where x is the independent variable and y is the response */
data Have / view=Have;
set Sashelp.Enso(rename=(Month=x Pressure=y));
keep x y;
run;

Create a grid of points

After the data set is created, you can use PROC SQL to find the minimum and maximum values of the explanatory variable. You can create an evenly spaced grid of points for the range of the explanatory variable.

/* Put min and max into macro variables */
proc sql noprint;
  select min(x), max(x) into :min_x, :max_x 
  from Have;
quit;
 
/* Evaluate the model and estimate derivatives at these points */
data Grid;
dx = (&max_x - &min_x)/201;    /* choose the step size wisely */
do x = &min_x to &max_x by dx;   
   output;
end;
drop dx;
run;

Score the model at the grid locations

This is the step that will vary from procedure to procedure. You have to know how to use the procedure to score the regression model on the points in the Grid data set. The LOESS procedure supports a SCORE statement, so the call fits the model and scores the model on the Grid data set:

/* Score the model on the grid */
ods select none;    /* do not display the tables */
proc loess data=Have plots=none;
   model y = x;
   score data=Grid;  /* PROC LOESS does not support an OUT= option */
   /* Most procedures support an OUT= option to save the scored values.
      PROC LOESS displays the scored values in a table, so use ODS to
      save the table to an output data set */
   ods output ScoreResults=ScoreOut;
run;
ods select all;

If a procedure supports the STORE statement, you can use PROC PLM to score the model on the data. The SAS program that accompanies this article includes an example that uses the GAMPL procedure. The GAMPL procedure does not support the STORE or SCORE statements, but you can use the missing value trick to find zero derivatives.

Find the locations where the slope is zero

This is the mathematical portion of the computation. You can use a backward difference scheme to estimate the derivative (slope) of the curve. If (x0, y0) and (x1, y1) are two consecutive points along the curve (in the ScoreOut data set), then the slope at (x1, y1) is approximately m = (y1 - y0) / (x1 - x0). When the slope changes sign between consecutive points, it indicates that the slope changed from positive to negative (or vice versa) between the points. If the slope is continuous, it must have been exactly zero somewhere on the interval. You can use a linear approximation to find the point, t, where the slope is zero. You can then use linear interpolation to approximate the point (t, f(t)) at which the curve is a local min or max.

You can use the following SAS DATA step to process the scoring data, approximate the slope, and estimate where the slope of the curve is zero:

/* Compute slope by using finite difference formula. */
data Deriv0;
set ScoreOut;
Slope = dif(p_y) / dif(x);      /* (f(x) - f(x-dx)) / dx */
xPrev = lag(x);  yPrev = lag(p_y);  SlopePrev = lag(Slope);
if n(SlopePrev) AND sign(SlopePrev) ^= sign(Slope) then do;
   /* The slope changes sign between this obs and the previous.
      Assuming linearity on the interval, find (t, f(t))
      where slope is exactly zero */
   t0 = xPrev - SlopePrev * (x - xPrev)/(Slope - SlopePrev); 
   /* use linear interpolation to find the corresponding y value:
      f(t) ~ y0 + (y1-y0)/(x1-x0) * (t - x0)       */
   f_t0 = yPrev + (yPrev - p_y)/(x - xPrev) * (t0 - xPrev);
   if sign(SlopePrev) > 0 then _Type_ = "Max";
   else _Type_ = "Min";
   output; 
end;
keep t0 f_t0 _Type_;
label f_t0 = "f(t0)";
run;
 
proc print data=Deriv0 label;
run;

The table shows that there are seven points at which the derivative of the loess regression curve has a local min or max.

Graph the results

If you want to display the local extreme on the graph of the regression curve, you can concatenate the original data, the regression curve, and the local extreme. You can then use PROC SGPLOT to overlay the three layers. The resulting graph is shown at the top of this article.

data Combine;
merge Have                           /* data   : (x, y)     */
      ScoreOut(rename=(x=t p_y=p_t)) /* curve  : (t, p_t)   */
      Deriv0;                        /* extrema: (t0, f_t0) */
run;
 
title "Loess Smoother";
title2 "Red Markers Indicate Zero Slope for Smoother";
proc sgplot data=Combine noautolegend;
   scatter x=x y=y;
   series x=t y=p_t / lineattrs=GraphData2;
   scatter x=t0 y=f_t0 / markerattrs=(symbol=circlefilled color=red);
   yaxis grid; 
run;

Summary

In summary, if you can evaluate a regression curve on a grid of points, you can approximate the slope at each point along the curve. By looking for when the slope changes sign, you can find local minima and maxima. You can then use a simple linear estimator on the interval to estimate where the slope is exactly zero.

You can download the SAS program that performs the computations in this article.

The post Find points where a regression curve has zero slope appeared first on The DO Loop.

5月 112020
 

I recently showed how to use linear interpolation in SAS. Linear interpolation is a common way to interpolate between a set of planar points, but the interpolating function (the interpolant) is not smooth. If you want a smoother interpolant, you can use cubic spline interpolation. This article describes how to use a cubic spline interpolation in SAS.

As mentioned in my previous post, an interpolation requires two sets of numbers:

  1. Data: Let (x1, y1), (x2, y2), ..., (xn, yn) be a set of n data points. These sample data should not contain any missing values. The data must be ordered so that x1 < x2 < ... < xn.
  2. Values to score: Let {t1, t2, ..., tk} be a set of k new values for the X variable. For interpolation, all values must be within the range of the data: x1 ≤ ti ≤ xn for all i. The goal of interpolation is to produce a new Y value for each value of ti. The scoring data is also called the "query data."

The following SAS DATA steps define the data for this example. The POINTS data set contains the sample data, which are shown as blue markers on the graph to the right. The SCORE data set contains the scoring data, which are shown as red tick marks along the horizontal axis.

/* Example dats for 1-D interpolation */
data Points;  /* these points define the model */
input x y;
datalines;
0  1
1  3
4  5
5  4
7  6
8  3
10 3
;
 
data Score; /* these points are to be interpolated */
input t @@;
datalines;
2 -1 4.8 0 0.5 1 9 5.3 7.1 10.5 9
;

On the graph, the blue curve is the cubic spline interpolant. Every point that you interpolate will be on that curve. The red asterisks are the interpolated values for the values in the SCORE data set. Notice that points -1 and 10.5 are not interpolated because they are outside of the data range. The following section shows how to compute the cubic spline interpolation in SAS.

Cubic spline interpolation in SAS

A linear interpolation uses a linear function on each interval between the data points. In general, the linear segments do not meet smoothly: the resulting interpolant is continuous but not smooth. In contrast, spline interpolation uses a polynomial function on each interval and chooses the polynomials so that the interpolant is smooth where adjacent polynomials meet. For polynomials of degree k, you can match the first k – 1 derivatives at each data point.

A cubic spline is composed of piecewise cubic polynomials whose first and second derivatives match at each data point. Typically, the second derivatives at the minimum and maximum of the data are set to zero. This kind of spline is known as a "natural cubic spline" with knots placed at each data point.

I have previously shown how use the SPLINE call in SAS/IML to compute a smoothing spline. A smoothing spline is not an interpolant because it does not pass through the original data points. However, you can get interpolation by using the SMOOTH=0 option. Adding the TYPE='zero' option results in a natural cubic spline.

For more control over the interpolation, you can use the SPLINEC function ('C' for coefficients) to fit the cubic splines to the data and obtain a matrix of coefficients. You can then use that matrix in the SPLINEV function ('V' for value) to evaluate the interpolant at the locations in the scoring data.

The following SAS/IML function (CubicInterp) computes the spline coefficients from the sample data and then interpolates the scoring data. The details of the computation are provided in the comments, but you do not need to know the details in order to use the function to interpolate data:

/* Cubic interpolating spline in SAS.
   The interpolation is based on the values (x1,y1), (x2,y2), ..., (xn, yn).
   The X  values must be nonmissing and in increasing order: x1 < x2 < ... < xn
   The values of the t vector are interpolated.
*/
proc iml;
start CubicInterp(x, y, t);
   d = dif(x, 1, 1);                     /* check that x[i+1] > x[i] */
   if any(d<=0) then stop "ERROR: x values must be nonmissing and strictly increasing.";
   idx = loc(t>=min(x) && t<=max(x));    /* check for valid scoring values */
   if ncol(idx)=0 then stop "ERROR: No values of t are inside the range of x.";
 
   /* fit the cubic model to the data */
   call splinec(splPred, coeff, endSlopes, x||y) smooth=0 type="zero";
 
   p = j(nrow(t)*ncol(t), 1, .);       /* allocate output (prediction) vector */
   call sortndx(ndx, colvec(t));       /* SPLINEV wants sorted data, so get sort index for t */
   sort_t = t[ndx];                    /* sorted version of t */
   sort_pred = splinev(coeff, sort_t); /* evaluate model at (sorted) points of t */
   p[ndx] = sort_pred[,2];             /* "unsort" by using the inverse sort index */
   return( p );
finish;
 
/* example of linear interpolation in SAS */
use Points; read all var {'x' 'y'}; close;
use Score; read all var 't'; close;
 
pred = CubicInterp(x, y, t);
create PRED var {'t' 'pred'}; append; close;
QUIT;

The visualization of the interpolation is similar to the code in the previous article, so the code is not shown here. However, you can download the SAS program that performs the cubic interpolation and creates the graph at the top of this article.

Although cubic spline interpolation is slower than linear interpolation, it is still fast: The CubicInterp program takes about 0.75 seconds to fit 1000 data points and interpolate one million scoring values.

Summary

In summary, the SAS/IML language provides the computational tools for cubic spline interpolation. The CubicInterp function in this article encapsulates the functionality so that you can perform cubic spline interpolation of your data in an efficient manner.

The post Cubic spline interpolation in SAS appeared first on The DO Loop.

5月 042020
 

SAS programmers sometimes ask about ways to perform one-dimensional linear interpolation in SAS. This article shows three ways to perform linear interpolation in SAS: PROC IML (in SAS/IML software), PROC EXPAND (in SAS/ETS software), and PROC TRANSREG (in SAS/STAT software). Of these, PROC IML Is the simplest to use and has the most flexibility. This article shows how to implement an efficient 1-D linear interpolation algorithm in SAS. You can download the SAS program that creates the analyses and graphs in this article.

Linear interpolation assumptions

For one-dimensional linear interpolation of points in the plane, you need two sets of numbers:

  1. Data: Let (x1, y1), (x2, y2), ..., (xn, yn) be a set of n data points. The data should not contain any missing values. The data must be ordered so that x1 < x2 < ... < xn. These values uniquely define the linear interpolation function on [x1, xn]. I call this the "sample data" or "fitting data" because it is used to create the linear interpolation model.
  2. Values to score: Let {t1, t2, ..., tk} be a set of k new values for the X variable. For interpolation, all values must be within the range of the data: x1 ≤ ti ≤ xn for all i. The goal of interpolation is to produce a new Y value for each value of ti. The scoring data is also called the "query data."

Interpolation requires a model. For linear interpolation, the model is the unique piecewise linear function that passes through each sample point and is linear on each interval [xi, xi+1]. The model is usually undefined outside of the range of the data, although there are various (nonunique) ways to extrapolate the model beyond the range of the data. You fit the model by using the data. You then score the model on the set of new values.

The following SAS data sets define example data for linear interpolation. The POINTS data set contains fitting data that define the linear model. The SCORE data set contains the new query points at which we want to interpolate. The linear interpolation is shown to the right. The sample data are shown as blue markers. The model is shown as blue lines. The query values to score are shown as a fringe plot along the X axis. The interpolated values are shown as red markers.

The data used for this example are:

/* Example data for 1-D interpolation */
data Points;  /* these points define the model */
input x y;
datalines;
0  1
1  3
4  5
5  4
7  6
8  3
10 3
;
 
data Score; /* these points are to be interpolated */
input t @@;
datalines;
2 -1 4.8 0 0.5 1 9 5.3 7.1 10.5 9
;

For convenience, the fitting data are already sorted by the X variable, which is in the range [0, 10]. The scoring data set does not need to be sorted.

The scoring data for this example contains five special values:

  • Two scoring values (-1 and 10.5) are outside of the range of the data. An interpolation algorithm should return a missing value for these values. (Otherwise, it is extrapolation.)
  • Two scoring values (0 and 1) are duplicates of X values in the data. Ideally, this should not present a problem for the interpolation algorithm.
  • The value 9 appears twice in the scoring data.

Linear interpolation in SAS by using PROC IML

As is often the case, PROC IML enables you to implement a custom algorithm in only a few lines of code. For simplicity, suppose you have a single value, t, that you want to interpolate, based on the data (x1, y1), (x2, y2), ..., (xn, yn). The main steps for linear interpolation are:

  1. Check that the X values are nonmissing and in increasing order: x1 < x2 < ... < xn. Check that t is in the range [x1, xn]. If not, return a missing value.
  2. Find the first interval that contains t. You can use the BIN function in SAS/IML to find the first value i for which x_i <= t <= x_{i+1}.
  3. Define the left and right endpoint of the interval: xL = x_i and xR = x_{i+1}. Define the corresponding response values: yL = y_i and yR = y_{i+1}.
  4. Let f = (t - xL) / (xR - xL) be the proportion of the interval to the left of t. Then p = (1 - f)*yL + f*yR is the linear interpolation at t.

The steps are implemented in the following SAS/IML function. The function accepts a vector of scoring values, t. Notice that the program does not contain any loops over the elements of t. All statements and operations are vectorized, which is very efficient.

/* Linear interpolation based on the values (x1,y1), (x2,y2),....
   The X  values must be nonmissing and in increasing order: x1 < x2 < ... < xn
   The values of the t vector are linearly interpolated.
*/
proc iml;
start LinInterp(x, y, _t);
   d = dif(x, 1, 1);                     /* check that x[i+1] > x[i] */
   if any(d<=0) then stop "ERROR: x values must be nonmissing and strictly increasing.";
   idx = loc(_t>=min(x) && _t<=max(x));  /* check for valid scoring values */
   if ncol(idx)=0 then stop "ERROR: No values of t are inside the range of x.";
 
   p = j(nrow(_t)*ncol(_t), 1, .);     /* allocate output (prediction) vector */
   t = _t[idx];                        /* subset t values inside range(x) */
   k = bin(t, x);                      /* find interval [x_i, x_{i+1}] that contains s */
   xL = x[k];   yL = y[k];             /* find (xL, yL) and (xR, yR) */
   xR = x[k+1]; yR = y[k+1];
   f = (t - xL) / (xR - xL);           /* f = fraction of interval [xL, xR] */
   p[idx] = (1 - f)#yL + f#yR;        /* interpolate between yL and yR */
   return( p );
finish;
 
/* example of linear interpolation in SAS */
use Points; read all var {'x' 'y'}; close;
use Score; read all var 't'; close;
 
pred = LinInterp(x, y, t);
create PRED var {'t' 'pred'}; append; close;
QUIT;

Visualize a linear interpolation in SAS

The previous program writes the interpolated values to the PRED data set. You can concatenate the original data and the interpolated values to visualize the linear interpolation:

/* Visualize: concatenate data and predicted (interpolated) values */
data All;
set Points Pred;
run;
 
title "Linear Interpolation";
title2 "No Extrapolation";
proc sgplot data=All noautolegend;
   series x=x y=y;
   scatter x=x y=y / markerattrs=(symbol=CircleFilled size=12)
                    name="data" legendlabel="Data";
   scatter x=t y=Pred / markerattrs=(symbol=asterisk size=12 color=red)
                    name="interp" legendlabel="Interpolated Values";
   fringe t / lineattrs=(color=red thickness=2)
                    name="score" legendlabel="Values to Score";
   xaxis grid values=(0 to 10) valueshint label="X";
   yaxis grid label="Y" offsetmin=0.05;
   keylegend "data" "score" "interp";
run;

The graph is shown at the top of this article. A few noteworthy items:

  • The values -1 and 10.5 are not scored because they are outside the range of the data.
  • The values 0 and 1 correspond to a data point. The interpolated value is the corresponding data value.
  • The other values are interpolated onto the straight line segments that connect the data.

Performance of the IML algorithm

The IML algorithm is very fast. On my Windows PC (Pentium i7), the interpolation takes only 0.2 seconds for 1,000 data points and one million scoring values. For 10,000 data points and one million scoring values, the interpolation takes about 0.25 seconds. The SAS program that accompanies this article includes timing code.

Other SAS procedures that can perform linear interpolation

According to a SAS Usage Note, you can perform linear interpolation in SAS by using PROC EXPAND in SAS/ETS software or PROC TRANSREG in SAS/STAT software. Each has some limitations that I don't like:

  • Both procedures use the missing value trick to perform the fitting and scoring in a single call. That means you must concatenate the sample data (which is often small) and the query data (which can be large).
  • PROC EXPAND requires that the combined data be sorted by X. That can be easily accomplished by calling PROC SORT after you concatenate the sample and query data. However, PROC EXPAND does not support duplicate X values! For me, this makes PROC EXPAND unusable. It means that you cannot score the model at points that are in the original data, nor can you have repeated values in the scoring data.
  • If you use PROC TRANSREG for linear interpolation, you must know the number of sample data points, n. You must specify n – 2 on the NKNOTS= option on a SPLINE transformation. Usually, this means that you must perform an extra step (DATA step or PROC MEANS) and store n – 2 in a macro variable.
  • For scoring values outside the range of the data, PROC EXPAND returns a missing value. However, PROC TRANSREG extrapolates. If t < x1, then the extrapolated value at t is y1. Similarly, if t > xn, then the extrapolated value at t is yn.

I've included examples of using PROC EXPAND and PROC TRANSREG in the SAS program that accompanies this article. You can use these procedures for linear interpolation, but neither is as convenient as PROC IML.

With effort, you can use Base SAS routines such as the DATA step to implement a linear interpolation algorithm. An example is provided by KSharp in a SAS Support Community thread.

Summary

If you want to perform linear interpolation in SAS, the easiest and most efficient method is PROC IML. I have provided a SAS/IML function that implements linear interpolation by using two input vectors (x and y) to define the model and one input vector (t) to specify the points at which to interpolate. The function returns the interpolated values for t.

The post Linear interpolation in SAS appeared first on The DO Loop.

5月 042020
 

SAS programmers sometimes ask about ways to perform one-dimensional linear interpolation in SAS. This article shows three ways to perform linear interpolation in SAS: PROC IML (in SAS/IML software), PROC EXPAND (in SAS/ETS software), and PROC TRANSREG (in SAS/STAT software). Of these, PROC IML Is the simplest to use and has the most flexibility. This article shows how to implement an efficient 1-D linear interpolation algorithm in SAS. You can download the SAS program that creates the analyses and graphs in this article.

Linear interpolation assumptions

For one-dimensional linear interpolation of points in the plane, you need two sets of numbers:

  1. Data: Let (x1, y1), (x2, y2), ..., (xn, yn) be a set of n data points. The data should not contain any missing values. The data must be ordered so that x1 < x2 < ... < xn. These values uniquely define the linear interpolation function on [x1, xn]. I call this the "sample data" or "fitting data" because it is used to create the linear interpolation model.
  2. Values to score: Let {t1, t2, ..., tk} be a set of k new values for the X variable. For interpolation, all values must be within the range of the data: x1 ≤ ti ≤ xn for all i. The goal of interpolation is to produce a new Y value for each value of ti. The scoring data is also called the "query data."

Interpolation requires a model. For linear interpolation, the model is the unique piecewise linear function that passes through each sample point and is linear on each interval [xi, xi+1]. The model is usually undefined outside of the range of the data, although there are various (nonunique) ways to extrapolate the model beyond the range of the data. You fit the model by using the data. You then score the model on the set of new values.

The following SAS data sets define example data for linear interpolation. The POINTS data set contains fitting data that define the linear model. The SCORE data set contains the new query points at which we want to interpolate. The linear interpolation is shown to the right. The sample data are shown as blue markers. The model is shown as blue lines. The query values to score are shown as a fringe plot along the X axis. The interpolated values are shown as red markers.

The data used for this example are:

/* Example data for 1-D interpolation */
data Points;  /* these points define the model */
input x y;
datalines;
0  1
1  3
4  5
5  4
7  6
8  3
10 3
;
 
data Score; /* these points are to be interpolated */
input t @@;
datalines;
2 -1 4.8 0 0.5 1 9 5.3 7.1 10.5 9
;

For convenience, the fitting data are already sorted by the X variable, which is in the range [0, 10]. The scoring data set does not need to be sorted.

The scoring data for this example contains five special values:

  • Two scoring values (-1 and 10.5) are outside of the range of the data. An interpolation algorithm should return a missing value for these values. (Otherwise, it is extrapolation.)
  • Two scoring values (0 and 1) are duplicates of X values in the data. Ideally, this should not present a problem for the interpolation algorithm.
  • The value 9 appears twice in the scoring data.

Linear interpolation in SAS by using PROC IML

As is often the case, PROC IML enables you to implement a custom algorithm in only a few lines of code. For simplicity, suppose you have a single value, t, that you want to interpolate, based on the data (x1, y1), (x2, y2), ..., (xn, yn). The main steps for linear interpolation are:

  1. Check that the X values are nonmissing and in increasing order: x1 < x2 < ... < xn. Check that t is in the range [x1, xn]. If not, return a missing value.
  2. Find the first interval that contains t. You can use the BIN function in SAS/IML to find the first value i for which x_i <= t <= x_{i+1}.
  3. Define the left and right endpoint of the interval: xL = x_i and xR = x_{i+1}. Define the corresponding response values: yL = y_i and yR = y_{i+1}.
  4. Let f = (t - xL) / (xR - xL) be the proportion of the interval to the left of t. Then p = (1 - f)*yL + f*yR is the linear interpolation at t.

The steps are implemented in the following SAS/IML function. The function accepts a vector of scoring values, t. Notice that the program does not contain any loops over the elements of t. All statements and operations are vectorized, which is very efficient.

/* Linear interpolation based on the values (x1,y1), (x2,y2),....
   The X  values must be nonmissing and in increasing order: x1 < x2 < ... < xn
   The values of the t vector are linearly interpolated.
*/
proc iml;
start LinInterp(x, y, _t);
   d = dif(x, 1, 1);                     /* check that x[i+1] > x[i] */
   if any(d<=0) then stop "ERROR: x values must be nonmissing and strictly increasing.";
   idx = loc(_t>=min(x) && _t<=max(x));  /* check for valid scoring values */
   if ncol(idx)=0 then stop "ERROR: No values of t are inside the range of x.";
 
   p = j(nrow(_t)*ncol(_t), 1, .);     /* allocate output (prediction) vector */
   t = _t[idx];                        /* subset t values inside range(x) */
   k = bin(t, x);                      /* find interval [x_i, x_{i+1}] that contains s */
   xL = x[k];   yL = y[k];             /* find (xL, yL) and (xR, yR) */
   xR = x[k+1]; yR = y[k+1];
   f = (t - xL) / (xR - xL);           /* f = fraction of interval [xL, xR] */
   p[idx] = (1 - f)#yL + f#yR;        /* interpolate between yL and yR */
   return( p );
finish;
 
/* example of linear interpolation in SAS */
use Points; read all var {'x' 'y'}; close;
use Score; read all var 't'; close;
 
pred = LinInterp(x, y, t);
create PRED var {'t' 'pred'}; append; close;
QUIT;

Visualize a linear interpolation in SAS

The previous program writes the interpolated values to the PRED data set. You can concatenate the original data and the interpolated values to visualize the linear interpolation:

/* Visualize: concatenate data and predicted (interpolated) values */
data All;
set Points Pred;
run;
 
title "Linear Interpolation";
title2 "No Extrapolation";
proc sgplot data=All noautolegend;
   series x=x y=y;
   scatter x=x y=y / markerattrs=(symbol=CircleFilled size=12)
                    name="data" legendlabel="Data";
   scatter x=t y=Pred / markerattrs=(symbol=asterisk size=12 color=red)
                    name="interp" legendlabel="Interpolated Values";
   fringe t / lineattrs=(color=red thickness=2)
                    name="score" legendlabel="Values to Score";
   xaxis grid values=(0 to 10) valueshint label="X";
   yaxis grid label="Y" offsetmin=0.05;
   keylegend "data" "score" "interp";
run;

The graph is shown at the top of this article. A few noteworthy items:

  • The values -1 and 10.5 are not scored because they are outside the range of the data.
  • The values 0 and 1 correspond to a data point. The interpolated value is the corresponding data value.
  • The other values are interpolated onto the straight line segments that connect the data.

Performance of the IML algorithm

The IML algorithm is very fast. On my Windows PC (Pentium i7), the interpolation takes only 0.2 seconds for 1,000 data points and one million scoring values. For 10,000 data points and one million scoring values, the interpolation takes about 0.25 seconds. The SAS program that accompanies this article includes timing code.

Other SAS procedures that can perform linear interpolation

According to a SAS Usage Note, you can perform linear interpolation in SAS by using PROC EXPAND in SAS/ETS software or PROC TRANSREG in SAS/STAT software. Each has some limitations that I don't like:

  • Both procedures use the missing value trick to perform the fitting and scoring in a single call. That means you must concatenate the sample data (which is often small) and the query data (which can be large).
  • PROC EXPAND requires that the combined data be sorted by X. That can be easily accomplished by calling PROC SORT after you concatenate the sample and query data. However, PROC EXPAND does not support duplicate X values! For me, this makes PROC EXPAND unusable. It means that you cannot score the model at points that are in the original data, nor can you have repeated values in the scoring data.
  • If you use PROC TRANSREG for linear interpolation, you must know the number of sample data points, n. You must specify n – 2 on the NKNOTS= option on a SPLINE transformation. Usually, this means that you must perform an extra step (DATA step or PROC MEANS) and store n – 2 in a macro variable.
  • For scoring values outside the range of the data, PROC EXPAND returns a missing value. However, PROC TRANSREG extrapolates. If t < x1, then the extrapolated value at t is y1. Similarly, if t > xn, then the extrapolated value at t is yn.

I've included examples of using PROC EXPAND and PROC TRANSREG in the SAS program that accompanies this article. You can use these procedures for linear interpolation, but neither is as convenient as PROC IML.

With effort, you can use Base SAS routines such as the DATA step to implement a linear interpolation algorithm. An example is provided by KSharp in a SAS Support Community thread.

Summary

If you want to perform linear interpolation in SAS, the easiest and most efficient method is PROC IML. I have provided a SAS/IML function that implements linear interpolation by using two input vectors (x and y) to define the model and one input vector (t) to specify the points at which to interpolate. The function returns the interpolated values for t.

The post Linear interpolation in SAS appeared first on The DO Loop.

4月 272020
 

I've previously written about how to generate points that are uniformly distributed in the unit disk. A seemingly unrelated topic is the distribution of eigenvalues (in the complex plane) of various kinds of random matrices. However, I recently learned that these topics are somewhat related!

A mathematical result called the "Circular Law" states that the (complex) eigenvalues of a (scaled) random n x n matrix are uniformly distributed in a disk as n → ∞. For this article, a random matrix is one whose entries are independent random variates from a specified distribution that has mean 0 and unit variance. A more precise statement of the circular law is found in the link, but, loosely speaking, if An is a sequence of random matrices of increasing size, the distribution of the eigenvalues of (1/sqrt(n)) An converges to the uniform distribution in the unit disk as n → ∞.

This article uses SAS to visualize the eigenvalue of several large random matrices. The graph to the right shows the complex eigenvalues for an n x n matrix whose entries are normal random variates from N(0, 1/sqrt(n)), where n = 2000. These eigenvalues appear to be uniformly distributed, as the circular law dictates, and most are inside the unit disk.

To demonstrate the circular law, you can fill a large n x n matrix with independent variates from any probability distribution that has mean 0 and unit variance. The matrix is then scaled by dividing each entry by sqrt(n). Equivalently, you can save a step and fill the matrix with random variates from a distribution that has 0 mean and standard deviation 1/sqrt(n).

Eigenvalues of random matrices

Let me be perfectly clear: There are simple and efficient algorithms to simulate uniformly distributed variates in the unit disk. No one should use the eigenvalue of a large matrix as a way to simulate uniform variates in a disk!

The following SAS/IML program creates a random 2000 x 2000 matrix where each element is a random variate from the N(0, 1/sqrt(n)) distribution, which has mean 0 and standard deviation 1/sqrt(n). It then computes the eigenvalues (which are complex) and writes them to a data set. Because I intend to repeat these computations for other distributions, I use a macro variable (DSName) to hold the name of the variable and I use a macro (RandCall) to specify the IML statement that fills a matrix, M, with random variates:

/* Demonstrate the circular law: eigenvalues are uniform in unit disk */
title "Eigenvalues of n x n Matrix";
 
/* title, data set name, and IML statement for a distribution */
title2 "Elements ~ N(0, 1/sqrt(n))";
%let DSName = Normal;
%macro RandCall;
   call randgen(M, "Normal", 0, 1/sqrt(n));
%mend;
 
proc iml;
call randseed(12345);
n = 2000;                            /* sample size = dim of matrix */
M = j(n, n);
%RandCall;
v = eigval(M);
create &DSName from v[c={"Re" "Im"}]; append from v; close; 
quit;

To visualize the eigenvalues, you can create a scatter plot that displays the real part (Re) and the imaginary part (Im) of each eigenvalue. I want to overlay a unit circle on the graph, so the following DATA step creates a piecewise linear approximation of a circle and uses the POLYGON statement in PROC SGPLOT to overlay the eigenvalues and the circle:

data Circle;
retain ID 1  k 100;
pi = constant('pi');
do theta = 0 to 2*pi by 2*pi/k;
   px = cos(theta);   py = sin(theta);   output;
end;
keep ID px py;
run;
 
data All;
set &DSName Circle;
label Re = "Real Part of Eigenvalue"
      Im = "Imaginary Part of Eigenvalue"
run;
 
ods graphics / width=480px height=480px;
proc sgplot data=All noautolegend;
   polygon x=px y=py ID=ID;   /* the unit circle */
   scatter x=Re y=Im;         /* the complex eigenvalues */
   xaxis grid; yaxis grid;
quit;

The graph is shown at the top of this article. It certainly appears that the eigenvalues of the 2000 x 2000 matrix are uniformly distributed. However, if you look carefully, a few eigenvalues appear to be outside of the unit disk. You can compute the distance from the origin for each eigenvalue (Re2 + Im2) and compare that distance to 1 to count how many of the 2000 eigenvalues are outside the unit circle:

data InOut;
set &DSName;
InDisk = (Re**2 + Im**2 <= 1);   /* indicator variable */
run;
 
proc freq data=InOut order=freq;
   tables InDisk / nocum ;
run;

For this random matrix, 12 eigenvalues are not inside the unit disk. Thus, although the eigenvalues appear to be approximately uniformly distributed, it is only in the limit as n → ∞ that all eigenvalues are inside the closed disk.

Other probability distributions

The remarkable thing about the circular law is that it is true for ANY probability distribution that has mean 0 and standard deviation 1/sqrt(n). To demonstrate, you can graph the eigenvalues for matrices that are filled with random variates from two other probability distributions:

  • U(-sqrt(3/n), sqrt(3/n)): For a uniform distribution on [a, b], the mean is b - a and the standard deviation is (b - a) / sqrt(12).
  • A discrete distribution for the random variable X/sqrt(n), where X can take on the values -3, 0, and 1 with probabilities 1/12, 8/12, and 3/12, respectively. For a discrete distribution, the mean is μ = Σi piXi and the variance is Σi pi(Xi - μ)2, where pi = P(X=xi).

To plot the eigenvalues of the matrix filled with random variates from the uniform distribution, you can modify the macro variables in the previous program:

/* matrix of random uniform variates */
title2 "Elements ~ U(-sqrt(3/n), sqrt(3/n))";
%let DSName = Uniform;
%macro RandCall;
   call randgen(M, "Uniform", -sqrt(3/n), sqrt(3/n));
%mend;

To plot the eigenvalues of the matrix filled with random variates from the discrete distribution, modify the macro variables as follows:

/* matrix of discrete random uniform variates */
title2 "Elements ~ P{X=-3)=1/12, P(X=0)=8/12, P(X=1)=3/12";
%let DSName = Discrete;
%macro RandCall;
   call randgen(M, "Table", {0.0833333, 0.6666667, 0.25});  /* returns values 1, 2, 3 */
   vals = {-3, 0, 1} / sqrt(n);     /* substitute values from discrete distribution */
   M = shape( vals[M], n, n );      /* reshape into n x n matrix */
%mend;

The graph for each matrix is shown below. To my eye, they look similar. I had wondered whether I would be able to determine which eigenvalues came from a matrix that has only three discrete entries (-3/2000, 0, and 1/2000). It is amazing that the spectrum of a matrix that has only three unique elements is so similar to the spectrum of a matrix in which almost every element is distinct!

In summary, I used SAS to visualize the Circular Law, which is an interesting theorem about the distribution of the eigenvalues of a random matrix. The theorem states that the distribution converges to the uniform distribution on the unit disk. The elements of the matrix are independent random variates from any probability distribution that has mean 0 and standard deviation 1/sqrt(n). The SAS/IML language makes it convenient to create the random matrices and compute their eigenvalues.

The post The circular law for eigenvalues appeared first on The DO Loop.

11月 112019
 

In grade school, students learn how to round numbers to the nearest integer. In later years, students learn variations, such as rounding up and rounding down by using the greatest integer function and least integer function, respectively. My sister, who is an engineer, learned a rounding method that rounds half-integers to the nearest even number. This method is called the round-to-even method. (Other names include the round-half-to-even method, the round-ties-to-even method, and "bankers' rounding.") When people first encounter the round-to-even method, they are often confused. Why would anyone use such a strange rounding scheme? This article describes the round-to-even method, explains why it is useful, and shows how to use SAS software to apply the round-to-even method.

What is the round-to-even method of rounding?

The round-to-even method is used in engineering, finance, and computer science to reduce bias when you use rounded numbers to estimate sums and averages. The round-to-even method works like this:

  • If the difference between the number and the nearest integer is less than 0.5, round to the nearest integer. This familiar rule is used by many rounding methods.
  • If the difference between the number and the nearest integer is exactly 0.5, look at the integer part of the number. If the integer part is EVEN, round towards zero. If the integer part of the number is ODD, round away from zero. In either case, the rounded number is an even integer.

All rounding functions are discontinuous step functions that map the real numbers onto the integers. The graph of the round-to-even function is shown to the right.

I intentionally use the phrase "round away from zero" instead of "round up" because you can apply the rounding to positive or negative numbers. If you round the number -2.5 away from zero, you get -3. If you round the number -2.5 towards zero, you get -2. However, for simplicity, the remainder of the article uses positive numbers.

Examples of the round-to-even method of rounding

For the number 2.5, which integer is closest to it? Well, it's a tie: 2.5 is a "half-integer" that is just as close to 2 as it is to 3. So which integer should you choose as THE rounded value? The traditional rounding method rounds the midpoint between two integers away from zero, so 2.5 is traditionally rounded to 3. This produces a systematic bias: all half-integers are rounded away from zero. This fact leads to biased estimates when you use the rounded data in an analysis.

To reduce this systematic bias, you can use the round-to-even method, which rounds some half-integers away from zero and others towards zero. For the round-to-even method, the number 2.5 rounds down to 2, whereas the number 3.5 rounds up to 4.

The table at the right shows some decimal values and the results of rounding the values under the standard method and the round-to-even method. The second column (Round(x)) shows the result of the traditional rounding method where all half-integers are rounded away from 0. The third column (RoundE(x)) is the round-to-even method that rounds half-integers to the nearest even integer. The red boxes indicate numbers for which the Round and RoundE functions produce different answers. Notice that for the round-to-even method, 50% of the half-integers round towards 0 and 50% round away from 0. In the table, 0.5 and 2.5 are rounded down by the round-to-even method, whereas 1.5 and 3.5 are rounded up.

Why use the round-to-even method of rounding?

The main reason to use the round-to-even method is to avoid systematic bias when calculating with rounded numbers. One application involves mental arithmetic. If you want to estimate the sum (or mean) of a list of numbers, you can mentally round the numbers to the nearest integer and add up the numbers in your head. The round-to-even method helps to avoid bias in the estimate, especially if many of the values are half-integers.

Most computers use the round-to-even method for numerical computations. The round-to-even method has been a part of the IEEE standards for rounding since 1985.

How to use the round-to-even method in SAS?

SAS software supports the ROUND function for standard rounding of numbers and the ROUNDE function ('E' for 'even') for round-to-even rounding. For example, the following DATA step produces the table that is shown earlier in this article:

data Seq;
keep x Round RoundEven;
label Round = "Round(x)" RoundEven="RoundE(x)";
do x = 0 to 3.5 by 0.25;
   Round     = round(x);    /* traditional: half-integers rounded away from 0 */
   RoundEven = rounde(x);   /* round half-integers to the nearest even integer */
   output;
end;
run;
 
proc print data=Seq noobs label;
run;

An application: Estimate the average length of lengths with SAS

Although the previous sections discuss rounding values like 0.5 to the nearest integer, the same ideas apply when you round to the nearest tenth, hundredth, thousandth, etc. The next example rounds values to the nearest tenth. Values like 0.95, 1.05, 1.15, etc., are equidistant from the nearest tenth and can be rounded up or down, depending on the rounding method you choose. In SAS, you can use an optional argument to the ROUND and ROUNDE functions to specify the unit to which you want to round. For example, the expression ROUND(x, 0.1) rounds x to the nearest tenth.

An example in the SAS documentation for PROC UNIVARIATE contains the effective channel length (in microns) for 425 transistors from "Lot 1" of a production facility. In the data set, the lengths are recorded to two decimal places. What would be the impact on statistical measurements if the engineer had been lazy and decided to round the measurements to one decimal place, rather than typing all those extra digits?

The following SAS DATA step rounds the data to one decimal place (0.1 microns) by using the ROUND and ROUNDE functions. The call to PROC MEANS computes the mean and sum of the unrounded and rounded values. For the full-precision data, the estimate of the mean length is 1.011 microns. If you round the data by using the standard rounding method, the estimate shoots up to 1.018, which overestimates the average. In contrast, if you round the data by using the round-to-even method, the estimate is 1.014, which is closer to the average of the unrounded numbers (less biased). Similarly, the Sum column shows that the sum of the round-to-even values is much closer to the sum of the unrounded values.

/* round real data to the nearest 0.1 unit */
data rounding;
set Channel1;
   Round     = round(Length, 0.1);    /* traditional: round to nearest tenth */
   RoundEven = rounde(Length, 0.1);   /* use round-to-even method to round to nearest tenth */
   /* create a binary indicator variable: Was x rounded up or down? */
   RoundUp     = (Round > Length);    /* 1 if rounded up; 0 if rounded down */
   RoundEvenUp = (RoundEven > Length);
run;
 
proc means data=rounding sum mean ndec=3;
   label Length="True values" Round ="Rounded values" RoundEven="Round-to-even values";
   var Length Round RoundEven;
run;

As mentioned earlier, when you use the traditional rounding method, you introduce a bias every time you encounter a "half-unit" datum such as 0.95, 1.05, or 1.15. For this real data, you can count how many data were rounded up versus rounded down by each method. To get an unbiased result, you should round up the half-unit data about as often as you round down. The following call to PROC MEANS shows the proportion of data that are rounded up and rounded down by each method. The output shows that about 55% of the data are rounded up by the traditional rounding method, whereas a more equitable 50.1% of the values are rounded up by the round-to-even method.

proc means data=rounding mean ndec=3;
   label RoundUp    = "Proportion rounded up for ROUND"
         RoundEvenUp= "Proportion rounded up for ROUNDE";
   var RoundUp RoundEvenUp;
run;

This example illustrates a general truism: The round-to-even method is a less biased way to round data.

Summary

This article explains the round-to-even method of rounding. This method is not universally taught, but it is taught to college students in certain disciplines. The method rounds most numbers to the nearest integer. However, half-integers, which are equally close to two integers, are rounded to the nearest EVEN integer, thus giving the method its name. This method reduces the bias when you use rounded values to estimate quantities such as sums, means, standard deviations, and so forth.

Whereas SAS provides separate ROUND and ROUNDE functions, other languages might default to the round-to-even method. For example, the round() function in python and the round function in R both implement the round-to-even method. Because some users are unfamiliar with this method of rounding, the R documentation provides an example and then explicitly states, "this is *good* behaviour -- do *NOT* report it as bug!"

You can download the SAS programs and data that are used in this article.

The post Round to even appeared first on The DO Loop.

9月 172018
 

The SAS/IML language and the MATLAB language are similar. Both provide a natural syntax for performing high-level computations on vectors and matrices, including basic linear algebra subroutines. Sometimes a SAS programmer will convert an algorithm from MATLAB into SAS/IML. Because the languages are not identical, I am sometimes asked, "what is the SAS/IML function that is equivalent to the XYZ function in MATLAB?" One function I am often asked about is the linspace function in MATLAB, which generates a row vector of n evenly spaced points in a closed interval. Although I have written about how to generate evenly spaced points in SAS/IML (and in the DATA step, too!), the name of the SAS/IML function that performs this operation (the DO function) is not very descriptive. Understandably, someone who browses the documentation might pass by the DO function without realizing that it is the function that generates a linearly spaced vector. This article shows how to construct a SAS/IML function that is equivalent to the MATLAB linspace function.

Generate equally spaced points in an interval

Syntactically, the main difference between the DO function in SAS/IML and the linspace function in MATLAB is that the third argument to the DO function is a step size (an increment), whereas the third function to the linspace function is the number of points to generate in an interval. But that's no problem: to generate n evenly spaced points on the interval [a, b], you can use a step size of (b – a)/(n – 1). Therefore, the following SAS/IML function is a drop-in replacement for the MATLAB linspace function:

proc iml;
/* generate n evenly spaced points (a linearly spaced vector) in the interval [a,b] */
start linspace(a, b, numPts=100);
   n = floor(numPts);               /* if n is not an integer, truncate */
   if n < 1 then return( {} );      /* return empty matrix */
   else if n=1 then return( b );    /* return upper endpoint */
   return( do(a, b, (b-a)/(n-1)) ); /* return n equally spaced points */
finish;

A typical use for the linspace function is to generate points in the domain of a function so that you can quickly visualize the function on an interval. For example, the following statements visualize the function exp( -x2 ) on the domain [-3, 3]:

x = linspace(-3, 3);                /* by default, 100 points in [-3,3] */
title "y = exp( -x^2 )";
call series(x, exp(-x##2));         /* graph the function */
Graph of y = exp( -x^2 )

Reminder: "10.0 times 0.1 is hardly ever 1.0"

This is a good time to remind everyone of the programmer's maxim (from Kernighan and Plauger, 1974, The Elements of Programming Style) that "10.0 times 0.1 is hardly ever 1.0." Similarly, "5 times 0.2 is hardly ever 1.0." The maxim holds because many finite decimal values in base 10 have a binary representation that is infinite and repeating. For example, 0.1 and 0.2 are represented by repeating decimals in base 2. Specifically, 0.210 = 0.00110011...2. Thus, just as 3 * (0.3333333) is not equal to 1 in base 10, so too is 5 * 0.00110011...2 not equal to 1 in base 2.

A consequence of this fact is that you should avoid testing floating-point values for equality. For example, if you generate evenly spaced points in the interval [-1, 1] with a step size of 0.2, do not expect that 0.0 is one of the points that are generated, as shown by the following statements:

z = do(-1, 1, 0.2);
/* find all points that are integers */
idx = loc( z = int(z) );               /* test for equality (bad idea) */
print (idx // z[,idx])[r={'idx', 'z[idx]'}];  /* oops! 0.0 is not there! */
 
print z;                               /* show that 0.0 is not one of the points */

When you query for all values for which z = int(z), only the values -1 and +1 are found. If you print out the values in the vector, you'll see that the middle value is an extremely tiny but nonzero value (-5.55e=17). This is not a bug but is a consequence of the fact that 0.2 is represented as a repeating value in binary.

So how can you find the points in a vector that "should be" integers (in exact arithmetic) but might be slightly different than an integer in floating-point arithmetic? The standard approach is to choose a small distance (such as 1e-12 or 1e-14) and look for floating-point numbers that are within that distance from an integer. In SAS, you can use the ROUND function or check the absolute value of the difference, as follows:

eps = 1e-12;
w = round(z, eps);                            /* Round to nearest eps */
idx = loc( int(w) = w);                       /* find points are within epsilon of integer */
print idx;                                    
 
idx = loc( abs(int(z) - z) < eps );           /* find points whose distance to integer is less than eps */
print (idx // z[,idx])[r={'idx', 'z[idx]'}];

In summary, this article shows how to define a SAS/IML function that is equivalent to the MATLAB linspace function. It also reminds us that some finite decimal values (such as 0.1 and 0.2) do not have finite binary representations. When these values are used to generate an arithmetic sequence, the resulting vector of values might be different from what you expect. A wise practice is to never test a floating-point value for equality, but instead to test whether a floating-point value is within a small distance from a target value.

The post Linearly spaced vectors in SAS appeared first on The DO Loop.