Statistical Programming

February 15, 2021
 

I've previously written about how to generate all pairwise interactions for a regression model in SAS. For a model that contains continuous effects, the easiest way is to use the EFFECT statement in PROC GLMSELECT to generate second-degree "polynomial effects." However, a SAS programmer was running a simulation study and wanted to be able to generate all pairwise interaction effects within SAS/IML. One way to solve the programmer's problem is to use the HDIR function in SAS/IML. This article introduces the HDIR function and shows how to use it to generate pairwise (quadratic) interactions for continuous effects.

Generate pairwise interactions by using the EFFECT statement

Before trying to solve a problem with a new tool, I like to use an existing method to solve the problem. Then I can compare the old and new answers and make sure they are the same.

For sample data, let's use the numerical variables in the Sashelp.Class data. So that the techniques are clearer, I will rename the variables to X1, X2, and X3. We want to generate all six pairwise interactions, including the "pure quadratic" terms where a variable interacts with itself. The names of the interaction effects will be X1_2, X1_X2, X1_X3, X2_2, X2_X3, and X3_2. The names with "_2" at the end are pure quadratic effects; the others are interactions.

One way to generate a design matrix that has all pairwise interactions is to use the EFFECT statement in PROC GLMSELECT. The interactions are written to the OUTDESIGN= data set. The columns of the design matrix have interpretable names, which are stored in the _GLSMOD macro variable.

data Have;         /* create the example data */
set sashelp.class(rename=(Age=X1 Height=X2 Weight=X3));
if _N_=3 then X1=.;    /* add a missing value */
keep _NUMERIC_;
run;
 
/* Use PROC GLMSELECT to generate all pairwise interactions.
   Step 1: Add a fake response variable */
%let DSIn  = Have;         /* the name of the input data set */
%let DSOut = Want;         /* the name of the output data set */
data AddFakeY / view=AddFakeY;
   set &DSIn;
   _Y = 0;      /* add a fake response variable */
run;
 
/* Step 2: Use the EFFECT statement to generate the degree-2 effects, as shown in 
      https://blogs.sas.com/content/iml/2017/09/07/polynomial-effects-regression-sas.html */
proc glmselect data=AddFakeY NOPRINT outdesign(addinputvars)=&DSOut(drop=_Y);
   effect poly2 = polynomial( X1 X2 X3 / degree=2);
   model _Y = poly2 /  noint selection=none;  /* equiv to X1*X1 X1*X2 X1*X3 X2*X2 X2*X3 X3*X3 */
run;
%put &=_GLSMOD;    /* look at names of the interaction effects  in the OUTDESIGN= data */
 
proc print data=&DSOut(obs=5);
   var X1_2 X1_X2 X1_X3 X2_2 X2_X3 X3_2; /* names are generated automatically */
run;

The output from PROC GLMSELECT contains the "pure quadratic" interactions (X1_2, X2_2, and X3_2) and the cross-variable interactions (X1_X2, X1_X3, and X2_X3). If you have k variables, there will be k pure quadratic terms and "k choose 2" cross-variable terms. Hence, the total number of quadratic interactions is "(k+1) choose 2," which is (k+1)*k/2. Here, k=3 and there are six quadratic interactions.
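If you want to verify the count, the following short DATA step (my own check, not part of the original program) evaluates the formula by using the COMB function:

/* count the quadratic interactions for k variables */
data _null_;
   k = 3;
   numQuad = comb(k+1, 2);   /* k pure quadratic terms + comb(k,2) cross terms */
   put numQuad=;             /* prints numQuad=6 */
run;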

Notice how PROC GLMSELECT handles the missing value in the third observation: because the X1 value is missing, the procedure puts a missing value into all interaction effects.

The horizontal direct product between matrices

The horizontal direct product between matrices A and B is formed by the elementwise multiplication of their columns. The operation is most often used to form interactions between dummy variables for two categorical variables. If A is the design matrix for the categorical variable C1 and B is the design matrix for the categorical variable C2, then HDIR(A,B) is the design matrix for the interaction effect C1*C2.

The following simple example shows how the HDIR function works. The HDIR function returns a matrix whose columns are formed by elementwise multiplication of the columns of A and the matrix B. The first set of columns is the product A[,1]#B, the second set is A[,2]#B, and so forth. If A has k1 columns and B has k2 columns, then HDIR(A,B) has k1*k2 columns.

proc iml;
/* horizontal direct product multiplies each column of A by each column of B */
A = {1 2,
     2 3,
     4 5};
B = {0  1 2,
     1  1 3,
     2 -1 4};
C = hdir(A,B);
print C[c={'A1_B1' 'A1_B2' 'A1_B3' 'A2_B1' 'A2_B2' 'A2_B3'} L='hdir(A,B)'];
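For these small matrices, you can verify the result by hand. The first three columns are the product A[,1]#B and the last three are A[,2]#B:

                  hdir(A,B)
A1_B1  A1_B2  A1_B3  A2_B1  A2_B2  A2_B3
    0      1      2      0      2      4
    2      2      6      3      3      9
    8     -4     16     10     -5     20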

Interactions of continuous variables

The previous section shows that if X is a data matrix of continuous variables, the function call HDIR(X,X) generates all pairwise combinations of columns of X. Unfortunately, the matrix HDIR(X,X) contains more columns than we need. If X contains k columns, we need only the "(k+1) choose 2" = (k+1)k/2 columns of interactions, but the matrix HDIR(X,X) contains k*k columns. The problem is that HDIR(X,X) contains columns for both X1*X2 and X2*X1, even though those columns are the same. The same holds for other crossed terms such as X1*X3 and X3*X1, X2*X3 and X3*X2, and so forth.

If you want to use the HDIR function to generate all pairwise interactions, you have a choice: You can generate ALL products of pairs and delete the redundant ones, or you can compute only the unique pairs. I will show the latter because it is more efficient.
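For reference, the first approach looks like the following sketch (my own illustration; X is the data matrix that is read in the next block). The columns of HDIR(X,X) correspond to the (i,j) pairs in row-major order, so you can keep only the columns for which i ≤ j:

/* sketch: form all k*k products, then keep the comb(k+1,2) unique columns */
k = ncol(X);
full = hdir(X, X);                /* k*k columns */
keep = j(1, comb(k+1,2), .);
c = 1;
do i = 1 to k;
   do j = i to k;                 /* unique pairs have i <= j */
      keep[c] = (i-1)*k + j;      /* column index of the (i,j) product */
      c = c + 1;
   end;
end;
quad = full[, keep];              /* the unique quadratic interactions */

The rest of this section implements the more efficient approach: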

varNames = 'X1':'X3';
use Have;
   read all var varNames into X;
close;
 
/* Compute only the columns we need */
A1 = HDIR(X[,1], X[,1:3]);    /* interaction of X1 with X1, X2, X3 */
A2 = HDIR(X[,2], X[,2:3]);    /* interaction of X2 with X2, X3 */
A3 = HDIR(X[,3], X[,3]);      /* interaction of X3 with X3 */
A = A1 || A2 || A3;
 
/* get the HEAD module from 
   https://blogs.sas.com/content/iml/2021/02/10/print-top-rows-of-data-sas.html
*/
load module=(Head);
run Head(A) colname={'X1_2' 'X1_X2' 'X1_X3' 'X2_2' 'X2_X3' 'X3_2'};

The HEAD module displays only the top five rows of the matrix of interactions. This output is almost the same as the output from PROC GLMSELECT. A difference is how missing values propagate. For the third row, X1 is missing. PROC GLMSELECT puts a missing value in all columns of the pairwise interactions. In contrast, the SAS/IML output only has missing values in the first three columns because those are the only columns that involve the X1 variable. If your data contain missing values, you might want to use the CompleteCases function to delete the rows that have missing values before performing additional analyses.
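For example, here is a minimal sketch of that cleanup step. It assumes that the COMPLETECASES function returns the indices of the rows of A that have no missing values; if that function is not available in your release of SAS/IML, the expression in the comment computes the same indices:

nonMissIdx = completecases(A);                    /* indices of complete rows */
/* equivalently: nonMissIdx = loc( countmiss(A, "row")=0 ); */
A_complete = A[nonMissIdx, ];                     /* keep only the complete cases */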

A function that computes interactions of continuous variables

The previous section computes all quadratic interactions for a matrix that has three variables. It is straightforward to generalize the idea to k variables. You simply compute the interactions HDIR(X[,i],X[,i:k]) for i=1,2,...,k. This is done in the following program. During the computation, it is convenient to also generate names for the interactions. The following program generates the same names as PROC GLMSELECT. To make the code easier to understand, I encapsulate the logic for names into a helper function called GetNamePairs. The helper function is called as part of the InteractPairs module, which returns both the matrix of interactions and the character vector that names the columns:

/* Helper function: Form names of interaction effects.
   Input: s is a scalar name. t is a character vector of names. 
   Return row vector of pairs of names "s_2" or "s_t[i]" 
   Example: GetNamePairs('X1', 'X1':'X3') returns {'X1_2' 'X1_X2' 'X1_X3'} */
start GetNamePairs(s, t);
   k = nrow(t)*ncol(t);
   b = blankstr(nleng(s)+nleng(t)+1);  /* max length of interaction name */
   pairs = j(1, k, b);
   do i = 1 to k;
      if s=t[i] then pairs[i] = strip(s) + "_2";
      else           pairs[i] = catx("_", strip(s), strip(t[i])); 
   end;
   return pairs;
finish;
 
/* Generate all quadratic interactions for continuous variables.
   Input: design matrix X and a vector of column names.
   Output: OutX:     a matrix whose columns are the pairwise interactions
           OutNames: the names of interactions, such as Age_2 and Height_Age */
start InteractPairs(OutX, OutNames, X, Names);
   k = ncol(X);
   numInteract = comb(k+1,2);  /* = k + comb(k,2) */
   OutX = j(nrow(X), numInteract, .);
   OutNames = j(1, numInteract, BlankStr(2*nleng(Names)+1));
   col = 1;              /* initial column to fill */
   do i = 1 to k;
   m = k-i+1;         /* number of interactions for this loop */
      OutX[,col:col+m-1]     = HDIR(X[,i], X[,i:k]);
      OutNames[,col:col+m-1] = GetNamePairs(Names[i], Names[i:k]);
      col = col + m;
   end;
finish;
 
run InteractPairs(OutX, OutNames, X, varNames);
run Head(OutX) colname=OutNames;

The output is identical to the earlier output. The input arguments to this module can have an arbitrary number of columns. In this example, the design matrix does not include an intercept column (a column that is all ones). Consequently, the output is the set of all quadratic interactions. If your design matrix includes an intercept column, the output will contain an intercept column and all main effects in addition to the quadratic effects.

Technically, you don't need to use the HDIR function in the InteractPairs module. Instead, you can use the familiar '#' operator to perform the elementwise multiplication. However, if you try to generalize the module to handle the interaction between categorical variables and continuous variables, the HDIR function will be useful.
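In other words, a one-line change inside the loop (my own sketch) replaces the HDIR call with the elementwise operator, which multiplies the column X[,i] by each column of X[,i:k]:

   OutX[,col:col+m-1] = X[,i] # X[,i:k];   /* same result as HDIR(X[,i], X[,i:k]) */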

Summary

This article introduces the HDIR function in SAS/IML, which computes the horizontal direct product of matrices. You can use the HDIR function to generate interaction effects in regression models. This article shows how to generate all quadratic interactions among a set of continuous variables.

The post Generate all quadratic interactions in a regression model appeared first on The DO Loop.

February 3, 2021
 

In a previous article, I showed how to generate random points uniformly inside a d-dimensional sphere. In that article, I stated the following fact:

If Y is drawn from the uncorrelated multivariate normal distribution, then S = Y / ||Y|| has the uniform distribution on the unit sphere.

I was reminded of this fact recently when I wrote about how to simulate a random walk in high dimensions. Each step of a random walk needs to generate a random direction and a random step length. For a 2-D random walk, it is easy to generate a random direction as an angle, θ, chosen from the uniform distribution on the interval [0,2π). However, for higher-dimensional random walks, I do not suggest using random angles to determine directions. Although it is mathematically possible to use spherical coordinates in d-dimensions, computations in a spherical coordinate system are extremely messy.

A better alternative is to use random unit vectors to determine a direction for each step in a random walk. (In fact, a unit vector is sometimes called a direction vector.) Since a unit vector is a vector from the origin to a point on the unit sphere, this article shows how to generate random points uniformly on the unit sphere. I show the computation twice: once by using the SAS/IML matrix language and again by using the SAS DATA step. In order to visualize the data, I show how to create a contour plot of ungridded data in SAS.

Random points on a circle

For simplicity, let's see how to generate random points on the unit circle in 2-D. The following SAS/IML program simulates N=1000 points from d=2 independent normal distributions. The notation X[ ,##] is a subscript reduction operator that returns a row vector that contains the sum of the squared elements of each row of X. Thus the expression sqrt(X[,##]) gives the Euclidean distance of each row to the origin. If you divide the coordinates by the distance, you obtain a point on the unit circle:

%let N = 1000;         /* number of steps in random walk */
%let d = 2;            /* S^{d-1} sphere embedded in R^d */
%let r = 1;            /* radius of sphere */
 
proc iml;
call randseed(12345);
r = &r; d = &d;
x = randfun(&N // d, "Normal");   /* N points from MVN(0, I(d)) */
norm = sqrt( x[,##] );            /* Euclidean distance to origin */
x = r * x / norm;                 /* scale point so that distance to origin is r */
title "&n Random Points on Circle";
call scatter(x[,1], x[,2]) grid={x y}
             procopt="aspect=1" option="transparency=0.8";
QUIT;

As promised, the resulting points are on the circle of radius r=1. Recall that uniformly at random does not mean evenly spaced. Point patterns that are generated by a uniform process will typically have gaps and clusters.

Random points on a sphere

You can also use the SAS DATA step to generate the random points on a sphere. You can generate each coordinate independently from a normal distribution and use the EUCLID function in Base SAS to compute the Euclidean distance from a point to the origin. You can then scale the coordinates so that the point is on the sphere of radius r. The following DATA step generates d-dimensional points for any d:

%let N = 2000;         /* number of steps in random walk */
%let d = 3;            /* S^{d-1} sphere embedded in R^d */
%let r = 1;            /* radius of sphere */
data RandSphere;
array x[&d];
call streaminit(12345);
do i = 1 to &N;
   do j = 1 to &d;
      x[j] = rand("Normal");      /* random point from MVN(0, I(d)) */
   end;
   norm = Euclid( of x[*] );      /* Euclidean distance to origin */
   do j = 1 to &d;
      x[j] = &r * x[j] / norm;    /* scale point so that distance to origin is r */
   end;
   output;
end;
drop j norm;
run;

How can we visualize the points to assure ourselves that the program is running correctly? One way is to plot the first two coordinates and use colors to represent the value of the points in the third coordinate direction. For example, the following call to PROC SGPLOT creates a scatter plot of (x1,x2) and uses the COLORRESPONSE=x3 option to color the markers. The default color ramp is the ThreeAltColor ramp, which (for my ODS style) is a blue-black-red color ramp. That means that points near the "south pole" (x3 = -1) will be colored blue, points near the equator will be colored black, and points near the "north pole" (x3 = 1) will be colored red. I have also used the TRANSPARENCY= option so that the density of the points is shown.

title "&n Random Points on a Sphere";
proc sgplot data=RandSphere aspect=1;
   scatter x=x1 y=x2 / colorresponse=x3 transparency=0.5 markerattrs=(symbol=CircleFilled);
   xaxis grid; yaxis grid;
run;

Switching to (x,y,z) instead of (x1,x2,x3) notation, the graph shows that observations near the (x,y)-plane (for which x² + y² ≈ 1) have a dark color because the z-value is close to zero. In contrast, observations for which x² + y² ≈ 0 have either a pure blue or pure red color since those observations are near one of the poles.

A contour plot of ungridded data

Another way to visualize the data would be to construct a contour plot of the points in the upper hemisphere of the sphere. The exact contours of the upper hemisphere are concentric circles (lines of constant latitude) centered at (x,y)=(0,0). For these simulated data, a contour plot should display contours that are approximately concentric circles.

In a previous article, I showed how to use PROC TEMPLATE and PROC SGRENDER in SAS to create a contour plot. However, in that article the data were aligned on an evenly spaced grid in the (x,y) plane. The data in this article have random positions. Consequently, on the CONTOURPLOTPARM statement in PROC TEMPLATE, you must use the GRIDDED=FALSE option so that the plot interpolates the data onto a rectangular grid. The following Graph Template Language (GTL) defines a contour plot for irregularly spaced data and creates the contour plot for the random point on the upper hemisphere:

/* Set GRIDDED=FALSE if the points are not arranged on a regular grid */
proc template;
define statgraph ContourPlotParmNoGrid;
dynamic _X _Y _Z _TITLE;
begingraph;
   entrytitle _TITLE;
   layout overlay;
      contourplotparm x=_X y=_Y z=_Z / GRIDDED=FALSE /* for irregular data */
        contourtype=fill nhint=12  colormodel=twocolorramp name="Contour";
      continuouslegend "Contour" / title=_Z;
   endlayout;
endgraph;
end;
run;
 
ods graphics / width=480px height=480px;
proc sgrender data=RandSphere template=ContourPlotParmNoGrid;
where x3 >= 0;
dynamic _TITLE="Contour Plot of &n Random Points on Upper Hemisphere"
        _X="x1" _Y="x2" _Z="x3";
run;

The data are first interpolated onto an evenly spaced grid and then the contours are estimated. The contours are approximately concentric circles. From the contour plot, you can conclude that the data form a dome or hemisphere above the (x,y) plane.

Summary

This article shows how to use SAS to generate random points on the sphere in any dimension. You can use this technique to generate random directions for a random walk. In addition, the article shows how to use the GRIDDED=FALSE option on the CONTOURPLOTPARM statement in GTL to create a contour plot from irregularly spaced ("ungridded") data in SAS.

The post Generate random points on a sphere appeared first on The DO Loop.

February 1, 2021
 

Imagine an animal that is searching for food in a vast environment where food is scarce. If no prey is nearby, the animal's senses (such as smell and sight) are useless. In that case, a reasonable search strategy is a random walk. The animal can choose a random direction, walk/swim/fly in that direction for a while, and hope that it encounters food. If not, it can choose another random direction and try again.

According to a story in Quanta magazine, an optimal search strategy is to use a variant of a random walk that is known as a Lévy flight. This article describes a Lévy flight, simulates it in SAS, and compares it to a Gaussian random walk.

Random walks

For the purpose of this article, a 2-D random walk is defined as follows. From the current position,

  • Choose a random direction by choosing an angle, θ, uniformly in [0, 2π).
  • Choose a random distance, r.
  • Move that distance in that direction.

The amount of territory that gets explored during a random walk depends on the distribution of the distances. For example, if you always choose a unit distance, you obtain the "drunkard's walk." If you choose distances from a normal distribution, you obtain a Gaussian walk. In both scenarios, the animal primarily searches near where it has been previously. These random walks that always use short distances are good for finding food in a food-rich environment.

In a food-scarce environment, it is a waste of time and energy to always move a short distance from the previous position. It turns out that a better strategy is to occasionally move a long distance. That puts the animal in a new location, which it can explore by moving small distances again. According to the Quanta article, this behavior has been observed "to greater or lesser extents" in many animals that hunt at sea, including albatross, sharks, turtles, penguins, and tuna (Sims, et al., Nature, 2008).

Levy walks and the Levy distribution

A popular model for this behavior is the Lévy walk, which is often called a Lévy flight because it was first applied to the flight path of an albatross over open water. In a Lévy flight, the distance is a random draw from a heavy-tailed distribution. In this article, I will simulate a Lévy flight by drawing distances from a Lévy distribution, which is a special case of the inverse gamma distribution.

The Lévy distribution Levy(μ, c) has a location parameter (μ) and a shape parameter (c). It is a special case of the inverse gamma distribution, which is represented by IGamma(α, β). If X ~ Levy(μ, c) is a random variable, then X – μ ~ IGamma(1/2, c/2). In this article, μ=0, so a random draw from the Levy(0,c) distribution is a random draw from the IGamma(1/2, c/2) distribution. Thus, you can use the SAS program in the previous article to generate a random sample from the Lévy distribution.

The inverse gamma distribution with α=1/2 has very heavy tails. This means that a random variate can be arbitrarily large. When modeling a physical or biological system, there are often practical bounds on the maximum value of a quantity. For example, if you are modeling the flight of an albatross, you know that there is a maximum distance that an albatross can fly in a given time. Thus, researchers often use a truncated Lévy distribution to model the distances that an animal travels in one unit of time. For example, an albatross can fly up to 800 km in one day (!!), so a model of albatross distances should truncate the distances at 800.

Simulate a Gaussian and Levy Walk

You can use the SAS DATA step to simulate random walks. The program in this section generates a random direction in [0, 2π), then generates a random distance. For the Gaussian walk, the distance is the absolute value of a random N(0, σ) variate. For the Lévy walk, the distances are sampled from the Levy(0, 0.3*σ) distribution. The constant 0.3 is chosen so that the two distance distributions have approximately the same median.
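As a quick check of that claim (my own computation, not from the original post), you can compare the two medians by using the QUANTILE function. The median of |N(0,σ)| is sigma*quantile('Normal', 0.75) ≈ 0.674σ, and the median of the Levy(0,c) distribution follows from the inverse gamma representation:

/* compare the medians of |N(0,sigma)| and Levy(0, 0.3*sigma) */
data _null_;
   sigma = 0.1;
   c = 0.3*sigma;
   medNormal = sigma * quantile('Normal', 0.75);      /* ~ 0.0674 */
   medLevy   = 1 / quantile('Gamma', 0.5, 0.5, 2/c);  /* Levy(0,c)=IGamma(1/2,c/2); ~ 0.0660 */
   put medNormal= medLevy=;
run;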

For convenience, you can define a macro that makes it easy to generate random variates from the Lévy distribution. The following program uses the %RandIGamma macro (explained in the previous article), which generates a random variate from the inverse gamma distribution.

The following program simulates four individuals (animals) who start at the corners of a unit square with coordinates (0,0), (1,0), (1,1), and (0,1). At each step, the animals choose a random direction and move a random distance in that direction. In one scenario, the distances are generated as the absolute value of a normal variate. In the other scenario, the distances are Lévy distributed.

/* useful macros: https://blogs.sas.com/content/iml/2021/01/27/inverse-gamma-distribution-sas.html */
%macro RandIGamma(alpha, beta);          /* IGamma(alpha,beta) */
   (1 / rand('Gamma', &alpha, 1/(&beta)))
%mend;
%macro RandLevy(mu, scale);              /* Levy(mu,c) */
   (&mu + %RandIGamma(0.5, (&scale)/2))
%mend;
%macro RandTruncLevy(mu, scale, max);    /* Truncated Levy */
   min((&mu + %RandIGamma(0.5, (&scale)/2)), &max)
%mend;
 
%let N = 1000;         /* number of steps in random walk */
%let sigma = 0.1;      /* scale parameter for N(0,sigma) */
%let c = (0.3*&sigma); /* makes medians similar */
 
data Walks;
array x0[4] _temporary_ (0 1 1 0);  /* initial positions */
array y0[4] _temporary_ (0 0 1 1);
call streaminit(12345);
twopi = 2*constant('pi');
c = &c;                    /* Levy parameter */
do subject = 1 to dim(x0);
   t=0;                    /* initial position for subject */
   x = x0[subject];   y = y0[subject]; 
   r=.; theta=.;           /* r = random distance; theta=random angle */
   Walk = "Gaussian";  output;
   Walk = "Levy";      output;
   xN = x; yN = y;         /* (xN,yN) are coords for Gaussian walk */
   xL = x; yL = y;         /* (xL,yL) are coords for Levy walk */
   do t = 1 to &N;         /* perform steps of a random walk */
      theta = rand("Uniform", 0, twopi);  /* random angle */
      ux = cos(theta); uy = sin(theta);   /* u=(ux,uy) is random unit vector */
      /* take a Gaussian step; remember location */
      Walk = "Gaussian";
      r = abs( rand("Normal", 0, &sigma) ); /* ~ abs(N(0,sigma)) */
      xN = xN + r*ux;   yN = yN + r*uy;
      /* output the Gaussian walk step */
      x = xN; y = yN; output;
 
      /* take a (truncated) Levy step; remember location */
      Walk = "Levy";
      r = %RandTruncLevy(0, c, 2);  /* Levy(0,c) truncated into [0,2] */
      xL = xL + r*ux;   yL = yL + r*uy;
      /* output the Levy walk step */
      x = xL; y = yL; output;
   end;
end;
drop twopi c xN yN xL yL ux uy;
run;

Let's compare the step sizes for the two types of random walks. The following call to PROC SGPANEL creates histograms of the step lengths for each type of random walk:

title "Distribution of Step Lengths";
proc sgpanel data=Walks;
   label r = "Step Length";
   panelby Walk;
   histogram r / binwidth=0.1 binstart=0.05;
run;

The graph shows that all step lengths for the Gaussian random walk are less than 0.4, which represents four standard deviations for the normal distribution. In contrast, the distribution of distances for the Lévy walk has a long tail. About 15% of the distances are greater than 0.75. About 9% of the step lengths are the maximum possible value, which is 2.

How does that difference affect the amount of territory covered by an animal that is foraging for food? The following statements show the paths of the four animals for each type of random walk:

title "Position of Subjects";
title2 "50 Steps of a Random Walk";
proc sgpanel data=Walks aspect=1;
   where t <= 50;
   panelby Walk / onepanel;
   series x=x y=y / group=Subject;
   colaxis grid; rowaxis grid;
run;

The differences in the paths are evident. For the Gaussian random walk, the simulated animals never leave a small neighborhood around their starting positions. After 50 steps, none of the animals who are taking a Gaussian walk have moved more than one unit away from their starting position. In contrast, the animals who take a Lévy walk have explored a much greater region. You can see that the typical motion is several small steps followed by a larger step that brings the animal into a new region to explore.

Generalizing to higher dimensions

It is easy to extend the random walk to higher dimensions. The trick is to represent the "random direction" by a unit vector (u) instead of by an angle (θ). In a previous article about random points within a d-dimensional ball, I remark that "If Y is drawn from the uncorrelated multivariate normal distribution, then u = Y / ||Y|| has the uniform distribution on the unit d-sphere." You can use this trick to generate random unit vectors, then step a distance r in that direction, where r is from a specified distribution of distances.
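A minimal sketch of one step of a d-dimensional walk (my own illustration, using the same RANDFUN trick as earlier in this series):

proc iml;
call randseed(4321);
d = 5;                                    /* dimension of the walk */
pos = j(1, d, 0);                         /* current position (start at origin) */
y = randfun(1 // d, "Normal");            /* 1 x d draw from MVN(0, I(d)) */
u = y / sqrt(y[,##]);                     /* random direction: unit vector */
r = abs( randfun(1, "Normal", 0, 0.1) );  /* random step length */
pos = pos + r*u;                          /* take one step */
print pos;
QUIT;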

Summary

An article in Quanta magazine describes how some animals use a swim/flight path that can be modeled by a (truncated) Lévy random walk. This type of movement is a great way to search for food in a vast food-scarce environment. This article defines the Lévy distribution as a sub-family of the inverse gamma distribution. It shows how to generate random variates from the Lévy distribution in SAS. Finally, it simulates a Gaussian random walk and a Lévy walk and visually compares the area explored by each.

The post Gaussian random walks and Levy flights appeared first on The DO Loop.

January 27, 2021
 

The inverse gamma distribution is a continuous probability distribution that is used in Bayesian analysis and in some statistical models. The inverse gamma distribution is closely related to the gamma distribution. For any probability distribution, it is essential to know how to compute four functions: the PDF function, which returns the probability density at a given point; the CDF function, which returns the probability that an observation is less than or equal to a particular value; the QUANTILE function, which is the inverse CDF function; and the RAND function, which generates a random variate.

For the gamma distribution, the four essential functions are supported directly in Base SAS. The functions for the inverse gamma distribution are not supported in the same way. However, you can exploit the relationship between the gamma and inverse gamma distributions to implement these four functions for the inverse gamma distribution. This article shows how to implement the PDF, CDF, QUANTILE, and RAND functions for the inverse gamma distribution in SAS.

What is the inverse gamma distribution?

If X is gamma distributed, then X > 0. As its name suggests, the inverse gamma distribution is the distribution of 1/X when X is gamma distributed. More precisely, let Gamma(α, β) be the gamma distribution with shape parameter α and scale parameter β. (This article always uses scale parameters, never a rate parameter.) Let IGamma(α, β) be the inverse gamma distribution with shape and scale parameters. Then if X ~ Gamma(α, β), the random variable 1/X ~ IGamma(α, 1/β). The documentation for PROC MCMC provides an additional discussion of the relationship between the gamma and inverse gamma distributions. There is also a Wikipedia page about the inverse gamma distribution.

A feature of the inverse gamma distribution is that it has a long tail for small values of the α shape parameter. In fact, the expected value (mean) is undefined when α ≤ 1 and the variance is undefined when α ≤ 2.

Generate random variates from the inverse gamma distribution

It is easy to generate random values from the inverse gamma distribution because that is how the distribution is defined. To obtain a random variate from IGamma(α, β), simply take the reciprocal of a random variate from the Gamma(α, 1/β) distribution. It might be helpful to define a little macro function that makes it easier to generate a random sample. The following SAS program generates a large random sample and displays a histogram:

/* Random values: If X ~ Gamma(alpha, 1/beta), then 1/X ~ IGamma(alpha, beta) */
%macro RandIGamma(alpha, beta);
   (1 / rand('Gamma', &alpha, 1/(&beta)))
%mend;
data IGammaRand;
   call streaminit(12345);
   alpha = 3;
   beta = 0.5;
   do j = 1 to 10000;
      x = %RandIGamma(alpha, beta);
      output;
   end;
run;
 
title "Random Sample from the Inverse Gamma Distribution";
title2 "alpha=3, beta=0.5";
proc sgplot data=IGammaRand noautolegend;
   histogram x / binwidth=0.05 binstart=0.025; 
   xaxis values=(0 to 3 by 0.25);
run;

For these values of the parameters, the histogram shows that the distribution has a peak at x=1/8 and a very long tail. The tail is truncated at x=3, but you can use PROC MEANS to see that the maximum value in the sample is 11.69.

The PDF of the inverse gamma distribution

The Wikipedia article gives a formula for the PDF of the inverse gamma distribution. For x > 0, the PDF formula is
\(\mbox{IGamma PDF}(x; \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} (1/x)^{\alpha+1} \exp(-\beta/x)\)

If you compare this to the PDF for the gamma distribution, you will notice that they are very similar. If you evaluate the gamma PDF with scale 1/β at the point 1/x, you almost get the inverse gamma PDF. All you need to do is divide by x², as follows:
IGamma_PDF(x; α, β) = Gamma_PDF(1/x; α, 1/β) / x²

You can use this relationship to evaluate the inverse gamma PDF in SAS. The following SAS DATA step evaluates the inverse gamma PDF for four choices of (α, β) parameters. These are the same choices that are used in the Wikipedia article for the inverse gamma distribution.

/* PDF : parameterize IGamma by using the scale parameter
   in terms of the Gamma PDF with shape parameter. */
%macro PDFIGamma(x, alpha, beta);
   (pdf("Gamma", 1/(&x), &alpha, 1/(&beta)) / ((&x)**2))
%mend;
 
data IGammaPDF;
array a[4] _temporary_ (1 2 3 3);
array b[4] _temporary_ (1 1 1 0.5);
do i = 1 to dim(a);
   alpha = a[i];
   beta  = b[i];
   group = cat('a=',alpha,', b=',beta);
   do x = 0.01 to 3 by 0.01;
      pdf = %PDFIGamma(x, alpha, beta);
      /* equivalent to 
      pdf = beta**alpha / Gamma(alpha) * x**(-alpha-1) * exp(-beta/x); */
      output;
   end;
end;
run;
 
title "PDF of the Inverse Gamma Distribution";
proc sgplot data=IGammaPDF;
   series x=x y=PDF / group=group lineattrs=(thickness=2);
   xaxis grid; yaxis grid label="Density";
run;

You can see that the brown curve (the PDF for (α, β) = (3, 0.5)) has the same shape as the histogram of the random sample in the previous section.

The CDF and quantiles of the inverse gamma distribution

In a previous article, I show that there is a relationship between the CDF of the gamma distribution and a special function called the regularized upper incomplete gamma function. I also show how to compute the incomplete gamma function in SAS. In Wikipedia, the CDF of the inverse gamma distribution is given in terms of the incomplete gamma function. Consequently, we can compute the CDF in SAS without difficulty.

The quantile function is more difficult. The quantile function is the inverse CDF. There are a LOT of reciprocals to keep track of during the derivation! Nevertheless, you can evaluate the quantile function of the inverse gamma distribution by using an expression that includes the quantile function of the gamma distribution. The derivation is left as an exercise.
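For readers who want a start on that exercise, here is a sketch of one path through the algebra. If X ~ IGamma(α, β), then 1/X ~ Gamma(α, 1/β), so the CDF satisfies
\(p = F_{IG}(x; \alpha, \beta) = P(1/X \geq 1/x) = 1 - F_{G}(1/x; \alpha, 1/\beta)\)
Apply the gamma quantile function to solve for x:
\(1/x = Q_{G}(1-p; \alpha, 1/\beta) \quad\Rightarrow\quad x = 1/Q_{G}(1-p; \alpha, 1/\beta)\)
which is exactly the expression in the %QuantileIGamma macro below.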

The following DATA step evaluates the CDF of the inverse gamma distribution for four sets of parameters. The quantile function is evaluated at the same time to ensure that Quantile(CDF(x)) = x. The CDF function is very flat for very small values of x, so the quantile function might fail to converge for very small values of the probability.

/* The CDF of the IGAMMA function is SDF('Gamma',1/x,alpha,1/beta) */
%macro CDFIGamma(x, alpha, beta);
   (SDF('Gamma', (&beta)/(&x), &alpha))
%mend;
%macro QuantileIGamma(p, alpha, beta);
   (1 / quantile('Gamma', 1-(&p), &alpha, 1/(&beta)))
%mend;
 
data IGammaCDF;
array a[4] _temporary_ (1 2 3 3);
array b[4] _temporary_ (1 1 1 0.5);
do i = 1 to dim(a);
   alpha = a[i];
   beta  = b[i];
   group = cat('a=',alpha,', b=',beta);
   do x = 0.05 to 3 by 0.01;
      CDF = %CDFIGamma(x, alpha, beta); 
      Quantile = %QuantileIGamma(CDF, alpha, beta); /* x - Quantile is approx 0 */
      output;
   end;
end;
run;
 
title "CDF of the Inverse Gamma Distribution";
proc sgplot data=IGammaCDF;
   series x=x y=CDF / group=group lineattrs=(thickness=2);
   xaxis grid; yaxis grid;
run;
 
title "Quantiles of the Inverse Gamma Distribution";
proc sgplot data=IGammaCDF;
   series x=CDF y=Quantile / group=group lineattrs=(thickness=2);
   xaxis grid label="Probability"; yaxis grid;
run;

Summary

This article shows how to compute the four essentials functions of the inverse gamma distribution: random variates, PDF, CDF, and quantiles. In every case, you can exploit the relationship between the gamma and inverse gamma distributions. The result is that you can use the RAND, PDF, CDF, and QUANTILE functions for the gamma distribution to evaluate similar quantities for the inverse gamma distribution. A contribution of this article is to write down the four formulas in one place and show how to compute them in SAS.

The post The inverse gamma distribution in SAS appeared first on The DO Loop.

January 25, 2021
 

Years ago, I wrote about how to compute the incomplete beta function in SAS. Recently, a SAS programmer asked about a similar function, called the incomplete gamma function. The incomplete gamma function is a "special function" that arises in applied math, physics, and statistics. You should not confuse the gamma function with the gamma distribution in probability, although they are related, as we will soon see.

There are two kinds of incomplete gamma functions: the "upper" and the "lower." This article shows how to compute the upper and lower incomplete gamma functions in SAS. A graph of the upper incomplete gamma function is shown to the right for several values of the parameter, α. A graph of the lower incomplete gamma function is shown later in this article.

The gamma function

Before discussing the incomplete gamma function, let's review the "complete" gamma function, which is usually called THE gamma function. The gamma function is defined by the following integral:
\(\Gamma(\alpha) = \int_0^{\infty} t^{\alpha-1} e^{-t} \, dt\)

Notice that the parameter to the function is α. The integration variable, t, is integrated out and does not appear in the answer. The integral is "complete" because the bounds of integration are the complete positive real line, (0, ∞).

The gamma function generalizes the factorial operation to include positive real numbers. In particular, Γ(1) = 1 and Γ(x+1) = x Γ(x) for any positive real number, x. If n is a positive integer, then Γ(n) = (n-1)!.

In SAS, you can use the GAMMA function to evaluate the (complete) gamma function at any positive real number:

/* The (complete) gamma function */
data Gamma;
do x = 0.5 to 3 by 0.5;
   gamma = gamma(x);
   output;
end;
run;
 
proc print; run;

The upper and lower incomplete gamma functions

An incomplete gamma function replaces the upper or lower limit of integration in the integral that defines the complete gamma function. If you replace the upper limit of integration (∞) by x, you get the lower incomplete gamma function:
\(p(\alpha, x) = \int_0^{x} t^{\alpha-1} e^{-t} \, dt\)
Here "lower" refers to integrating only the "left side", which means values of t < x.

If you replace the lower limit of integration (0) by x, you get the upper incomplete gamma function:
\(q(\alpha, x) = \int_x^{\infty} t^{\alpha-1} e^{-t} \, dt\)
Here "upper" refers to integrating only the "right side", which means values of t > x.

Notice that p(α, x) + q(α, x) = Γ(α) for all values of x. Furthermore, p(α, 0) = 0 and q(α, 0) = Γ(α).

Compute the incomplete gamma function in SAS

The lower incomplete gamma function looks a LOT like the definition of the cumulative distribution function (CDF) of the gamma distribution. The only difference is that the maximum value of a CDF is unity whereas the maximum value of the lower incomplete gamma function is Γ(α). Thus, Γ(α) is a constant scale factor that relates the lower incomplete gamma function and the CDF of the gamma distribution.

Specifically, the CDF function for the gamma distribution with shape parameter α and scale parameter 1 is
\(\mbox{CDF}('\mbox{Gamma}', x, \alpha) = \frac{1}{\Gamma(\alpha)} \int_0^x t^{\alpha-1} e^{-t} \, dt\)
Equivalently,
\(p(\alpha, x) = \Gamma(\alpha) * \mbox{CDF}('\mbox{Gamma}', x, \alpha)\)

In the same way, recall that the complementary CDF (sometimes called the survival function or the reliability function) is the quantity SDF(x) = 1 - CDF(x). The SDF of the gamma distribution is a scaled version of the upper incomplete gamma function, as follows:
\(q(\alpha, x) = \Gamma(\alpha) * \mbox{SDF}('\mbox{Gamma}', x, \alpha)\)

Consequently, you can compute the lower or upper incomplete gamma function in SAS by using the following macros. The SAS DATA step shows how to evaluate the functions on the domain (0, 4] for three values of the parameter, α = 1, 2, and 3. The graphs of the lower and upper incomplete gamma functions are shown after the program:

%macro LowerIncGamma(x, alpha); /* lower incomplete gamma function */
   (Gamma(&alpha) * CDF('Gamma', &x, &alpha))
%mend;
%macro UpperIncGamma(x, alpha); /* upper incomplete gamma function */
   (Gamma(&alpha) * SDF('Gamma', &x, &alpha))
%mend;
data IncGamma;
do alpha = 1 to 3;
   do x = 0.01 to 4 by 0.01;
      LowerIncGamma = %LowerIncGamma(x, alpha);
      UpperIncGamma = %UpperIncGamma(x, alpha);
      output;
   end;
end;
run;
 
title "Lower Incomplete Gamma Function";
proc sgplot data=IncGamma;
   series x=x y=LowerIncGamma / group=alpha;
   xaxis grid; yaxis grid;
run;
title "Upper Incomplete Gamma Function";
proc sgplot data=IncGamma;
   series x=x y=UpperIncGamma / group=alpha;
   xaxis grid; yaxis grid;
run;
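As a sanity check (my own, not in the original post), you can verify the identity p(α, x) + q(α, x) = Γ(α) at an arbitrary point:

/* verify p(alpha,x) + q(alpha,x) = Gamma(alpha) */
data _null_;
   alpha = 2.5;  x = 1.7;
   total = %LowerIncGamma(x, alpha) + %UpperIncGamma(x, alpha);
   diff  = total - Gamma(alpha);   /* should be ~0, up to rounding error */
   put diff=;
run;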

The graph of the upper incomplete gamma function is shown at the top of this article. The following graph shows the lower incomplete gamma function for several values of the α parameter.

By the way, some fields of study will use the term "regularized" incomplete gamma function. The regularized lower incomplete gamma function is simply the CDF of the gamma distribution. The regularized upper incomplete gamma function is the SDF of the gamma distribution.

Summary

As is sometimes the case, different areas of application use different names for essentially the same quantity. Although the SAS language does not provide a built-in function that evaluates the incomplete gamma function, it is easy to evaluate by using the GAMMA and CDF functions, as follows:

  • The complete gamma function, Γ(α), is computed by using the GAMMA function.
  • The lower/upper incomplete gamma function is a scaled version of the CDF and SDF (respectively) of the gamma distribution:
    • The lower incomplete gamma function is p(alpha,x) = GAMMA(alpha)*CDF('Gamma',x,alpha);
    • The upper incomplete gamma function is q(alpha,x) = GAMMA(alpha)*SDF('Gamma',x,alpha);
  • The lower/upper regularized incomplete gamma function is another name for the CDF and SDF, respectively.
    • The regularized lower incomplete gamma function is computed as CDF('Gamma',x,alpha);
    • The regularized upper incomplete gamma function is computed as SDF('Gamma',x,alpha);

The post How to compute the incomplete gamma function in SAS appeared first on The DO Loop.

January 20, 2021
 

This is the third and last introductory article about how to bootstrap time series in SAS. In the first article, I presented the simple block bootstrap and discussed why bootstrapping a time series is more complicated than for regression models that assume independent errors. Briefly, when you perform residual resampling on a time series, you need to resample in blocks to simulate the autocorrelation in the original series. In the second article, I introduced the moving-block bootstrap. For both methods, all blocks are the same size, and the block size must evenly divide the length of the series (n).

In contrast, the stationary block bootstrap uses blocks of random lengths. This article describes the stationary block bootstrap and shows how to implement it in SAS.

The "stationary" bootstrap gets its name from Politis and Romano (1992), who developed it. When applied to stationary data, the resampled pseudo-time series is also stationary.

Choosing the length of a block

In the moving-block bootstrap, the starting location for a block is chosen randomly, but all blocks have the same length. For the stationary block bootstrap, both the starting location and the length of the block are chosen at random.

The block length is typically drawn from a geometric distribution with parameter p. If you want the average block length to be L, define p = 1/L because the expected value of a geometric distribution with parameter p is 1/p.

Why do we choose lengths from the geometric distribution? Well, because a length from the geometric distribution is easy to interpret in terms of a discrete process. Imagine performing the following process:

  • Choose an initial location, t, at random.
  • With probability p, choose a different location (at random) for the next residual and start a new block.
  • With probability (1-p), choose the location t+1 for the next residual and add that residual to the current block.
By definition, this process generates blocks whose lengths are geometrically distributed: L ~ Geom(p).

Generate a random block

There are several ways to implement the stationary block bootstrap in SAS. A straightforward method is to generate a starting integer, t, uniformly at random in the range [1, n], where n is the length of the series. Then, choose a length, L ~ Geom(p). If t + L exceeds n, truncate the block. (Some practitioners "wrap" the block to include residuals at the beginning of the series. This alternative method is called the circular block bootstrap.) Continue this process until the total length of the blocks is n. You will usually need to truncate the length of the last block so that the sum of the lengths is exactly n. This process is implemented in the following SAS/IML user-defined functions:

/* STATIONARY BLOCK BOOTSTRAP */
/* Use a random starting point, t, and a random block length, L ~ Geom(p), 
   - probability p of starting a new block at a new position
   - probability (1-p) of using the next residual from the current block
*/
%let p = (1/12);                /* expected block length is 1/p */ 
proc iml;
/* generate indices for a block of random length L~Geom(p)
   chosen from the sequence 1:n */
start RandBlockIdx(n, p);
   beginIdx = randfun(1, "Integer", n); /* random start: 1:n */
   L = randfun(1, "Geom", p);           /* random length */
   endIdx = min(beginIdx+L-1, n);       /* truncate into 1:n */
   return( beginIdx:endIdx );           /* return sequence of indices within 1:n */
finish;
 
/* generate indices for one stationary bootstrap resample */
start StationaryIdx(n, p);
   bootIdx = j(n,1,.);
   s = 1;                       /* starting index for first block */
   do while( s <= n );
      idx = RandBlockIdx(n, p); /* get block of random length */
      if ncol(idx) > n-s then   /* truncate if block too long */
         idx = idx[, 1:n-s+1];
      L = ncol(idx);            /* length of this block */
      jdx = s:s+L-1;            /* all indices for this block */
      bootIdx[jdx] = idx;       /* assign this block */
      s = s + L;                /* starting index for next block */
   end;
   return bootIdx;
finish;

Let's run these functions on some sample data. The first article in this series shows how to fit a model to the Sashelp.Air data and write the predicted and residual values. The following statements use the stationary block bootstrap to generate one bootstrap sample of length n from the residuals. When you add the bootstrapped residuals to the predicted values, you get a "new" time series, which is graphed:

use OutReg;                     /* data to bootstrap: Y = Pred + Resid */
   read all var {'Time' 'Pred' 'Resid'};
close;
 
call randseed(1234);
n = nrow(Pred);                 /* length of series */
prob = &p;                      /* probability of jumping to a new block */
 
/*---- Visualization Example: generate one bootstrap resample ----*/
idx = StationaryIdx(n, prob);
YBoot = Pred + Resid[idx];
 
/* visualize the places where a sequence is not ascending:
https://blogs.sas.com/content/iml/2013/08/28/finite-diff-estimate-maxi.html */
blocksIdx = loc(dif(idx) <= 0); /* where is the sequence not ascending? */
b = Time[blocksIdx];
 
title "One Bootstrap Resample";
title2 "Stationary Block Bootstrap";
refs = "refline " + char(b) + " / axis=x;";
call series(Time, YBoot) other=refs;
/*---- End of visualization example ----*/

The graph shows a single bootstrap sample. I have added vertical reference lines to show the blocks of residuals. For this random sample, there are nine blocks whose lengths vary between 2 (the last block) and 28. If you choose another random sample, you will get a different set of blocks that have different lengths.

The stationary block bootstrap

In practice, the bootstrap method requires many bootstrap resamples. You can call the StationaryIdx function repeatedly in a loop to generate B=1000 bootstrap samples, as follows:

/* A complete stationary block bootstrap generates B resamples
   and usually writes the resamples to a SAS data set. */
B = 1000;
SampleID = j(n,1,.);
create BootOut var {'SampleID' 'Time' 'YBoot'};  /* create outside of loop */
do i = 1 to B;
   SampleId[,] = i;   /* fill array. See https://blogs.sas.com/content/iml/2013/02/18/empty-subscript.html */
   idx = StationaryIdx(n, prob);
   YBoot = Pred + Resid[idx];
   append;
end;
close BootOut;

The BootOut data set contains all the bootstrap samples. You can use a BY-group analysis to obtain the bootstrap estimates for almost any statistic. For example, the following scatter plot shows the distribution of 1000 bootstrap estimates of the slope and intercept parameters for an autoregressive model of the series. Reference lines indicate the estimates for the original data.
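For example, a hedged sketch of the BY-group step might look like the following. (The MODEL statement is an assumption on my part; it should match the model that was fit to the original series.)

/* fit an AR(1) model to each bootstrap sample; estimates go to BootEst */
proc autoreg data=BootOut outest=BootEst noprint;
   by SampleID;
   model YBoot = Time / nlag=1;
run;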

You can download the complete SAS program that performs all computations and creates all graphs in this article.

Summary

This article is the third of a series. It shows how to perform a stationary block bootstrap by randomly sampling blocks of residuals. Whereas the simple-block and the moving-block bootstrap methods use blocks that have a constant length, the stationary block bootstrap chooses blocks of random lengths.

In researching these blog posts, I read several articles. I recommend Bates (2019) for an introduction and Politis and Romano (1994) for details.

The post The stationary block bootstrap in SAS appeared first on The DO Loop.

January 13, 2021
 

As I discussed in a previous article, the simple block bootstrap is a way to perform a bootstrap analysis on a time series. The first step is to decompose the series into additive components: Y = Predicted + Residuals. You then choose a block length (L) that divides the total length of the series (n). Each bootstrap resample is generated by randomly choosing from among the non-overlapping n/L blocks of residuals, which are added to the predicted model.

The simple block bootstrap is not often used in practice. One reason is that the total number of blocks (k=n/L) is often small. If so, the bootstrap resamples do not capture enough variation for the bootstrap method to make correct inferences. This article describes a better alternative: the moving block bootstrap. In the moving block bootstrap, every block has the same block length but the blocks overlap. The following figure illustrates the overlapping blocks when L=3. The indices 1:L define the first block of residuals, the indices 2:L+1 define the second block, and so forth until the last block, which contains the residuals n-L+1:n.
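For example, if n=6 and L=3, there are n-L+1 = 4 overlapping blocks of indices: 1:3, 2:4, 3:5, and 4:6.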

To form a bootstrap resample, you randomly choose k=n/L blocks (with replacement) and concatenate them. You then add these residuals to the predicted values to create a "new" time series. Repeat the process many times and you have constructed a batch of bootstrap resamples. The process of forming one bootstrap sample is illustrated in the following figure. In the figure, the time series has been reshaped into a k x L matrix, where each row is a block.

The moving block bootstrap in SAS

To demonstrate the moving block bootstrap in SAS, let's use the same data that I analyzed in the previous article about the simple block bootstrap. The previous article extracted 132 observations from the Sashelp.Air data set and used PROC AUTOREG to form an additive model Predicted + Residuals. The OutReg data set contains three variables of interest: Time, Pred, and Resid.

As before, I will choose the block size to be L=12. The following SAS/IML program reads the data and defines a matrix (R) such that the i-th row contains the residuals with indices i:i+L-1. In total, the matrix R has n-L+1 rows.

/* MOVING BLOCK BOOTSTRAP */
%let L = 12;
proc iml;
call randseed(12345);
 
use OutReg;
   read all var {'Time' 'Pred' 'Resid'};
close;
 
/* Restriction for the Moving Block Bootstrap: 
   Each resample concatenates k = n/L blocks of length L,
   so the length of the series (n) must be divisible by the block length (L) */
n = nrow(Pred);          /* length of series */
L = &L;                  /* length of each block */
k = n / L;               /* number of random blocks to use */
if k ^= int(k) then 
   ABORT "The series length is not divisible by the block length";
 
/* Trick: Reshape data into k x L matrix. Each row is block of length L */
P = shape(Pred, k, L);   /* there are k rows for Pred */
J = n - L + 1;           /* total number of overlapping blocks to choose from */
R = j(J, L, .);          /* there are n-L+1 blocks of residuals */
Resid = rowvec(Resid);   /* make Resid into row vector so we don't need to transpose each row */
do i = 1 to J;
   R[i,] = Resid[ , i:i+L-1]; /* fill each row with a block of residuals */
end;

With this setup, the formation of bootstrap resamples is almost identical to the program in the previous article. The only difference is that the matrix R for the moving block bootstrap has more rows. Nevertheless, each resample is formed by randomly choosing k rows from R and adding them to a block of predicted values. The following statements generate B=1000 bootstrap resamples, which are written to a SAS data set (BootOut). The program writes the Time variable, the resampled series (YBoot), and an ID variable that identifies each bootstrap sample.

/* The moving block bootstrap repeats this process B times
   and usually writes the resamples to a SAS data set. */
B = 1000;
SampleID = j(n,1,.);
create BootOut var {'SampleID' 'Time' 'YBoot'};  /* create outside of loop */
do i = 1 to B;
   SampleId[,] = i;
   idx = sample(1:J, k);  /* sample of size k from the set 1:J */
   YBoot = P + R[idx,];
   append;
end;
close BootOut;
QUIT;

The BootOut data set contains B=1000 bootstrap samples. The rest of the bootstrap analysis is exactly the same as in the previous article.

Summary

This article shows how to perform a moving block bootstrap on a time series in SAS. First, you need to decompose the series into additive components: Y = Predicted + Residuals. You then choose a block length (L), which must divide the total length of the series (n), and form the n-L+1 overlapping blocks of residuals. Each bootstrap resample is generated by randomly choosing blocks of residuals and adding them to the predicted model. This article uses the SAS/IML language to perform the moving block bootstrap in SAS.

You can download the SAS program that shows the complete analysis for both the simple block and moving block bootstrap.

The post The moving block bootstrap for time series appeared first on The DO Loop.

January 11, 2021
 

On The DO Loop blog, I write about a diverse set of topics, including statistical data analysis, machine learning, statistical programming, data visualization, simulation, numerical analysis, and matrix computations. In a previous article, I presented some of my most popular blog posts from 2020. The most popular articles often deal with elementary or familiar topics that are useful to almost every data analyst.

However, among last year's 100+ articles are many that discuss advanced topics. Did you make a New Year's resolution to learn something new this year? Here is your chance! The following articles were fun to write and deserve a second look.

Machine learning concepts

Relationship between a threshold value and true/false negatives and positives

Statistical smoothers

Bilinear interpolation of 12 data values

I write a lot about scatter plot smoothers, which are typically parametric or nonparametric regression models. But a SAS customer wanted to know how to get SAS to perform various classical interpolation schemes such as linear and cubic interpolations:

SAS Viya and parallel computing

SAS is devoting tremendous resources to SAS Viya, which offers a modern analytic platform that runs in the cloud. One of the advantages of SAS Viya is the opportunity to take advantage of distributed computational resources. In 2020, I wrote a series of articles that demonstrate how to use the iml action in Viya 3.5 to implement custom parallel algorithms that use multiple nodes and threads on a cluster of machines. Whereas many actions in SAS Viya perform one and only one task, the iml action supports a general framework for custom, user-written, parallel computations:

The map-reduce functionality in the iml action

  • The map-reduce paradigm is a two-step process for distributing a computation. Every thread runs a function and produces a result for the data that it sees. The results are aggregated and returned. The iml action supports the MAPREDUCE function, which implements the map-reduce paradigm.
  • The parallel-tasks paradigm is a way to run independent computations concurrently. The iml action supports the PARTASKS function, which implements the parallel-tasks paradigm.

Simulation and visualization

Decomposition of a convex polygon into triangles

Generate random points in a polygon

Your turn

Did I omit one of your favorite blog posts from The DO Loop in 2020? If so, leave a comment and tell me what topic you found interesting or useful. And if you missed some of these articles when they were first published, consider subscribing to The DO Loop in 2021.

The post Blog posts from 2020 that deserve a second look appeared first on The DO Loop.

December 21, 2020
 

When you perform a linear regression, you can examine the R-square value, which is a goodness-of-fit statistic that indicates how well the response variable can be represented as a linear combination of the explanatory variables. But did you know that you can also go the other direction? Given a set of explanatory variables and an R-square statistic, you can create a response variable, Y, such that a linear regression of Y on the explanatory variables produces exactly that R-square value.

The geometry of correlation and least-squares regression

In a previous article, I showed how to compute a vector that has a specified correlation with another vector. You can generalize that situation to obtain a vector that has a specified relationship with a linear subspace that is spanned by multiple vectors.

Recall that the correlation is related to the angle between two vectors by the formula cos(θ) = ρ, where θ is the angle between the vectors and ρ is the correlation coefficient. Therefore, correlation and "angle between" measure similar quantities. It makes sense to define the angle between a vector and a linear subspace as the smallest angle the vector makes with any vector in the subspace. Equivalently, it is the angle between the vector and its (orthogonal) projection onto the subspace.

This is shown graphically in the following figure. The vector z is not in the span of the explanatory variables. The vector w is the projection of z onto the linear subspace. As explained in the previous article, you can find a vector y such that the angle between y and w is θ, where cos(θ) = ρ. Equivalently, the correlation between y and w is ρ.

Correlation between a response vector and a predicted vector

There is a connection between this geometry and the geometry of least-squares regression. In least-squares regression, the predicted response is the projection of an observed response vector onto the span of the explanatory variables. Consequently, the previous article shows how to simulate an "observed" response vector that has a specified correlation with the predicted response.

For simple linear regression (one explanatory variable), textbooks often point out that the R-square statistic is the square of the correlation between the independent variable, X, and the response variable, Y. So, the previous article enables you to create a response variable that has a specified R-square value with one explanatory variable.

The generalization to multivariate linear regression is that the R-square statistic is the square of the correlation between the predicted response and the observed response. Therefore, you can use the technique in this article to create a response variable that has a specified R-square value in a linear regression model.

To be explicit, suppose you are given explanatory variables X1, X2, ..., Xk, and a correlation coefficient, ρ. The following steps generate a response variable, Y, such that the R-square statistic for the regression of Y onto the explanatory variables is ρ²:

  1. Start with an arbitrary guess, z.
  2. Use least-squares regression to find w = \(\hat{\mathbf{z}}\), which is the projection of z onto the subspace spanned by the explanatory variables and the 1 vector.
  3. Use the technique in the previous article to find Y such that corr(Y, w) = ρ.

Create a response variable that has a specified R-square value in SAS

The following program shows how to carry out this algorithm in the SAS/IML language:

proc iml;
/* Define or load the modules from 
   https://blogs.sas.com/content/iml/2020/12/17/generate-correlated-vector.html
*/
load module=_all_;
/* read some data X1, X2, ... into columns of a matrix, X */
use sashelp.class;
   read all var {"Height" "Weight" "Age"} into X;  /* read data into (X1,X2,X3) */
close;
 
/* Least-squares fit = Project Y onto span(1,X1,X2,...,Xk) */
start OLSPred(y, _x);
   X = j(nrow(_x), 1, 1) || _x;
   b = solve(X`*X, X`*y);
   yhat = X*b;
   return yhat;
finish;
 
/* specify the desired correlation between Y and \hat{Y}. Equiv: R-square = rho^2 */
rho = 0.543;        
 
call randseed(123);
guess = randfun(nrow(X), "Normal");   /* 1. make random guess */
w = OLSPred(guess, X);                /* 2. w is in Span(1,X1,X2,...) */
Y = CorrVec1(w, rho, guess);          /* 3. Find Y such that corr(Y,w) = rho   */
/* optional: you can scale Y any way you want ... */
/* in regression, R-square is squared correlation between Y and YHat */
corr = corr(Y||w)[2];
R2 = rho**2;                          
PRINT rho corr R2;

The program uses a random guess to generate a vector Y such that the correlation between Y and the least-squares prediction for Y is exactly 0.543. In other words, if you run a regression model where Y is the response and (X1, X2, X3) are the explanatory variables, the R-square statistic for the model will be ρ² = 0.2948. Let's write the Y variable to a SAS data set and run PROC REG to verify this fact:

/* Write to a data set, then call PROC REG */
Z = Y || X;
create SimCorr from Z[c={Y X1 X2 X3}];
append from Z;
close;
QUIT;
 
proc reg data=SimCorr plots=none;
   model Y = X1 X2 X3;
   ods select FitStatistics ParameterEstimates;
quit;

The "FitStatistics" table that is created by using PROC REG verifies that the R-square statistic is 0.2948, which is the square of the ρ value that was specified in the SAS/IML program. The ParameterEstimates table from PROC REG shows the vector in the subspace that has correlation ρ with Y. It is -1.26382 + 0.04910*X1 - 0.00197*X2 - 0.12016 *X3.

Summary

Many textbooks point out that the R-square statistic in multivariable regression has a geometric interpretation: It is the squared correlation between the response vector and the projection of that vector onto the linear subspace of the explanatory variables (which is the predicted response vector). You can use the program in this article to solve the inverse problem: Given a set of explanatory variables and correlation, you can find a response variable for which the R-square statistic is exactly the squared correlation.

You can download the SAS program that computes the results in this article.

The post Create a response variable that has a specified R-square value appeared first on The DO Loop.

December 17, 2020
 

Do you know that you can create a vector that has a specific correlation with another vector? That is, given a vector, x, and a correlation coefficient, ρ, you can find a vector, y, such that corr(x, y) = ρ. The vectors x and y can have an arbitrary number of elements, n > 2. One application of this technique is to create a scatter plot that shows correlated data for any correlation in the interval (-1, 1). For example, you can create a scatter plot with n points for which the correlation is exactly a specified value, as shown at the end of this article.

The algorithm combines a mixture of statistics and basic linear algebra. The following facts are useful:

  • Statistical correlation is based on centered and normalized vectors. When you center a vector, it usually changes the direction of the vector. Therefore, the calculations use centered vectors.
  • Correlation is related to the angle between the centered vectors. If the angle is θ, the correlation between the vectors is cos(θ). (A quick numeric check appears after this list.)
  • Projection is the key to finding a vector that has a specified correlation. In linear algebra, the projection of a vector w onto a unit vector u is given by the expression (w`u)*u.
  • Affine transformations do not affect correlation. For any real number, α, and for any β > 0, the vector α + β y has the same correlation with x as y does. For simplicity, the SAS program in this article returns a centered unit vector. You can scale and translate the vector to obtain other solutions.
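
The following statements give a quick numeric check of the cos(θ) = ρ fact. This is a minimal sketch, not part of the original program: it centers two small vectors, computes the cosine of the angle between them, and compares it with the Pearson correlation:

proc iml;
x = {1,2,3,4,5};
y = {2,1,4,3,5};
cx = x - mean(x);                          /* center each vector */
cy = y - mean(y);
cosTheta = cx`*cy / (norm(cx)*norm(cy));   /* cosine of the angle between centered vectors */
rho = corr(x||y)[1,2];                     /* Pearson correlation between x and y */
print cosTheta rho;                        /* both values are 0.8 */
QUIT;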

The geometry of a correlated vector

Given a centered vector, u, there are infinitely many vectors that have correlation ρ with u. Geometrically, you can choose any vector on a positive cone in the same direction as u, where the cone has angle θ and cos(θ) = ρ. This is shown graphically in the figure below. The plane marked \(\mathbf{u}^{\perp}\) is the orthogonal complement of the vector u. If you extend the cone through the plane, you obtain the cone of vectors that are negatively correlated with x.

One way to obtain a correlated vector is to start with a guess, z. The vector z can be uniquely represented as the sum \(\mathbf{z} = \mathbf{w} + \mathbf{w}^{\perp}\), where w is the projection of z onto the span of u, and \(\mathbf{w}^{\perp}\) is the projection of z onto the orthogonal complement.

The following figure shows the geometry of the right triangle with angle θ such that cos(θ) = ρ. If you want the vector y to be unit length, you can read off the formula for y from the figure. The formula is
\(\mathbf{y} = \rho \mathbf{w} / \lVert\mathbf{w}\rVert + \sqrt{1 - \rho^2} \mathbf{w}^\perp / \lVert\mathbf{w}^\perp\rVert \)
In the figure, \(\mathbf{v}_1 = \mathbf{w} / \lVert\mathbf{w}\rVert\) and \(\mathbf{v}_2 = \mathbf{w}^\perp / \lVert\mathbf{w}^\perp\rVert\).

Compute a correlated vector

It is straightforward to implement this projection in a matrix-vector language such as SAS/IML. The following program defines two helper functions (Center and UnitVec) and uses them to implement the projection algorithm. The function CorrVec1 takes three arguments: the vector x, a correlation coefficient ρ, and an initial guess. The function centers and scales the vectors into the vectors u and z. The vector z is projected onto the span of u. Finally, the function uses trigonometry and the fact that cos(θ) = ρ to return a unit vector that has the required correlation with x.

/* Given a vector, x, and a correlation, rho, find y such that corr(x,y) = rho */
proc iml;
/* center a column vector by subtracting its mean */
start Center(v);
   return ( v - mean(v) );
finish;
/* create a unit vector in the direction of a column vector */
start UnitVec(v);
   return ( v / norm(v) );
finish;
 
/* Find a vector, y, such that corr(x,y) = rho. The initial guess can be almost 
   any vector that is not in span(x), orthog to span(x), and not in span(1) */
start CorrVec1(x, rho, guess);
   /* 1. Center the x and z vectors. Scale them to unit length. */
   u = UnitVec( Center(x)     );
   z = UnitVec( Center(guess) );
 
   /* 2. Project z onto the span(u) and the orthog complement of span(u) */
   w = (z`*u) * u;
   wPerp = z - w;       
 
   /* 3. The requirement that cos(theta)=rho results in a right triangle 
         where y (the hypotenuse) has unit length and the legs
         have lengths rho and sqrt(1-rho^2), respectively */
   v1 = rho * UnitVec(w);
   v2 = sqrt(1 - rho**2) * UnitVec(wPerp);
   y = v1 + v2;
 
   /* 4. Check the sign of y`*u. Flip the sign of y, if necessary */
   if sign(y`*u) ^= sign(rho) then 
      y = -y;
   return ( y );
finish;

The purpose of the function is to project the guess onto the green cone in the figure. However, if the guess is in the opposite direction from x, the algorithm will compute a vector, y, that has the opposite correlation. The function detects this case and flips y, if necessary.

The following statements call the function for a vector, x, and requests a unit vector that has correlation ρ = 0.543 with x:

/* Example: Call the CorrVec1 function */
x = {1,2,3};
rho = 0.543;
guess = {0, 1, -1};  
y = CorrVec1(x, rho, guess);
corr = corr(x||y);
print x y, corr;

As requested, the correlation coefficient between x and y is 0.543. This process will work provided that the guess satisfies a few mild assumptions. Specifically, the guess cannot be in the span of x or in the orthogonal complement of x. The guess also cannot be a multiple of the 1 vector. Subject to these assumptions, the process works for both positive and negative correlations.

The function returns a vector that has unit length and 0 mean. However, you can translate the vector and scale it by any positive quantity without changing its correlation with x, as shown by the following example:

/* because correlation is a relationship between standardized vectors, 
   you can translate and scale Y any way you want */
y2 = 100 + 23*y;      /* rescale and translate */
corr = corr(x||y2);   /* the correlation will not change */
print corr;

When y is a centered unit vector, the vector β*y has L2 norm β. If you want to create a vector whose standard deviation is β, use β*sqrt(n-1)*y, where n is the number of elements in y.
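
As a quick check (continuing the example above, where y is the centered unit vector returned by CorrVec1), you can verify the scaling rule directly. The name v3 is a hypothetical name chosen for this sketch:

beta = 23;
n = nrow(y);
v3 = beta * sqrt(n-1) * y;   /* scale the centered unit vector y */
sd = std(v3);                /* sample standard deviation of v3 */
print sd;                    /* equals beta */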

Random vectors with a given correlation

One application of this technique is to create a random vector that has a specified correlation with a given vector, x. For example, in the following program, the x vector contains the heights of 19 students in the Sashelp.Class data set. The program generates a random guess from the standard normal distribution and passes that guess to the CorrVec1 function and requests a vector that has the correlation 0.678 with x. The result is a centered unit vector.

use sashelp.class;
   read all var {"Height"} into X;
close;
 
rho = 0.678;
call randseed(123);
guess = randfun(nrow(x), "Normal");
y = CorrVec1(x, rho, guess);
 
mean = 100;
std = 23*sqrt(nrow(x)-1);
v = mean + std*y;
title "Correlation = 0.678";
title2 "Random Normal Vector";
call scatter(X, v) grid={x y};

[Figure: A random vector that has a given correlation with a given vector, computed in SAS]

The graph shows a scatter plot between x and the random vector, v. The correlation in the scatter plot is 0.678. The sample mean of the vector v is 100. The sample standard deviation is 23.

If you make a second call to the RANDFUN function, you can get another random vector that has the same properties. Or you can repeat the process for a range of ρ values to visualize data that have a range of correlations. For example, the following graph shows a panel of scatter plots for ρ = -0.75, -0.25, 0.25, and 0.75. The X variable is the same for each plot. The Y variable is a random vector that was rescaled to have mean 100 and standard deviation 23, as above.
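
The panel can be created by calling CorrVec1 in a loop and writing the results in long form. The following sketch shows one way to do it; the CorrPanel data set name and the use of PROC SGPANEL are my choices for illustration, not necessarily how the original graph was produced:

/* continue the PROC IML session from the previous example */
rhoList = {-0.75, -0.25, 0.25, 0.75};
create CorrPanel var {"rho" "Height" "v"};   /* long-form data for plotting */
do i = 1 to nrow(rhoList);
   r = rhoList[i];
   guess = randfun(nrow(X), "Normal");       /* new random guess for each rho */
   y = CorrVec1(X, r, guess);
   v = 100 + 23*sqrt(nrow(X)-1)*y;           /* rescale: mean 100, std dev 23 */
   rho = j(nrow(X), 1, r);                   /* replicate rho for each observation */
   Height = X;
   append;
end;
close CorrPanel;
QUIT;
 
proc sgpanel data=CorrPanel;
   panelby rho / columns=2;
   scatter x=Height y=v;
run;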

[Figure: Random vectors that have a given correlation with a given vector, computed in SAS]

The random guess does not need to be from the normal distribution. You can use any distribution.
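
For example, in a PROC IML session in which the CorrVec1 module and the Height data are defined (as above), the following sketch uses a uniform random guess instead of a normal one:

guess = randfun(nrow(X), "Uniform");   /* any distribution that satisfies the mild assumptions */
y = CorrVec1(X, rho, guess);
corr = corr(X||y)[1,2];
print corr;                            /* the correlation is still 0.678 */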

Summary

This article shows how to create a vector that has a specified correlation with a given vector. That is, given a vector, x, and a correlation coefficient, ρ, find a vector, y, such that corr(x, y) = ρ. The algorithm in this article produces a centered vector that has unit length. You can multiply the vector by β > 0 to obtain a vector whose norm is β. You can multiply the vector by β*sqrt(n-1) to obtain a vector whose standard deviation is β.

There are infinitely many vectors that have correlation ρ with x. The algorithm uses a guess to produce a particular vector for y. You can use a random guess to obtain a random vector that has a specified correlation with x.

The post Find a vector that has a specified correlation with another vector appeared first on The DO Loop.