rick wicklin

October 22, 2018
 

A SAS programmer asked how to rearrange elements of a matrix. The rearrangement he wanted was rather complicated: certain blocks of data needed to move relative to other blocks, but the values within each block were to remain unchanged. It turned out that the mathematical operation he needed is called a block transpose. The BTRAN function in SAS/IML performs block-transpose operations, so the complicated rearrangement was easy to implement with a judicious call to the BTRAN function.

This article discusses the block-transpose operation and gives an example. It also shows how a block transpose can conveniently transform wide data into long data and vice versa.

The block-transpose operation

In general, a block matrix can contain blocks of various sizes. However, the BTRAN function requires that all blocks be the same size. Specifically, suppose A is a block matrix where each block is an n x m matrix. That means that A is an (r n) x (c m) matrix for some whole numbers r and c. The BTRAN function transposes the blocks to create a (c n) x (r m) block matrix.

The idea is more easily explained by a picture. The following image shows a matrix A that is composed of six blocks (of the same size). The block-transpose operation rearranges the blocks but leaves the elements within the blocks unchanged.

A block-transpose operation transposes the blocks but leaves the contents of each block unchanged

The BTRAN function in SAS/IML

Let's see how the block transpose works on an example in the SAS/IML language. The following statements create a 4 x 12 matrix. I have overlaid some grid lines to help you visualize this matrix as a 2 x 3 block matrix, where each block is 2 x 4.

proc iml;
x = repeat(1:12,2);
x = x // (10+x);
print x;
A 4x12 matrix visualized as a 2x3 block matrix. Each block is 2x4.

You can use the BTRAN function to apply a block transpose. The first argument is the matrix that contains the data. The second and third arguments specify the size of the blocks:

y = btran(x, 2, 4);    /* block size is 2x4 */
print y;
A 6x8 matrix visualized as a 3x2 block matrix. Each block is 2x4.

The output is a 6 x 8 matrix, visualized as a 3 x 2 block matrix.
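As a quick sanity check, applying the BTRAN function a second time with the same block size should restore the original matrix, because the block transpose merely swaps the (i,j) and (j,i) blocks. A minimal sketch:

z = btran(y, 2, 4);     /* block-transpose the result, using the same 2x4 block size */
isSame = all(z = x);    /* 1 if z equals the original 4x12 matrix */
print isSame;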

Block transpose for wide-to-long transforms

If the number of rows in the block equals the number of rows in the data, then the BTRAN function stacks columns. In particular, if the block size is n x 1, the BTRAN function stacks multiple variables into a single column, just like the SHAPECOL function. For example, the Sashelp.Iris data contains four variables named SepalLength, SepalWidth, PetalLength, and PetalWidth. The following SAS/IML statements stack the four variables into a single column and create a new column that identifies the name of the original variable:

proc iml;
/* wide to long */
varNames = {"SepalLength" "SepalWidth" "PetalLength" "PetalWidth"};
use Sashelp.Iris;
   read all var "Species";        /* 150 x 1 */
   read all var varNames into X;  /* 150 x 4 */
close;
n = nrow(X); m = ncol(X);         /* n = 150; m = 4 */
 
ID = repeat(Species, m);          /* repeat the vars that are not transposed */
Var = shapecol(repeat(varNames, n), n*m); /* create column that contains variable names */
Value = btran(X, n, 1);           /* vector with 150*4 rows */ 
 
create test var {ID "Var" "Value"}; append; close;

Block transpose for long-to-wide transforms

In a similar way, you can transpose balanced data from long form to wide form. (Recall that "balanced data" means that each group has the same number of observations.) For example, the Sashelp.Iris data contains a Species variable. The first 50 observations have the value 'Setosa', the next 50 have the value 'Versicolor', and the last 50 have the value 'Virginica'. You can use a WHERE clause or a BY statement to analyze each species separately, but suppose you want to create a new data set that contains only 50 observations and has variables named SepalLengthSetosa, SepalLengthVersicolor, SepalLengthVirginica, and so forth for the other variables. The BTRAN function makes this easy: just specify a block size of 50 x 4, as follows.

/* create pairwise combinations of var names and group levels */
BlockRows = 50;
w = btran(X, BlockRows, m);                  /* 50 x 12 matrix */
 
s = Species[ do(1, n, BlockRows) ];          /* s = {'Setosa' 'Versicolor' 'Virginica'} */
vNames = rowcatc(expandgrid(varNames, s));   /* combine var names and species */
print vNames, w[c=vNames L="Measurements (cm)"];

In summary, the BTRAN function is a useful function when you need to rearrange blocks of data without changing the values in a block. For balanced designs, you can use the BTRAN function to convert data between wide and long formats.

The post Transpose blocks to reshape data appeared first on The DO Loop.

October 17, 2018
 

In a recent article about nonlinear least squares, I wrote, "you can often fit one model and use the ESTIMATE statement to estimate the parameters in a different parameterization." This article expands on that statement. It shows how to fit a model for one set of parameters and use the ESTIMATE statement to obtain estimates for a different set of parameters.

A canonical example of a probability distribution that has multiple parameterizations is the gamma distribution. You can parameterize the gamma distribution in terms of a scale parameter or in terms of a rate parameter, where the rate parameter is the reciprocal of the scale parameter. The density functions for the gamma distribution with scale parameter β or rate parameter c = 1 / β are as follows:
Shape α, scale β: f(x; α, β) = 1 / (β^α Γ(α)) x^(α-1) exp(-x/β)
Shape α, rate c: f(x; α, c) = c^α / Γ(α) x^(α-1) exp(-c x)
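As a quick numerical check (a sketch that uses arbitrary example values), a short DATA step confirms that the two parameterizations give the same density when c = 1/β; the built-in PDF function uses the shape-scale form:

data _null_;
   x = 0.3;  alpha = 2.5;  beta = 0.1;  c = 1/beta;   /* arbitrary example values */
   f_scale = pdf("gamma", x, alpha, beta);            /* built-in PDF uses the shape-scale form */
   f_rate  = c**alpha / gamma(alpha) * x**(alpha-1) * exp(-c*x);  /* shape-rate form, computed directly */
   put f_scale= f_rate=;                              /* the two values agree */
run;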

You can use PROC NLMIXED to fit either set of parameters and use the ESTIMATE statement to obtain estimates for the other set. In general, you can do this whenever you can explicitly write down an analytical formula that expresses one set of parameters in terms of another set.

Estimating the scale or rate parameter in a gamma distribution

Let's implement this idea on some simulated data. The following SAS DATA step simulates 100 observations from a gamma distribution with shape parameter α = 2.5 and scale parameter β = 1 / 10. A call to PROC UNIVARIATE estimates the parameters from the data and overlays a gamma density on the histogram of the data:

/* The gamma model has two different parameterizations: a rate parameter and a scale parameter.
   Generate gamma-distributed data and fit a model for each parameterization. Use the 
   ESTIMATE stmt to estimate the parameter in the model that was not fit. 
*/
%let N = 100;                                    /* N = sample size */
data Gamma;
aParm = 2.5; scaleParm = 1/10;  rateParm = 1/scaleParm;
do i = 1 to &N;
   x = rand("gamma", aParm, scaleParm); 
   output;
end;
run;
 
proc univariate data=Gamma;
   var x;
   histogram x / gamma;
run;
Distribution of gamma(2.5, 0.1) distributed data, overlaid with density estimate

The parameter estimates are (2.99, 0.09), which are close to the parameter values that were used to generate the data. Notice that PROC UNIVARIATE does not provide standard errors or confidence intervals for these point estimates. However, you can get those statistics from PROC NLMIXED. PROC NLMIXED also supports the ESTIMATE statement, which can estimate the rate parameter:

/*Use PROC NLMIXED to fit the shape and scale parameters in a gamma model */
title "Fit Gamma Model, SCALE Parameter";
proc nlmixed data=Gamma;
   parms a 1 scale 1;             * initial value for parameter;
   bounds 0 < a scale;
   model x ~ gamma(a, scale);
   rate = 1/scale;
   estimate 'rate' rate;
   ods output ParameterEstimates=PE;
   ods select ParameterEstimates ConvergenceStatus AdditionalEstimates;
run;

The parameter estimates are the same as from PROC UNIVARIATE. However, PROC NLMIXED also provides standard errors and confidence intervals for the parameters. More importantly, the ESTIMATE statement enables you to obtain an estimate of the rate parameter without needing to refit the model. The next section explains how the rate parameter is estimated.

New estimates and changing variables

You might wonder how the ESTIMATE statement computes the estimate for the rate parameter from the estimate of the scale parameter. The PROC NLMIXED documentation states that the procedure "computes approximate standard errors for the estimates by using the delta method (Billingsley 1986)." I have not studied the "delta method," but it seems to be a way to approximate a nonlinear transformation by using the first two terms of its Taylor series, otherwise known as a linearization.

In the ESTIMATE statement, you supply a transformation, which is c = 1 / β for the gamma distribution. The derivative of that transformation is -1 / β². The estimate for β is 0.09. Plugging that estimate into the transformation gives 1/0.09 as the point estimate for c, and 1/0.09² is the linear multiplier for the size of the standard error. Geometrically, the linearized transformation scales the standard error for the scale parameter into the standard error for the rate parameter. The following SAS/IML statements read in the parameter estimate for β, use the transformation to produce a point estimate for the rate parameter, c, and use the linearization of the transformation to estimate the standard error and confidence interval for c.

proc iml;
start Xform(p);
   return( 1 / p );          /* estimate rate from scale parameter */
finish;
start JacXform(p);
   return abs( -1 / p##2 );  /* Jacobian of transformation */
finish;
 
use PE where (Parameter='scale');
read all var {Estimate StandardError Lower Upper};
close;
 
xformEst = Xform(Estimate);
xformStdErr = StandardError * JacXform(Estimate);
CI = Lower || Upper;
xformCI = CI * JacXform(Estimate);
AdditionalParamEst = xformEst || xformStdErr || xformCI;
print AdditionalParamEst[c={"Est" "StdErr" "LowerCL" "UpperCL"} F=D8.4];
quit;

The estimates and other statistics are identical to those produced by the ESTIMATE statement in PROC NLMIXED. For a general discussion of changing variables in probability and statistics, see these Penn State course notes. For a discussion about two different parameterizations of the lognormal distribution, see the blog post "Geometry, sensitivity, and parameters of the lognormal distribution."

In case you are wondering, you obtain exactly the same parameter estimates and standard errors if you fit the rate parameter directly. That is, the following call to PROC NLMIXED produces the same parameter estimates for c.

/*Use PROC NLMIXED to fit gamma model using rate parameter */
title "Fit Gamma Model, RATE Parameter";
proc nlmixed data=Gamma;
   parms a 1 rate 1;             * initial value for parameter;
   bounds 0 < a rate;
   model x ~ gamma(a, 1/rate);
   ods select ParameterEstimates;
run;

For this example, fitting the scale parameter is neither more nor less difficult than fitting the rate parameter. If I were faced with a distribution that has two different parameterizations, one simple and one more complicated, I would try to fit the simple parameters to data and use this technique to estimate the more complicated parameters.

The post Parameter estimates for different parameterizations appeared first on The DO Loop.

October 15, 2018
 

There are several ways to use SAS to get the unique values for a data variable. In Base SAS, you can use the TABLES statement in PROC FREQ to generate a table of unique values (and the counts). You can also use the DISTINCT function in PROC SQL to get the same information. In the SAS/IML language, you can use the UNIQUE function to find the unique values of a vector. In all these cases, the values are returned in sorted (alphanumeric) order. For example, here are three ways to obtain the unique values of the TYPE variable in the Sashelp.Cars data set. Only the result from PROC IML is shown:

/* Three ways to get unique values of a variable in sorted order */
proc freq data=sashelp.cars;
tables Type / missing;  /* MISSING option treats missing values as a valid category */
run;
 
proc sql;
   select distinct(Type) as uType
   from sashelp.cars;
quit;
 
proc iml;
use sashelp.cars;
   read all var "Type";
close;
 
uType = unique(Type);
print uType;

Unique values in data order

The FREQ procedure has a nice option that I use FREQently: you can use the ORDER=DATA option to obtain the unique categories in the order in which they appear in a data set. Over the years, I've used the ORDER=DATA option for many purposes, including a recent post that shows how to create a "Top 10" chart. Another useful option is ORDER=FREQ, which orders the categories in descending order by frequency.
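For reference, here is a minimal sketch of the two ORDER= options applied to the Type variable in the Sashelp.Cars data:

proc freq data=sashelp.cars order=data;   /* categories in the order they first appear in the data */
   tables Type / nocum;
run;
 
proc freq data=sashelp.cars order=freq;   /* categories in descending order of frequency */
   tables Type / nocum;
run;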

Recently a colleague asked how to obtain the unique values of a SAS/IML vector in the order in which they appear in the vector. I thought it might also be useful to support an option to return the unique values in (descending) order of frequency, so I wrote the following function. The first argument to the UniqueOrder function is the data vector. The second (optional) argument is a string that determines the order of the unique values. If the second argument has the value "DATA", then the unique values are returned in data order. If the second argument has the value "FREQ", then the unique values are returned in descending order by frequency. The function is shown below. I test the function on a character vector (TYPE) and a numeric vector (CYLINDERS) that contains some missing values.

proc iml;
/* Return the unique values in a row vector. The second parameter
   determines the sort order of the result.
   "INTERNAL": Order by alphanumeric values (default)
   "DATA"    : Order in which elements appear in vector
   "FREQ"    : Order by frequency
*/
start UniqueOrder(x, order="INTERNAL");
   if upcase(order)="DATA" then do;
      u = unique(x);              /* unique values (sorted) */
      idx = j(ncol(u),1,0);       /* vector to store indices in data order */
      do i = 1 to ncol(u);
         idx[i] = loc(x=u[i])[1]; /* 1st location of i_th unique value */
      end;
      call sort(idx);             /* sort the indices ==> data order */
      return ( T(x[idx]) );       /* put the values in data order */
   end;
   else if upcase(order)="FREQ" then do;
      call tabulate(u, freq, x, "missing"); /* compute freqs; missing is valid level */
      call sortndx(idx, freq`, 1, 1);       /* order descending by freq (sort column 1 descending) */
      return ( T(u[idx]) );
   end;
   else return unique(x);
finish;
 
/* test the function by calling it on a character and numeric vectors */
use sashelp.cars;
   read all var "Type";
   read all var "Cylinders";
close;
 
uData = UniqueOrder(Type, "data"); /* unique values in the order they appear in data */
uFreq = UniqueOrder(Type, "freq"); /* unique values in descending frequency order */
print uData, uFreq;
 
uData = UniqueOrder(Cylinders, "data");
uFreq = UniqueOrder(Cylinders, "freq");
print uData, uFreq;

The results of the function are similar to the results of PROC FREQ when you use the ORDER= option. There is a small difference when the data contain missing values. PROC FREQ always lists the missing value first, regardless of where the missing values appear in the data or how many missing values there are. In contrast, the SAS/IML function handles missing and nonmissing values equivalently.

In summary, whether you are implementing an analysis in Base SAS or SAS/IML, this article shows how to obtain the unique values of data in sorted order, in order of frequency, or in the order that the values appear in the data.

The post Get the unique values of a variable in data order appeared first on The DO Loop.

October 10, 2018
 

This article shows how to use SAS to fit a growth curve to data. Growth curves model the evolution of a quantity over time. Examples include population growth, the height of a child, and the growth of a tumor cell. This article focuses on using PROC NLIN to estimate the parameters in a nonlinear least squares model. PROC NLIN is my first choice for fitting nonlinear parametric models to data. Other ways to model growth curves include using splines, mixed models (PROC MIXED or NLMIXED), and nonparametric methods such as loess.

The SAS DATA step specifies the mean height (in centimeters) of 58 sunflowers at 7, 14, ..., 84 days after planting. The American naturalist H. S. Reed studied these sunflowers in 1919 and used the mean height values to formulate a model for growth. Unfortunately, I do not have access to the original data for the 58 plants, but using the mean values will demonstrate the main ideas of fitting a growth model to data.

/* Mean heights of 58 sunflowers:
 Reed, H. S. and Holland, R. H. (1919), "Growth of sunflower seeds" 
 Proceedings of the National Academy of Sciences, volume 5, p. 140.
 http://www.pnas.org/content/pnas/5/4/135.full.pdf
*/
data Sunflower;
input Time Height;
label Time = "Time (days)"
      Height="Sunflower Height (cm)";
datalines;
7  17.93
14 36.36
21 67.76
28 98.1
35 131
42 169.5
49 205.5
56 228.3
63 247.1
70 250.5
77 253.8
84 254.5
;

Fit a logistic growth model to data

A simple mathematical model for population growth that is constrained by resources is the logistic growth model, which is also known as the Verhulst growth model. (This should not be confused with logistic regression, which predicts the probability of a binary event.) The Verhulst equation can be parameterized in several ways, but a popular parameterization is
Y(t) = K / (1 + exp(-r*(t – b)))
where

  1. K is the theoretical upper limit for the quantity Y. It is called the carrying capacity in population dynamics.
  2. r is the rate of maximum growth.
  3. b is a time offset. The time t = b is the time at which the quantity is half of its maximum value.

The model has three parameters, K, r, and b. When you use PROC NLIN in SAS to fit a model, you need to specify the parametric form of the model and provide an initial guess for the parameters, as shown below:

 
proc nlin data=Sunflower list noitprint;
   parms K 250 r 1 b 40;                         /* initial guess */
   model Height = K / (1 + exp(-r*(Time - b)));  /* model to fit; Height and Time are variables in data */
   output out=ModelOut predicted=Pred lclm=Lower95 uclm=Upper95;
   estimate 'Dt' log(81) / r;                    /* optional: estimate function of parameters */
run;
 
title "Logistic Growth Curve Model of a Sunflower";
proc sgplot data=ModelOut noautolegend;
   band x=Time lower=Lower95 upper=Upper95;         /* confidence band for mean */
   scatter x=Time y=Height;                         /* raw observations */
   series x=Time y=Pred;                            /* fitted model curve */
   inset ('K' = '261'  'r' = '0.088'  'b' = '34.3') / border opaque; /* parameter estimates */
   xaxis grid; yaxis grid;
run;

The output from PROC NLIN includes an ANOVA table (not shown), a parameter estimates table (shown below), and an estimate for the correlation of the parameters. The parameter estimates include estimates, standard errors, and 95% confidence intervals for the parameters. The OUTPUT statement creates a SAS data set that contains the original data, the predicted values from the model, and a confidence interval for the predicted mean. The output data is used to create a "fit plot" that shows the model and the original data.

Logistic growth model (Verhulst model) for sunflower growth
Parameter estimates for a Verhulst (logistic) growth model of a sunflower

Articles about the Verhulst model often mention that the "maximum growth rate" parameter, r, is sometimes replaced by a parameter that specifies the time required for the population to grow from 10% to 90% of the carrying capacity, K. This time period is called the "characteristic duration" and is denoted by Δt. Because the curve reaches 10% of K at t = b – log(9)/r and 90% of K at t = b + log(9)/r, the characteristic duration is Δt = 2 log(9)/r = log(81)/r. The ESTIMATE statement in PROC NLIN produces a table (shown below) that estimates the characteristic duration.

Estimate of a physically relevant parameter in interpreting a logistic growth model

The value 50.1 tells you that, on average, it takes about 50 days for a sunflower to grow from 10 percent of its maximum height to 90 percent of its maximum height. By looking at the graph, you can see that most growth occurs during the 50 days between Day 10 and Day 60.
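As a quick check, you can plug the rounded parameter estimates from the inset (r ≈ 0.088 and b ≈ 34.3) into the formula for the characteristic duration:

data _null_;
   r = 0.088;  b = 34.3;        /* rounded estimates from the fitted model */
   t10 = b - log(9)/r;          /* time at which the curve reaches 10% of K */
   t90 = b + log(9)/r;          /* time at which the curve reaches 90% of K */
   Dt  = t90 - t10;             /* equals log(81)/r, approximately 50 days */
   put t10= t90= Dt=;
run;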

This use of the ESTIMATE statement can be very useful. Some models have more than one popular parameterization. You can often fit the model for one parameterization and use the ESTIMATE statement to estimate the parameters for a different parameterization.

The post Fit a growth curve in SAS appeared first on The DO Loop.

October 8, 2018
 

This article compares several ways to find the elements that are common to multiple sets. I test which method is the fastest in the SAS/IML language. However, all algorithms are intrinsically fast, which raises an important question: when is it worth the time and effort to optimize an algorithm?

The idea for this topic came from reading a blog post by 'mikefc' at the "Cool but Useless" blog about the relative performance of various ways to intersect vectors in R. Mikefc used vectors that contained between 10,000 and 100,000 elements and whose intersection was only a few dozen elements. In this article, I increase the sizes of the vectors by an order of magnitude and time the performance of the intersection function in SAS/IML. I also ensure that the sets share a large set of common elements so that the intersection is sizeable.

The intersection of large sets

In general, finding the intersection between sets is a fast operation. In languages such as R, MATLAB, and SAS/IML, you can store the sets in vectors and use built-in functions to form the intersection. Because the intersection of two sets is always smaller than (or equal to) the size of the original sets, computing the "intersection of intersections" is usually faster than computing the intersection of the original (larger) sets.

The following SAS/IML program creates eight SAS/IML vectors that have between 200,000 and 550,000 elements. The elements in the vectors are positive integers, although they could also be character strings or non-integers. The vectors contain a large subset of common elements so that the intersection is sizeable. The vectors A, B, ..., H are created as follows:

proc iml;
call randseed(123);
NIntersect = 5e4; 
common = sample(1:NIntersect, NIntersect, "replace"); /* elements common to all sets */
 
source = 1:1e6;        /* elements are positive integers less than 1 million */
BaseN = 5e5;           /* approx magnitude of vectors (sets) */
A = sample(source,     BaseN, "replace") || common; /* include common elements */
B = sample(source, 0.9*BaseN, "replace") || common;
C = sample(source, 0.8*BaseN, "replace") || common;
D = sample(source, 0.7*BaseN, "replace") || common;
E = sample(source, 0.6*BaseN, "replace") || common;
F = sample(source, 0.5*BaseN, "replace") || common;
G = sample(source, 0.4*BaseN, "replace") || common;
H = sample(source, 0.3*BaseN, "replace") || common;

The intersection of these vectors contains about 31,600 elements. You can implement the following methods that find the intersection of these vectors:

  1. The XSECT function in SAS/IML accepts up to 15 arguments. You can therefore make a single call to find the intersection.
  2. If you have dozens or hundreds of sets to intersect, you can loop over the sets and call the XSECT function multiple times. (Recall that the VALUE function enables you to loop over the names of the sets and access the vectors by names.) For example, you can let W1 be the intersection of the first two sets, let W2 be the intersection of W1 with the third set, and so forth. Because the size of the intersection is typically smaller than the size of the sets, later intersections require less time to compute than earlier intersections. (Although I only implement pairwise intersections, you could conceivably intersect more than two sets at a time.)
  3. As 'mikefc' mentions, it is quicker to intersect small sets than to intersect large sets. Thus you can preprocess the names of the sets and sort them by size. This heuristic might make a difference when the set sizes vary greatly.

The following SAS/IML statements implement these three methods and compute the time required to intersect each:

/* 1. one call to the built-in method */
t0  = time();
   w = xsect(a,b,c,d,e,f,g,h);
tBuiltin = time() - t0;
 
/* 2. loop over variable names and form pairwise intersections */
varName = "a":"h";
t0  = time();
   w = value(varName[1]);
   do i = 2 to ncol(varName);
     w = xsect(w, value(varName[i]));  /* compute pairwise intersection */
   end;
tPairwise = time() - t0;
 
/* 3. Sort by size of sets, then loop */
varName = "a":"h";
t0  = time();
   len = j(ncol(varName), 1);    
   do i = 1 to ncol(varName);
     len[i] = ncol(value(varName[i])); /* number of elements in each set */
   end;
   call sortndx(idx, len);             /* sort smallest to largest */
   sortName = varName[,idx];
   w = value(sortName[1]);
   do i = 2 to ncol(sortName);
     w = xsect(w, value(sortName[i])); /* compute pairwise intersection */
   end;
tSort = time() - t0;
 
print tBuiltin tPairwise tSort;

For this example data, each method takes about 0.5 seconds to find the intersection of eight sets that contain a total of 3,000,000 elements. If you rerun the analysis, the times will vary by a few hundredths of a second, so the times are essentially equal. ('mikefc' also discusses a few other methods, including a "tally method." The tally method is fast, but it relies on the elements being positive integers.)

Whether to optimize or not

There is a quote that I like: "In theory, there is no difference between theory and practice. But, in practice, there is." (Published by Walter J. Savitch, who claims to have overheard it at a computer science conference.) For the intersection problem, you can argue theoretically that pre-sorting by length is an inexpensive way to potentially obtain a boost in performance. However, in practice (for this example data), pre-sorting improves the performance by less than 1%. Furthermore, the absolute times for this operation are short, so saving 1% of 0.5 seconds might not be worth the effort.

Even if you never compute the intersection of sets in SAS, there are two take-aways from this exercise:

  • When you measure the performance of algorithms, use example data that is similar to the real data that you expect to encounter. The performance of this problem depends on the size of the sets, the number of sets, and the contents of the sets. The results that I show here are based on my choices for these parameters. Different parameters might yield different results.
  • Always consider the total time for an algorithm before you start to optimize it. If your analysis takes 30 seconds, there is no need to optimize a step in the analysis that takes 0.5 seconds. No matter how well you optimize that step, it is not going to substantially shrink the total run time of your analysis.

The post The intersection of multiple sets appeared first on The DO Loop.

October 3, 2018
 

It is sometimes necessary for researchers to simulate data with thousands of variables. It is easy to simulate thousands of uncorrelated variables, but more difficult to simulate thousands of correlated variables. For that, you can generate a correlation matrix that has special properties, such as a Toeplitz matrix or a first-order autoregressive (AR(1)) correlation matrix. I have previously written about how to generate a large Toeplitz matrix in SAS. This article describes three useful results for an AR(1) correlation matrix:

  • How to generate an AR(1) correlation matrix in the SAS/IML language
  • How to use a formula to compute the explicit Cholesky root of an AR(1) correlation matrix.
  • How to efficiently simulate multivariate normal variables with AR(1) correlation.

Generate an AR(1) correlation matrix in SAS

The AR(1) correlation structure is used in statistics to model observations that have correlated errors. (For example, see the documentation of PROC MIXED in SAS.) If Σ is an AR(1) correlation matrix, then its elements are constant along diagonals. The (i,j)th element of an AR(1) correlation matrix has the form Σ[i,j] = ρ^|i – j|, where ρ is a constant that determines the geometric rate at which correlations decay between successive time intervals. The exponent for each term is the L1 distance between the row number and the column number. As I have shown in a previous article, you can use the DISTANCE function in SAS/IML 14.3 to quickly evaluate functions that depend on the distance between two sets of points. Consequently, the following SAS/IML function computes the correlation matrix for a p x p AR(1) matrix:

proc iml;
/* return p x p matrix whose (i,j)th element is rho^|i - j| */
start AR1Corr(rho, p); 
   return rho##distance(T(1:p), T(1:p), "L1");
finish;
 
/* test on 10 x 10 matrix with rho = 0.8 */
rho = 0.8;  p = 10;
Sigma = AR1Corr(rho, p);
print Sigma[format=Best7.];

A formula for the Cholesky root of an AR(1) correlation matrix

Every covariance matrix has a Cholesky decomposition, which represents the matrix as the crossproduct of a triangular matrix with itself: Σ = RTR, where R is upper triangular. In SAS/IML, you can use the ROOT function to compute the Cholesky root of an arbitrary positive definite matrix. However, the AR(1) correlation matrix has an explicit formula for the Cholesky root in terms of ρ. This explicit formula does not appear to be well known by statisticians, but it is a special case of a general formula developed by V. Madar (Section 5.1, 2016), who presented a poster at a Southeast SAS Users Group (SESUG) meeting a few years ago. An explicit formula means that you can compute the Cholesky root matrix, R, in a direct and efficient manner, as follows:

/* direct computation of Cholesky root for an AR(1) matrix */
start AR1Root(rho, p);
   R = j(p,p,0);                /* allocate p x p matrix */
   R[1,] = rho##(0:p-1);        /* formula for 1st row */
   c = sqrt(1 - rho**2);        /* scaling factor: c^2 + rho^2 = 1 */
   R2 = c * R[1,];              /* formula for 2nd row */
   do j = 2 to p;               /* shift elements in 2nd row for remaining rows */
      R[j, j:p] = R2[,1:p-j+1]; 
   end;
   return R;
finish;
 
R = AR1Root(rho, p);   /* compare to R = root(Sigma), which requires forming Sigma */
print R[L="Cholesky Root" format=Best7.];

You can compute an AR(1) covariance matrix from the correlation matrix by multiplying the correlation matrix by a positive scalar, σ².
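For example, the following sketch (the value of σ² is arbitrary) scales the correlation matrix that the AR1Corr function returns:

sigma2 = 4;                               /* arbitrary common variance */
Cov = sigma2 * AR1Corr(rho, p);           /* AR(1) covariance matrix */
print (Cov[1:4, 1:4])[L="AR(1) Covariance (upper-left block)" format=Best7.];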

Efficient simulation of multivariate normal variables with AR(1) correlation

An efficient way to simulate data from a multivariate normal population with covariance Σ is to use the Cholesky decomposition to induce correlation among a set of uncorrelated normal variates. This is the technique used by the RandNormal function in SAS/IML software. Internally, the RandNormal function calls the ROOT function, which can compute the Cholesky root of an arbitrary positive definite matrix.

When there are thousands of variables, the Cholesky decomposition might take a second or more. If you call the RandNormal function thousands of times during a simulation study, you pay that one-second penalty during each call. For the AR(1) covariance structure, you can use the explicit formula for the Cholesky root to save a considerable amount of time. You also do not need to explicitly form the p x p matrix, Σ, which saves RAM. The following SAS/IML function is an efficient way to simulate N observations from a p-dimensional multivariate normal distribution that has an AR(1) correlation structure with parameter ρ:

/* simulate multivariate normal data from a population with AR(1) correlation */
start RandNormalAR1( N, Mean, rho );
    mMean = rowvec(Mean);
    p = ncol(mMean);
    U = AR1Root(rho, p);      /* use explicit formula instead of ROOT(Sigma) */
    Z = j(N,p);
    call randgen(Z,'NORMAL');
    return (mMean + Z*U);
finish;
 
call randseed(12345);
p = 1000;                  /* big matrix */
mean = j(1, p, 0);         /* mean of MVN distribution */
 
/* simulate data from MVN distribs with different values of rho */
v = do(0.01, 0.99, 0.01);  /* choose rho from list 0.01, 0.02, ..., 0.99 */
t0 = time();               /* time it! */
do i = 1 to ncol(v);
   rho = v[i];
   X = randnormalAR1(500, mean, rho);  /* simulate 500 obs from MVN with p vars */
end;
t_SimMVN = time() - t0;     /* total time to simulate all data */
print t_SimMVN;

The previous loop generates a sample that contains N=500 observations and p=1000 variables. Each sample is from a multivariate normal distribution that has an AR(1) correlation, but each sample is generated for a different value of ρ, where ρ = 0.01, 0.02, ..., 0.99. On my desktop computer, this simulation of 100 correlated samples takes about 4 seconds. This is about 25% of the time for the same simulation that explicitly forms the AR(1) correlation matrix and calls RandNormal.
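For comparison, here is a rough sketch of the slower approach, which explicitly forms Σ and calls the built-in RANDNORMAL routine inside the same kind of loop:

t0 = time();
do i = 1 to ncol(v);
   Sigma = AR1Corr(v[i], p);           /* explicitly form the p x p correlation matrix */
   X = randnormal(500, mean, Sigma);   /* the built-in routine computes the Cholesky root internally */
end;
t_RandNormal = time() - t0;
print t_SimMVN t_RandNormal;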

In summary, the AR(1) correlation matrix is an easy way to generate a symmetric positive definite matrix. You can use the DISTANCE function in SAS/IML 14.3 to create such a matrix, but for some applications you might only require the Cholesky root of the matrix. The AR(1) correlation matrix has an explicit Cholesky root that you can use to speed up simulation studies such as generating samples from a multivariate normal distribution that has an AR(1) correlation.

The post Fast simulation of multivariate normal data with an AR(1) correlation structure appeared first on The DO Loop.

October 1, 2018
 

Programmers on a SAS discussion forum recently asked about the chi-square test for proportions as implemented in PROC FREQ in SAS. One person asked the basic question, "how do I test the null hypothesis that the observed proportions are equal to a set of known proportions?" Another person said that the null hypothesis was rejected for his data, and he wanted to know which categories were "responsible for the rejection." This article answers both questions and points out a potential pitfall when you specify the proportions for a chi-square goodness-of-fit test in PROC FREQ.

The basic idea: The proportion of party affiliations for a group of voters

To make these questions concrete, let's look at some example data. According to a 2016 Pew research study, the party affiliation of registered voters in the US in 2016 was as follows: 33% of voters registered as Democrats, 29% registered as Republicans, 34% were Independents, and 4% registered as some other party. If you have a sample of registered voters, you might want to ask whether the observed proportion of affiliations matches the national averages. The following SAS data step defines the observed frequencies for a hypothetical sample of 300 voters:

data Politics;
length Party $5;
input Party $ Count;
datalines;
Dem   125
Repub  79
Indep  86
Other  10
;

You can use the TESTP= option on the TABLES statement in PROC FREQ to compare the observed proportions with the national averages for US voters. You might assume that the following statements perform the test, but there is a potential pitfall. The following statements contain a subtle error:

proc freq data=Politics;
/* National Pct:    D=33%, R=29%, I=34%, Other=04% */
tables Party / TestP=(0.33 0.29 0.34 0.04) nocum; /* WARNING: Contains an error! */
weight Count;
run;

If you look carefully at the OneWayFreqs table that is produced, you will see that the test proportions that appear in the fourth column are not the proportions that we intended to specify! The problem is that the order of the categories in the table is alphabetical, whereas the proportions in the TESTP= option correspond to the order in which the categories appear in the data. In an effort to prevent this mistake, the documentation for the TESTP= option warns you to "order the values to match the order in which the corresponding variable levels appear in the one-way frequency table." The order of categories is important in many SAS procedures, so always think about the order! (The ESTIMATE and CONTRAST statements in linear regression procedures are other statements where order is important.)

Specify the test proportions correctly

To specify the correct order, you have two options: (1) list the proportions for the TESTP= option according to the alphabetical order of the categories, or (2) use the ORDER=DATA option on the PROC FREQ statement to tell the procedure to use the order of the categories as they appear in the data. The following statement uses the ORDER=DATA option to specify the proportions:

proc freq data=Politics ORDER=DATA;   /* list proportions in DATA order */
/*                 D=33%, R=29%, I=34%, Other=04% */
tables Party / TestP=(0.33 0.29 0.34 0.04);  /* cellchi2 not available for one-way tables */
weight Count;
ods output OneWayFreqs=FreqOut;
output out=FreqStats N ChiSq;
run;

The analysis is now correctly specified. The chi-square table indicates that the observed proportions are significantly different from the national averages at the α = 0.05 significance level.
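For completeness, a minimal sketch of option (1), which keeps the default alphabetical order of the categories (Dem, Indep, Other, Repub) and lists the test proportions in that order:

proc freq data=Politics;                     /* default alphabetical order of categories */
/*                 D=33%, I=34%, Other=04%, R=29% */
tables Party / TestP=(0.33 0.34 0.04 0.29) nocum;
weight Count;
run;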

Which categories are "responsible" for rejecting the null hypothesis?

A SAS programmer posted a similar analysis on a discussion forum and asked whether it was possible to determine which categories were the most different from the specified proportions. The analysis shows that the chi-square test rejects the null hypothesis, but does not indicate whether only one category is different than expected or whether many categories are different.

Interestingly, PROC FREQ supports such an option for two-way tables when the null hypothesis is the independence of the two variables. Recall that the chi-square statistic is a sum of squares, where each cell in the table contributes one squared value to the sum. The CELLCHI2 option on the TABLES statement "displays each table cell’s contribution to the Pearson chi-square statistic.... The cell chi-square is computed as
(frequency − expected)² / expected
where frequency is the table cell frequency (count) and expected is the expected cell frequency" under the null hypothesis.

Although the option is not supported for one-way tables, it is straightforward to use the DATA step to compute each cell's contribution. The previous call to PROC FREQ used the ODS OUTPUT statement to write the OneWayFreqs table to a SAS data set. It also wrote a data set that contains the sample size and the chi-square statistic. You can use these statistics as follows:

/* create macro variables for sample size and chi-square statistic */
data _NULL_;
   set FreqStats;
   call symputx("NumObs", N);         
   call symputx("TotalChiSq", _PCHI_);
run;
 
/* compute the proportion of chi-square statistic that is contributed
   by each cell in the one-way table */
data Chi2;
   set FreqOut;
   ExpectedFreq = &NumObs * TestPercent / 100;
   Deviation = Frequency - ExpectedFreq;
   ChiSqContrib = Deviation**2 / ExpectedFreq;  /* (O - E)^2 / E */
   ChiSqPropor = ChiSqContrib / &TotalChiSq;    /* proportion of chi-square contributed by this cell */
   format ChiSqPropor 5.3;
run;
 
proc print data=Chi2; 
   var Party Frequency TestPercent ExpectedFreq Deviation ChiSqContrib ChiSqPropor; 
run;

The table shows the numbers used to compute the chi-square statistic. For each category of the PARTY variable, the table shows the expected frequencies, the deviations from the expected frequencies, and the chi-square term for each category. The last column is the proportion of the total chi-square statistic for each category. You can see that the 'Dem' category contributes the greatest proportion. The interpretation is that the observed count of the 'Dem' group is much greater than expected and this is the primary reason why the null hypothesis is rejected.

You can also create a bar chart that shows the contributions to the chi-square statistic. You can create the "chi-square contribution plot" by using the following statements:

title "Proportion of Chi-Square Statistic for Each Category";
proc sgplot data=Chi2;
   vbar Party / response=ChiSqPropor datalabel=ChiSqPropor;
   xaxis discreteorder=data;
   yaxis label="Proportion of Chi-Square Statistic" grid;
run;
Contribution of each cell to the chi-square statistic

The bar chart makes it clear that the frequency of the 'Dem' group is the primary factor in the size of the chi-square statistic. The "chi-square contribution plot" is a visual companion to the Deviation Plot, which is produced automatically by PROC FREQ when you specify the PLOTS=DEVIATIONPLOT option. The Deviation Plot shows whether the counts for each category are more than expected or less than expected. When you combine the two plots, you can make pronouncements like Goldilocks:

  • The 'Dem' group contributes the most to the chi-square statistic because the observed counts are "too big."
  • The 'Indep' group contributes a moderate amount because the counts are "too small."
  • The remaining groups do not contribute much because their counts are "just right."

Summary

In summary, this article addresses three topics related to testing the proportions of counts in a one-way frequency table. You can use the TESTP= option to specify the proportions for the null hypothesis. Be sure that you specify the proportions in the same order that they appear in the OneWayFreqs table. (The ORDER=DATA option is sometimes useful for this.) If the data proportions do not fit the null hypothesis, you might want to know why. One way to answer this question is to compute the contributions of each category to the total chi-square computation. This article shows how to display that information in a table or in a bar chart.

The post Chi-square tests for proportions in one-way tables appeared first on The DO Loop.

September 26, 2018
 

A radial basis function is a scalar function that depends on the distance to some point, called the center point, c. One popular radial basis function is the Gaussian kernel φ(x; c) = exp(-||x – c||² / (2 σ²)), which uses the squared distance from a vector x to the center c to assign a weight. The weighted sum of Gaussian kernels, Σ wᵢ φ(x; cᵢ), arises in many applications in statistics, including kernel density estimation, kernel smoothing, and machine learning algorithms such as support vector machines. It is therefore important to be able to efficiently evaluate a radial basis function and compute a weighted sum of several such kernel functions.

One of the many useful features of the SAS/IML language is its ability to compactly represent matrix and vector expressions. The expression ||x – c|| looks like the distance between two vectors, but in the SAS/IML language the DISTANCE function can handle multiple sets of vectors, as the following list (and the short sketch after it) shows:

  • The DISTANCE function can compute the distance between two vectors of arbitrary dimensions. Thus when x and c are both d-dimensional row vectors, you can compute the distance by using r = DISTANCE(x, c). The result is a scalar distance.
  • The DISTANCE function can compute the distance between multiple points and a center. Thus when x is an m x d matrix that contains m points, you can compute the m distances between the points and c by using r = DISTANCE(x, c). Again, the syntax is the same, but now r is an m x 1 vector of distances.
  • The DISTANCE function in SAS/IML 14.3 can compute the distance between multiple points and multiple centers. Thus when x is an m x d matrix that contains m points and c is a p x d matrix that contains p centers, you can compute the m*p distances between the points and c by using r = DISTANCE(x, c). The syntax is the same, but now r is an m x p matrix of distances.
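The following minimal sketch illustrates the three cases on tiny matrices:

proc iml;
x = {1 2};                    /* one 2-D point */
c = {0 0};                    /* one center */
r1 = distance(x, c);          /* scalar: distance between x and c */
xm = {1 2, 3 4, 5 6};         /* m=3 points */
r2 = distance(xm, c);         /* 3 x 1 vector of distances to the center */
cm = {0 0, 1 1};              /* p=2 centers */
r3 = distance(xm, cm);        /* 3 x 2 matrix of distances (SAS/IML 14.3) */
print r1, r2, r3;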

A SAS/IML function that evaluates a Gaussian kernel function

The following SAS/IML statements define a Gaussian kernel function. Notice that the function is very compact! To test the function, define one center at C = (2.3, 3.2). Because SAS/IML is a matrix language, you can evaluate the Gaussian kernel on a grid of integer coordinates (x,y) where x is an integer in the range [1,5] and y is in the range [1,8]. Let Z be the matrix of the 40 ordered pairs. The following call evaluates the Gaussian kernel at the grid of points:

proc iml;
/* Radial basis function (Gaussian kernel). If z is m x d and c is n x d, this function
   returns the mxn matrix of values exp( -||z[i,] - c[j,]||**2 / (2*sigma**2) ) */
start GaussKernel(z, c, sigma=1);
   return exp( -distance(z,c)##2 / (2*sigma**2) );
finish;
 
/* test on small data: Z is a 5 x 8 grid and C = {2.3 3.2} */
xPts = 1:5; yPts = 1:8;
Z = expandgrid(xPts, yPts);              /* expand into (8*5) x 2 matrix */
C = {2.3 3.2};                           /* the center */ 
phi = GaussKernel(Z, C);                 /* phi is 40 x 1 vector */
 
print Z phi;                             /* print in expanded form */
phi_Grid = shapecol(phi, ncol(yPts));    /* reshape into grid (optional) */
print phi_Grid[c=(char(xPts)) r=(char(yPts)) F=4.2];

The table shows the Gaussian kernel evaluated at the grid points. The columns represent the values at the X locations and the rows indicate the Y locations. The function is largest at the value (x,y)=(2,3) because (2,3) is the grid point closest to the center (2.3, 3.2). The largest value is 0.94. Notice that the function is essentially zero at points that are more than 3 units from the center, which you would expect from a Gaussian distribution with σ = 1.

You can use the HEATMAPCONT subroutine to make a heat map of the function values. However, notice that in the matrix the rows increase in the downward direction, whereas in the usual Cartesian coordinate system the Y direction increases upward. Consequently, you need to reverse the rows and the Y-axis labels when you create a heat map:

start PlotValues( v, xPts, yPts );
   G = shapecol(v, ncol(yPts));      /* reshape vector into grid */
   M =  G[nrow(G):1, ];              /* flip Y axis (rows) */
   yRev = yPts[, ncol(yPts):1];      /* reverse the Y-axis labels */
   call heatmapcont(M) xvalues=xPts yValues=yRev;
finish;
run PlotValues(phi, xPts, yPts);
Evaluate Gaussian kernel on grid of points. Visualize with a heat map.

Sums of radial basis functions

Locations of 86 US cities with population greater than 200,000

Often the "centers" are the locations of some resource such as a warehouse, a hospital, or an ATM. Let's use the locations of 86 large US cities, which I used in a previous article about spatial data analysis. A graph of the locations of the cities is shown to the right. The locations are in a standardized coordinate system, so they do not directly correspond to longitudes and latitudes.

If there are multiple centers, the GaussKernel function returns a column for every center. Many applications require a weighted sum of the columns. You can achieve a weighted sum by using a matrix-vector product A*w, where w is a column vector of weights. If you want an unweighted sum, you can use the SAS/IML subscript reduction operator to sum across the columns: A[,+].
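Here is a minimal sketch of the two kinds of sums; the matrix A and the weight vector w are hypothetical stand-ins for the kernel evaluations and the weights:

proc iml;
A = {1 2 3, 4 5 6};               /* hypothetical 2 x 3 matrix of kernel evaluations (one column per center) */
w = {0.5, 0.3, 0.2};              /* hypothetical column vector of weights */
weightedSum   = A * w;            /* 2 x 1 weighted sum of the columns */
unweightedSum = A[ ,+];           /* 2 x 1 unweighted sum of the columns */
print weightedSum unweightedSum;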

For example, the following statements evaluate the Gaussian kernel function at each value in a grid (the Z matrix) and for each of 86 cities (the C matrix). The result is a 3726 x 86 matrix of values. You can use the subscript reduction operator to sum the kernel evaluations over the cities, as shown:

use BigCities;
   read all var {x y} into C;     /* C = (x,y) locations of centers */
   read all var "City";
close;
 
/* Z = a regular grid in (x,y) coordinates that contains the data */
XGridPts = round( do(-0.4, 0.4, 0.01), 0.001);
YGridPts = round( do(-0.2, 0.25, 0.01), 0.001);
Z = expandgrid( XGridPts, YGridPts );  /* 3,726 points on a 81x46 grid */
 
phi = GaussKernel(Z, C, 0.025);   /* use smaller bandwidth */
sumPhi = phi[,+];                 /* for each grid point, add sum of kernel evaluations */
Sum of 86 Gaussian kernels evaluated on a regular grid

The resulting heat map shows blobs centered at each large city in the data. Locations near isolated cities (such as Oklahoma City) are lighter in color than locations near multiple nearby cities (such as southern California and the New York area) because the image shows the superposition of the kernel functions. At points that are far from any large city, the sum of the Gaussian kernel functions is essentially zero.

In summary, if you work with algorithms that use radial basis functions such as Gaussian kernels, you can use the SAS/IML language to evaluate these functions. By using the matrix features of the language and the fact that the DISTANCE function supports matrices as arguments, you can quickly and efficiently evaluate weighted sums of these kernel functions.

The post Radial basis functions and Gaussian kernels in SAS appeared first on The DO Loop.

September 24, 2018
 

Last week I compared the overhand shuffle to the riffle shuffle. I used random operations to simulate both kinds of shuffles and then compared how well they mix cards. The article prompted my colleague and fellow blogger, Rob Pratt, to ask whether I was familiar with a bit of shuffling trivia: if you perform perfect riffle shuffles on a standard 52-card deck, the cards return to their original order after exactly eight shuffles! This mathematical curiosity is illustrated in a beautiful animation by James Miles, which shows the results of eight perfect shuffles.

After being reminded of this interesting fact, I wondered how the result generalizes to decks of various sizes. That is, if you use a deck with N cards, what is the minimum number of perfect riffle shuffles (call it P(N)) that you need to restore the deck to its original order? I decided to run a SAS program to discover the answer. The result is summarized in the following graph, which plots P(N) versus the number of cards in a deck. All points are below the identity line, which implies that at most N – 1 shuffles are required for a deck that contains N cards. If you want to learn more about the graph and its interesting patterns, read on.

Number of perfect shuffles for deck of N cards to restore the original order

Coding the perfect riffle shuffle in SAS

The perfect riffle shuffle is easier to program than the Gilbert-Shannon-Reeds (GSR) model, which uses randomness to model how real people shuffle real cards. In a perfect riffle shuffle, you cut the deck into exactly two equal parts. Then you interlace the cards from the top stack with the cards from the bottom stack. The new deck ordering is TBTBTB..., where T indicates a card from the top half of the original deck and B indicates a card from the bottom half.

There are actually two perfect riffle shuffles for an even number of cards: you can interlace the cards TBTBTB..., which is called an "out shuffle," or you can interlace the cards BTBTBT..., which is called an "in shuffle." For odd-numbered decks, there is also a choice of where to cut the deck: does the top part of the deck get the extra card or does the bottom part? My program always uses the "out shuffle" convention (the top card stays on top) and gives the top stack the extra card when there is an odd number of cards. Therefore, the shuffled deck looks like TBTB...TB for even-numbered decks and TBTB...TBT for odd-numbered decks. The following SAS/IML function takes a row vector that represents a card deck and performs a perfect riffle shuffle to reorder the cards in the deck:

proc iml;
start PerfectRiffle(deck);
   N = ncol(deck);                        /* send in deck as a row vector */
   shuffle = j(1,N,.);                    /* allocate new deck */
   nT = ceil(N/2);                        /* cut into two stacks; "Top" gets extra card when N odd */
   nB = N - nT;                           /* nT = size of Top; nB = size of Bottom */
   shuffle[, do(1,N,2)] = deck[,1:nT];   /* top cards go into odd positions */
   shuffle[, do(2,N,2)] = deck[,nT+1:N]; /* bottom cards go into even positions */
   return(shuffle);
finish;
 
/* test the algorithm on a deck that has N = 10 cards */
origDeck = 1:10;
deck = PerfectRiffle(origDeck);
print (origdeck//deck)[r={"Orig Deck" "Perfect Riffle"}];
Perfect shuffle on deck of N=10 cards

The output shows one perfect riffle of a deck that has 10 cards that are originally in the order 1, 2, ..., 10.

How many perfect riffle shuffles to restore order?

If you call the PerfectRiffle function repeatedly on the same deck, you can simulate repeated perfect riffle shuffles. After each shuffle, you can test whether the deck is in its original order. If so, you can stop shuffling; if not, you shuffle again. The following SAS/IML loops test decks that contain up to 1,000 cards. The array nUntilRestore keeps track of how many perfect riffle shuffles were done for each deck size.

decksize = 1:1000;                       /* N = 1, 2, 3, ..., 1000 */
nUntilRestore = j(1, ncol(decksize), .); /* the results vector */
do k = 2 to ncol(decksize);              /* for the deck of size N */
   origDeck = 1:decksize[k];             /* original order is 1:N */
   deck = OrigDeck;                      /* set order of deck */
   origOrder = 0;                        /* flag variable */
   do nShuffles = 1 to decksize[k] until (origOrder); /* repeat shuffling */
      deck = PerfectRiffle( deck );      /* shuffle deck */
      origOrder = all(deck = origDeck);  /* until the order of the cards are restored */
   end;
   nUntilRestore[k] = nShuffles;         /* store the number of shuffles */
end;
 
/* print the results for N=1,..,99 */
x = j(10, 10, .);
x[2:100] = nUntilRestore[1:99];          /* present in 10x10 array */
rr = putn( do(0, 90, 10), "Z2.");        /* each row has 10 elements */
cc = ("0":"9");
print x[r=rr c=cc L="Number of Perfect Shuffles to Restore Order"];
 
title "Number of Perfect Shuffles to Restore Deck Order";
call scatter(deckSize, nUntilRestore) grid={x y} other="lineparm x=0 y=0 slope=1;"
     label={"N: Number of Cards in Deck" "P(N): Number of Perfect Shuffles to Restore Order"}
     procopt="noautolegend";
Number of perfect shuffles required to restore order in a deck of N cards, N=1..99

The table shows the number of perfect riffle shuffles required for decks that have fewer than 99 cards. The results are in a 10x10 grid. The first row shows decks that have less than 10 cards, the second row shows sizes 10-19, and so on. For example, to find the result for a 52-card deck, move down to the row labeled "50" and over to the column labeled "2". The result in that cell is 8, which is the number of perfect shuffles required to restore a 52-card deck to its original order.

Remarks on the number of perfect riffle shuffles

The graph at the top of this article shows the number of shuffles for decks that contain up to 1,000 cards. To me, the most surprising feature of the graph is the number of points that fall near diagonal lines of various rational slopes. The most apparent diagonal features are the lines whose slopes are approximately equal to 1, 1/2, and 1/3.

A complete mathematical description of this problem is beyond the scope of this blog article, but here are a few hints and links to whet your appetite.

  • The integer sequence P(N) is related to the "multiplicative order of 2 mod 2n+1" in the On-Line Encyclopedia of Integer Sequences (OEIS). The encyclopedia includes a graph that is similar to the one here, but is computed for N ≤ 10,000. That integer sequence applies to the number of perfect riffle shuffles of an EVEN number of cards.
  • The link to the OEIS shows a quick way to explicitly find P(N) for even values of N: P(N) is the smallest value of k for which 2^k is congruent to 1 (mod N–1). For example, 8 is the smallest number for which 2^8 is congruent to 1 (mod 51), as you can compute explicitly: the value of mod(2**8, 51) is 1. (A short check appears after this list.)
  • The points in the graph for which P(N) = N-1 are all prime numbers: 2, 3, 5, 11, 13, 19, 29, 37, 53, 59, 61, 67, .... However, notice that not all prime numbers are on this list.
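The following sketch performs that modular computation for a 52-card deck:

proc iml;
N = 52;                              /* deck size (even) */
k = 1;
do while( mod(2**k, N-1) ^= 1 );     /* find the smallest k such that 2^k = 1 (mod N-1) */
   k = k + 1;
end;
print k;                             /* k = 8, which matches the table above */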

There is much more to be said about this topic, but I'll stop here. If you have something to add, please leave a comment.

The post How many perfect riffle shuffles are required to restore a deck to its initial order? appeared first on The DO Loop.

September 19, 2018
 
Every day I’m shufflin'.
Shufflin', shufflin'.
      -- "Party Rock Anthem," LMFAO

The most popular way to mix a deck of cards is the riffle shuffle, which separates the deck into two pieces and interleaves the cards from each piece. Besides being popular with card players, the riffle shuffle is well-known among statisticians because Bayer and Diaconis (1992) showed that it takes seven riffle shuffles to randomize a 52-card deck. This result annoyed many recreational card players because shuffling seven times "slows down the game".

The second most popular shuffling technique is the overhand shuffle. The overhand shuffle is much less efficient at mixing the cards than a riffle shuffle. A result by Pemantle (1989) indicates that at least 50 overhand shuffles are required to adequately mix the deck, but computer experiments (Pemantle, 1989, p. 49) indicate that in many situations 1,000 or more overhand shuffles are required to randomize a deck. Talk about slowing down the game! This article shows how to simulate an overhand shuffle in SAS and visually compares the mixing properties of the riffle and overhand shuffles.

The overhand shuffle

An overhand shuffle transfers a few cards at a time from the dealer's right hand to his left. The dealer slides a few cards from the top of the deck in his right hand into his left hand. The process is repeated until all cards in his right hand are transferred. Consequently, cards that started near the top of the deck end up near the bottom, and vice versa.

Mathematically, you can model an overhand shuffle by randomly choosing k cut points that separate the deck into k+1 "packets" of contiguous cards. The size of a packet can vary, as can the number of packets. The overhand shuffle reverses the order of the packets while preserving the order of the cards within each packet. When the packet sizes are small, the cards are mixed better than when the packets are bigger.

For example, a deck that contains 10 cards (enumerated 1-10) might be split into five packets such as (1)(234)(5)(678)(9 10), where the parentheses show the cards in each of the packets. In this example, the cut points are the locations after cards 1, 4, 5, and 8. The packets contain 1, 3, 1, 3, and 2 cards. The overhand shuffle reverses the order of the packets, so the deck is reordered as (9 10)(678)(5)(234)(1). This is shown in the following graphic.

In a statistical model of the overhand shuffle, each of the N-1 possible locations between adjacent cards is independently chosen to be a cut point with probability p. An ideal overhand shuffle uses p = 1/2, which results in packets that typically contain 1, 2, or 3 cards (the packet size is geometrically distributed). However, among amateurs, the probability is much lower, so an overhand shuffle reorders only a few large packets, which can lead to poor mixing of the cards.
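As a quick empirical check of that claim, the following sketch simulates many sets of cut points and computes the average packet size for each. The simulation size and random-number seed are arbitrary choices for this illustration:

proc iml;
call randseed(4321);                      /* arbitrary seed for this sketch */
N = 52;  p = 0.5;                         /* deck size; cut-point probability */
nSim = 10000;                             /* number of simulated shuffles */
avgSize = j(nSim, 1, .);
do i = 1 to nSim;
   b = randfun(N-1, "Bernoulli", p);      /* choose cut points */
   nPackets = sum(b) + 1;                 /* k cut points ==> k+1 packets */
   avgSize[i] = N / nPackets;             /* average packet size for this shuffle */
end;
print (mean(avgSize))[L="Average packet size (about 1/p = 2)"];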

Simulate an overhand shuffle

I previously showed how to simulate a riffle shuffle in the SAS/IML language. The following program shows how to simulate an overhand shuffle. The deck of cards (represented as a vector of integers with the values 1 to N) is passed in, as is the probability of choosing a cut point. To test the program, I pass in a deck that has 10 cards enumerated by the integers 1–10. The new order of the deck was shown in the previous section.

proc iml;
call randseed(1234);
 
/* Simulate overhand shuffle. Randomly split deck into k packets.
   Reverse the order of the packets. Each position between cards
   is chosen to be a cut point with probability p. (Default: p=0.5)
*/
start Overhand(_deck, p=0.5);
   deck = colvec(_deck);              /* make  sure input is column vector */
   N = nrow(deck);                    /* number of cards in deck */
   b = randfun(N-1, "Bernoulli", p);  /* randomly choose cut points */
   if all(b=0) then return _deck;     /* no cut points were chosen */
   cutpt = loc(b=1);                  /* cut points in range 1:n-1 */
   cutpt = 0 || cutpt || N;           /* append 0 and n */
   rev = n - cutpt + 1;               /* reversed positions */
 
   newDeck = j(N, 1, .);
   do i = 1 to ncol(cutpt)-1;         /* for each packet... */
      R = (1 + cutpt[i]):cutpt[i+1];  /* positions of cards in right hand */
      L = rev[i+1]:(rev[i]-1);        /* new positions in left hand */
      newDeck[L] = deck[R];           /* transfer packet from right to left hand */
   end;
   return shape(newDeck, nrow(_deck), ncol(_deck)); /* return same shape as input */
finish;
 
deck = 1:10;
OverhandDeck = Overhand(deck, 0.5);
print deck;
print OverhandDeck;

Simulate many overhand shuffles

Let's now consider the standard 52-card deck, represented by the integers 1–52. The following loop performs 10 overhand shuffles (p = 0.5) and uses a heat map to show the order of the cards after each shuffle:

N = 52;               /* standard 52-card deck */
deck = 1:N;           /* initial order */
numShuffles = 10;     /* number of times to shuffle cards */
 
OverDeck = j(numShuffles+1, N);
OverDeck[1,] = deck;
do i = 2 to numShuffles + 1;
   OverDeck[i,] = Overhand(OverDeck[i-1,]);
end;
 
ramp = palette("YLORBR",9);
call heatmapcont(OverDeck) colorramp=ramp yvalues=0:numShuffles
     title="Positions of 52 Cards After Overhand Shuffles";
Visualization of 10 overhand shuffles on a deck of 52 cards

The cell colors in the heat map correspond to the initial position of the cards: light colors indicate cards that are initially near the top of the deck and dark colors represent cards that are initially near the bottom of the deck. The first row shows the initial positions of the cards in the deck. The second row shows the cards after one overhand shuffle. The third row shows the cards after two overhand shuffles, and so on. It is interesting to see how the pattern changes if you use p=0.25 or p=0.1. Try it!
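For example, the following sketch reruns the simulation with p=0.25. It assumes that the Overhand module and the deck, N, numShuffles, and ramp variables from the previous blocks are still defined:

p = 0.25;                                 /* fewer cut points ==> larger packets */
OverDeck2 = j(numShuffles+1, N);
OverDeck2[1,] = deck;
do i = 2 to numShuffles + 1;
   OverDeck2[i,] = Overhand(OverDeck2[i-1,], p);   /* pass p explicitly */
end;
call heatmapcont(OverDeck2) colorramp=ramp yvalues=0:numShuffles
     title="Positions of 52 Cards After Overhand Shuffles (p=0.25)";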

One feature that is apparent in the p=0.5 heat map is that the relative positions of cards depend on whether you have performed an even or an odd number of overhand shuffles. After an odd number of shuffles, the top of the deck contains many cards that were initially at the bottom, and the bottom of the deck contains cards that were initially at the top. After an even number of shuffles, the top of the deck contains many cards that were initially at the top. Another apparent feature is that this process does not mix the cards very well. The last row in the heat map indicates that after 10 overhand shuffles, many cards are not very far from their initial positions. Furthermore, many cards that were initially next to each other are still within a few positions of each other.
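One rough way to quantify the impression that the deck is poorly mixed is to compute the rank correlation between the final order and the initial order; a value near 0 indicates a well-mixed deck, whereas a value near 1 indicates that the original order is largely preserved. The following sketch assumes the OverDeck matrix from the p=0.5 simulation is still in memory:

lastRow = OverDeck[numShuffles+1, ];      /* order after 10 overhand shuffles */
M = T(1:N) || T(lastRow);                 /* initial order vs. current order */
R = corr(M, "Spearman");                  /* 2x2 rank-correlation matrix */
print (R[1,2])[L="Rank correlation with initial order"];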

Maybe additional overhand shuffles will address these deficiencies? The following program simulates 100 overhand shuffles. To aid in the visualization, the heat map shows only the results after 2, 4, 6, ..., and 100 shuffles (the even numbers):

deck = 1:N;        /* reset initial order */
numShuffles = 100; /* number of times to shuffle cards */
numEvenShuffles = numShuffles / 2;
OverDeck = j(numEvenShuffles+1, N);
 
OverDeck[1,] = deck;
/* save only the EVEN shuffles */
do i = 2 to numEvenShuffles + 1;
   OverDeck[i,] = Overhand(Overhand( OverDeck[i-1,] )); /* 2 calls */
end;
 
call heatmapcont(OverDeck) colorramp=ramp yvalues=0:numEvenShuffles
     title="Positions of 52 Cards After EVEN Overhand Shuffles";
Positions of playing cards after k overhand shuffles, k=0, 2, 4,..., 100

The heat map indicates that the deck is not well-mixed even after 100 overhand shuffles. Look at the last row in the heat map, which shows the positions of the cards in the deck colored by their initial position. There are a small number of light-colored cells on the right side of the heat map (cards that were initially on top; now near the bottom), and a small number of dark-colored cells on the left side of the heat map (cards that were initially on the bottom; now near the top). However, in general, the left side of the heat map contains mostly light-colored cells and the right side contains mostly dark-colored cells, which indicates that most cards have not moved far from their initial positions.

If you use smaller values of p (that is, the average packet size is larger), you will observe different patterns. The theory says that large packets cause nearby cards to tend to stay together for many shuffles. Jonasson (2006) proved that if a deck has N cards, then O(N^2 log(N)) overhand shuffles are required to randomize the deck. This compares with (3/2) log_2(N) for the riffle shuffle, which is presented in the next section.
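To put those growth rates in perspective, the following back-of-the-envelope sketch evaluates both expressions for a 52-card deck. Because the big-O notation hides a constant, the overhand value is only an order-of-magnitude indication:

proc iml;
N = 52;
riffle   = 3/2 * log(N)/log(2);           /* (3/2)*log_2(N): about 8.6 shuffles */
overhand = N**2 * log(N);                 /* N^2*log(N): on the order of 10,000 */
print N riffle overhand;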

A comparison with the riffle shuffle

Recall that the riffle shuffle randomizes a 52-card deck after about seven shuffles. You can simulate a riffle shuffle in SAS and create a similar heat map. The following heat map shows the order of a 52-card deck after each of 10 riffle shuffles:

Position of cards after k riffle shuffles, k=0, 1, 2, ..., 10

The heat map shows how quickly a riffle shuffle randomizes a deck. The last row (after 10 riffles) shows that light-colored cells are distributed throughout the deck, as are dark-colored cells and medium-colored cells. Furthermore, this behavior seems to occur after six or seven rows (at least to my untrained eye). I cannot see much difference between the "randomness" in row 6 and row 10. However, theoretically, a card in an earlier row (say, row 5) is more likely to be near cards that were initially close to it. For example, a dark-colored cell in row 5 is more likely to be near another dark-colored cell than would be expected in a randomized deck. However, by the time you perform 7 or more riffle shuffles, two cards that started near each other might be located far apart or might be close to each other. You cannot predict which will occur because the cards have been randomized.
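The riffle-shuffle code is in the earlier article that is linked above. As a convenience, here is a minimal sketch of one standard model (the Gilbert-Shannon-Reeds riffle) that produces a similar heat map. It is not necessarily the exact implementation that was used for the image, and it assumes that the N and ramp variables from the earlier blocks are still in memory; the module name RiffleGSR is used only for this sketch:

/* Sketch of a GSR riffle shuffle: cut the deck at a Binomial(N,1/2) position,
   then interleave the two packets, dropping the next card from a packet with
   probability proportional to the number of cards remaining in that packet. */
start RiffleGSR(_deck);
   deck = colvec(_deck);
   N = nrow(deck);
   c = randfun(1, "Binomial", 0.5, N);    /* cut position */
   A = 1;  B = c + 1;                     /* next card to take from each packet */
   newDeck = j(N, 1, .);
   do i = 1 to N;
      a = c - A + 1;                      /* cards left in the first packet */
      b = N - B + 1;                      /* cards left in the second packet */
      if randfun(1, "Uniform") < a/(a+b) then do;
         newDeck[i] = deck[A];  A = A + 1;
      end;
      else do;
         newDeck[i] = deck[B];  B = B + 1;
      end;
   end;
   return( shape(newDeck, nrow(_deck), ncol(_deck)) );
finish;
 
deck = 1:N;  numShuffles = 10;
RiffDeck = j(numShuffles+1, N);
RiffDeck[1,] = deck;
do i = 2 to numShuffles + 1;
   RiffDeck[i,] = RiffleGSR(RiffDeck[i-1,]);
end;
call heatmapcont(RiffDeck) colorramp=ramp yvalues=0:numShuffles
     title="Positions of 52 Cards After Riffle Shuffles";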

In summary, the overhand shuffle does not mix cards very well. The simulation indicates that an overhand shuffle is not effective at separating cards that begin near each other. Even after 100 overhand shuffles (with p=0.5), the positions of many cards are only slightly different from their initial positions. In contrast, a riffle shuffle is an effective way to randomize a deck in approximately seven shuffles.

The post Shuffling smackdown: Overhand shuffle versus riffle shuffle appeared first on The DO Loop.