Bootstrap and Resampling

January 3, 2018
 

I wrote more than 100 posts for The DO Loop blog in 2017. The most popular articles were about SAS programming tips, statistical data analysis, and simulation and bootstrap methods. Here are the most popular articles from 2017 in each category.

General SAS programming techniques

Statistics and Data Analysis

Observed and expected proportions of M&M colors (2017)
  • M&M Colors: It's no surprise that a statistical analysis of the color distribution of M&M candies was one of the most popular articles. Some people are content to know that the candies are delicious, but thousands wanted to read about whether blue and orange candies occur more often than brown.
  • Interpretation of Correlation: Correlation is one of the simplest multivariate statistics, but it can be interpreted in many ways: algebraic, geometric, in terms of regression, and more. This article describes seven ways to view correlation.
  • Winsorize Data: Before you ask "how can I Winsorize data" to eliminate outliers, you should ask "what is Winsorization" and "what are the pitfalls?" This article presents the advantages and disadvantages of Winsorizing data.

Simulation and Bootstrapping

Was your New Year's resolution to learn more about SAS? Did you miss any of these popular posts? Take a moment to read (or re-read!) one of these top 10 posts from the past year.

The post The top 10 posts from The DO Loop in 2017 appeared first on The DO Loop.

July 12, 2017
 

I recently showed how to compute a bootstrap percentile confidence interval in SAS. The percentile interval is a simple "first-order" interval that is formed from quantiles of the bootstrap distribution. However, it has two limitations. First, it does not use the estimate for the original data; it is based only on bootstrap resamples. Second, it does not adjust for skewness in the bootstrap distribution. The so-called bias-corrected and accelerated bootstrap interval (the BCa interval) is a second-order accurate interval that addresses these issues. This article shows how to compute the BCa bootstrap interval in SAS. You can download the complete SAS program that implements the BCa computation.

As in the previous article, let's bootstrap the skewness statistic for the petal widths of 50 randomly selected flowers of the species Iris setosa. The following statements create a data set called SAMPLE and rename the analysis variable to 'x', which is used by the rest of the program:

data sample;
   set sashelp.Iris;     /* <== load your data here */
   where species = "Setosa";
   rename PetalWidth=x;  /* <== rename the analysis variable to 'x' */
run;

BCa interval: The main ideas

The main advantage to the BCa interval is that it corrects for bias and skewness in the distribution of bootstrap estimates. The BCa interval requires that you estimate two parameters. The bias-correction parameter, z0, is related to the proportion of bootstrap estimates that are less than the observed statistic. The acceleration parameter, a, is proportional to the skewness of the bootstrap distribution. You can use the jackknife method to estimate the acceleration parameter.

Assume that the data are independent and identically distributed. Suppose that you have already computed the original statistic and a large number of bootstrap estimates, as shown in the previous article. To compute a BCa confidence interval, you estimate z0 and a and use them to adjust the endpoints of the percentile confidence interval (CI). If the bootstrap distribution is positively skewed, the CI is adjusted to the right. If the bootstrap distribution is negatively skewed, the CI is adjusted to the left.
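To make the endpoint adjustment concrete, the following Python sketch (an illustration only, not part of the SAS program; the function name and arguments are hypothetical) computes the adjusted quantile levels from z0 and the acceleration parameter:

```python
from statistics import NormalDist

def bca_alphas(z0, ahat, alpha=0.05):
    """Return adjusted quantile levels (alpha1, alpha2) for a BCa interval."""
    N = NormalDist()                    # standard normal CDF and quantile
    zL = z0 + N.inv_cdf(alpha / 2)      # adjusted lower z-score
    zU = z0 + N.inv_cdf(1 - alpha / 2)  # adjusted upper z-score
    alpha1 = N.cdf(z0 + zL / (1 - ahat * zL))   # lower quantile level
    alpha2 = N.cdf(z0 + zU / (1 - ahat * zU))   # upper quantile level
    return alpha1, alpha2

# With no bias (z0=0) and no acceleration (a=0), the adjusted levels are
# (0.025, 0.975), so the BCa interval reduces to the percentile interval:
a1, a2 = bca_alphas(0.0, 0.0)
```

A positive z0 (which occurs when more than half of the bootstrap estimates fall below the observed statistic) raises both quantile levels, which shifts the interval to the right.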

Estimate the bias correction and acceleration

The mathematical details of the BCa adjustment are provided in Chernick and LaBudde (2011) and Davison and Hinkley (1997). My computations were inspired by Appendix D of Martinez and Martinez (2001). To make the presentation simpler, the program analyzes only univariate data.

The bias correction factor is related to the proportion of bootstrap estimates that are less than the observed statistic. The acceleration parameter is proportional to the skewness of the bootstrap distribution. You can use the jackknife method to estimate the acceleration parameter. The following SAS/IML modules encapsulate the necessary computations. As described in the jackknife article, the function 'JackSampMat' returns a matrix whose columns contain the jackknife samples and the function 'EvalStat' evaluates the statistic on each column of a matrix.

proc iml;
load module=(JackSampMat);             /* load helper function */
 
/* compute bias-correction factor from the proportion of bootstrap estimates 
   that are less than the observed estimate 
*/
start bootBC(bootEst, Est);
   B = ncol(bootEst)*nrow(bootEst);    /* number of bootstrap samples */
   propLess = sum(bootEst < Est)/B;    /* proportion of replicates less than observed stat */
   z0 = quantile("normal", propLess);  /* bias correction */
   return z0;
finish;
 
/* compute acceleration factor, which is related to the skewness of bootstrap estimates.
   Use jackknife replicates to estimate.
*/
start bootAccel(x);
   M = JackSampMat(x);                 /* each column is jackknife sample */
   jStat = EvalStat(M);                /* row vector of jackknife replicates */
   jackEst = mean(jStat`);             /* jackknife estimate */
   num = sum( (jackEst-jStat)##3 );
   den = sum( (jackEst-jStat)##2 );
   ahat = num / (6*den##(3/2));        /* ahat based on jackknife ==> not random */
   return ahat;
finish;

Compute the BCa confidence interval

With those helper functions defined, you can compute the BCa confidence interval. The following SAS/IML statements read the data, generate the bootstrap samples, compute the bootstrap distribution of estimates, and compute the 95% BCa confidence interval:

/* Input: matrix where each column of X is a bootstrap sample. 
   Return a row vector of statistics, one for each column. */
start EvalStat(M); 
   return skewness(M);               /* <== put your computation here */
finish;
 
alpha = 0.05;
B = 5000;                            /* B = number of bootstrap samples */
use sample; read all var "x"; close; /* read univariate data into x */
 
call randseed(1234567);
Est = EvalStat(x);                   /* 1. compute observed statistic */
s = sample(x, B // nrow(x));         /* 2. generate many bootstrap samples (N x B matrix) */
bStat = T( EvalStat(s) );            /* 3. compute the statistic for each bootstrap sample */
bootEst = mean(bStat);               /* 4. summarize bootstrap distrib, such as mean */
z0 = bootBC(bStat, Est);             /* 5. bias-correction factor */
ahat = bootAccel(x);                 /* 6. ahat = acceleration of std error */
print z0 ahat;
 
/* 7. adjust quantiles for 100*(1-alpha)% bootstrap BCa interval */
zL = z0 + quantile("normal", alpha/2);    
alpha1 = cdf("normal", z0 + zL / (1-ahat*zL));
zU = z0 + quantile("normal", 1-alpha/2);
alpha2 = cdf("normal", z0 + zU / (1-ahat*zU));
call qntl(CI, bStat, alpha1//alpha2); /* BCa interval */
 
R = Est || BootEst || CI`;          /* combine results for printing */
print R[c={"Obs" "BootEst" "LowerCL" "UpperCL"} format=8.4 L="95% Bootstrap Bias-Corrected CI (BCa)"];
Bias-corrected and accelerated (BCa) interval for a bootstrap distribution

The BCa interval is [0.66, 2.29]. For comparison, the bootstrap percentile CI for the bootstrap distribution, which was computed in the previous bootstrap article, is [0.49, 1.96].

Notice that by using the bootBC and bootAccel helper functions, the program is compact and easy to read. One of the advantages of the SAS/IML language is the ease with which you can define user-defined functions that encapsulate sub-computations.

You can visualize the analysis by plotting the bootstrap distribution overlaid with the observed statistic and the 95% BCa confidence interval. Notice that the BCa interval is not symmetric about the bootstrap estimate. Compared to the bootstrap percentile interval (see the previous article), the BCa interval is shifted to the right.

Bootstrap distribution and bias-corrected and accelerated BCa confidence interval

There is another second-order method that is related to the BCa interval. It is called the ABC method and it uses an analytical expression to approximate the endpoints of the BCa interval. See p. 214 of Davison and Hinkley (1997).

In summary, bootstrap computations in the SAS/IML language can be very compact. By writing and re-using helper functions, you can encapsulate some of the tedious calculations into a high-level function, which makes the resulting program easier to read. For univariate data, you can often implement bootstrap computations without writing any loops by using the matrix-vector nature of the SAS/IML language.

If you do not have access to SAS/IML software or if the statistic that you want to bootstrap is produced by a SAS procedure, you can use SAS-supplied macros (%BOOT, %JACK,...) for bootstrapping. The macros include the %BOOTCI macro, which supports the percentile interval, the BCa interval, and others. For further reading, the web page for the macros includes a comparison of the CI methods.

The post The bias-corrected and accelerated (BCa) bootstrap interval appeared first on The DO Loop.

July 10, 2017
 

I previously wrote about how to compute a bootstrap confidence interval in Base SAS. As a reminder, the bootstrap method consists of the following steps:

  1. Compute the statistic of interest for the original data
  2. Resample B times from the data to form B bootstrap samples. B is usually a large number, such as B = 5000.
  3. Compute the statistic on each bootstrap sample. This creates the bootstrap distribution, which approximates the sampling distribution of the statistic.
  4. Use the bootstrap distribution to obtain bootstrap estimates such as standard errors and confidence intervals.
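As a language-neutral sketch of these four steps (hedged: this Python fragment only illustrates the algorithm and is not part of the SAS program), a percentile bootstrap can be written as:

```python
import random
import statistics

def bootstrap_percentile_ci(data, stat=statistics.mean, B=5000, alpha=0.05):
    """Percentile bootstrap: resample B times, compute the statistic on
    each resample, and take the central quantiles of the replicates."""
    est = stat(data)                                  # 1. observed statistic
    boot = [stat(random.choices(data, k=len(data)))   # 2-3. bootstrap replicates
            for _ in range(B)]
    boot.sort()
    lo = boot[int((alpha / 2) * B)]                   # 4. percentile interval
    hi = boot[int((1 - alpha / 2) * B) - 1]
    return est, (lo, hi)

random.seed(1)
est, ci = bootstrap_percentile_ci([3, 5, 7, 7, 8, 9, 12, 14], B=2000)
```

The SAS/IML implementation that follows performs the same steps with matrix operations instead of an explicit loop.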

In my book Simulating Data with SAS, I describe efficient ways to bootstrap in the SAS/IML matrix language. Whereas the Base SAS implementation of the bootstrap requires calls to four or five procedures, the SAS/IML implementation requires only a few function calls. This article shows how to compute a bootstrap confidence interval from percentiles of the bootstrap distribution for univariate data. How to bootstrap multivariate data is discussed on p. 189 of Simulating Data with SAS.

Skewness of univariate data

Let's use the bootstrap to find a 95% confidence interval for the skewness statistic. The data are the petal widths of a sample of 50 randomly selected flowers of the species Iris setosa. The measurements (in mm) are contained in the data set Sashelp.Iris. So that you can easily generalize the code to other data, the following statements create a data set called SAMPLE and rename the analysis variable to 'x'. If you do the same with your data, you should be able to reuse the program by modifying only a few statements.

The following DATA step renames the data set and the analysis variable. A call to PROC UNIVARIATE graphs the data and provides a point estimate of the skewness:

data sample;
   set sashelp.Iris;     /* <== load your data here */
   where species = "Setosa";
   rename PetalWidth=x;  /* <== rename the analysis variable to 'x' */
run;
 
proc univariate data=sample;
   var x;
   histogram x;
   inset N Skewness (6.3) / position=NE;
run;
Distribution of petal width for 50 random Iris setosa flowers

The petal widths have a highly skewed distribution, with a skewness estimate of 1.25.

A bootstrap analysis in SAS/IML

Running a bootstrap analysis in SAS/IML requires only a few lines to compute the confidence interval, but to help you generalize the problem to statistics other than the skewness, I wrote a function called EvalStat. The input argument is a matrix where each column is a bootstrap sample. The function returns a row vector of statistics, one for each column. (For the skewness statistic, the EvalStat function is a one-liner.) The EvalStat function is called twice: once on the original column vector of data and again on a matrix that contains bootstrap samples in each column. You can create the matrix by calling the SAMPLE function in SAS/IML, as follows:

/* Basic bootstrap percentile CI. The following program is based on 
   Chapter 15 of Wicklin (2013) Simulating Data with SAS, pp 288-289. 
*/
proc iml;
/* Function to evaluate a statistic for each column of a matrix.
   Return a row vector of statistics, one for each column. */
start EvalStat(M); 
   return skewness(M);               /* <== put your computation here */
finish;
 
alpha = 0.05;
B = 5000;                            /* B = number of bootstrap samples */
use sample; read all var "x"; close; /* read univariate data into x */
 
call randseed(1234567);
Est = EvalStat(x);                   /* 1. compute observed statistic */
s = sample(x, B // nrow(x));         /* 2. generate many bootstrap samples (N x B matrix) */
bStat = T( EvalStat(s) );            /* 3. compute the statistic for each bootstrap sample */
bootEst = mean(bStat);               /* 4. summarize bootstrap distrib such as mean, */
SE = std(bStat);                             /* standard deviation,                  */
call qntl(CI, bStat, alpha/2 || 1-alpha/2);  /* and 95% bootstrap percentile CI      */
 
R = Est || BootEst || SE || CI`;     /* combine results for printing */
print R[format=8.4 L="95% Bootstrap Pctl CI"  
        c={"Obs" "BootEst" "StdErr" "LowerCL" "UpperCL"}];

The SAS/IML program for the bootstrap is very compact. It is important to keep track of the dimensions of each variable. The EST, BOOTEST, and SE variables are scalars. The S variable is an N x B matrix, where N is the sample size. The BSTAT variable is a column vector with B elements. The CI variable is a two-element column vector.

The output summarizes the bootstrap analysis. The estimate for the skewness of the observed data is 1.25. The bootstrap distribution (the skewness of the bootstrap samples) enables you to estimate three common quantities:

  • The bootstrap estimate of the skewness is 1.18. This value is computed as the mean of the bootstrap distribution.
  • The bootstrap estimate of the standard error of the skewness is 0.38. This value is computed as the standard deviation of the bootstrap distribution.
  • The bootstrap percentile 95% confidence interval is computed as the central 95% of the bootstrap estimates, which is the interval [0.49, 1.96].

It is important to realize that these estimates will vary slightly if you use different random-number seeds or a different number of bootstrap iterations (B).

You can visualize the bootstrap distribution by drawing a histogram of the bootstrap estimates. You can overlay the original estimate (or the bootstrap estimate) and the endpoints of the confidence interval, as shown below.

In summary, you can implement the bootstrap method in the SAS/IML language very compactly. You can use the bootstrap distribution to estimate the parameter and standard error. The bootstrap percentile method, which is based on quantiles of the bootstrap distribution, is a simple way to obtain a confidence interval for a parameter. You can download the full SAS program that implements this analysis.

The post Bootstrap estimates in SAS/IML appeared first on The DO Loop.

June 21, 2017
 

One way to assess the precision of a statistic (a point estimate) is to compute the standard error, which is the standard deviation of the statistic's sampling distribution. A relatively large standard error indicates that the point estimate should be viewed with skepticism, either because the sample size is small or because the data themselves have a large variance. The jackknife method is one way to estimate the standard error of a statistic.

Some simple statistics have explicit formulas for the standard error, but the formulas often assume normality of the data or a very large sample size. When your data do not satisfy the assumptions or when no formula exists, you can use resampling techniques to estimate the standard error. Bootstrap resampling is one choice, and the jackknife method is another. Unlike the bootstrap, which uses random samples, the jackknife is a deterministic method.

This article explains the jackknife method and describes how to compute jackknife estimates in SAS/IML software. This is best when the statistic that you need is also implemented in SAS/IML. If the statistic is computed by a SAS procedure, you might prefer to download and use the %JACK macro, which does not require SAS/IML.

The jackknife method: Leave one out!

The jackknife method estimates the standard error (and bias) of statistics without making any parametric assumptions about the population that generated the data. It uses only the sample data.

The jackknife method manufactures jackknife samples from the data. A jackknife sample is a "leave-one-out" resample of the data. If there are n observations, then there are n jackknife samples, each of size n-1. If the original data are {x1, x2,..., xn}, then the i_th jackknife sample is
{x1,..., xi-1,xi+1,..., xn}
You then compute n jackknife replicates. A jackknife replicate is the statistic of interest computed on a jackknife sample. You can obtain an estimate of the standard error from the variance of the jackknife replicates. The jackknife method is summarized by the following:

  1. Compute a statistic, T, on the original sample of size n.
  2. For i = 1 to n, repeat the following:
    • Leave out the i_th observation to form the i_th jackknife sample.
    • Compute the i_th jackknife replicate statistic, Ti, by computing the statistic on the i_th jackknife sample.
  3. Compute the average (mean) of the jackknife replicates: Tavg = Σi Ti / n.
  4. (Optional) Estimate the bias as BiasTjack = (n-1)(Tavg - T)
  5. Estimate the standard error as SEjack = sqrt( (n-1)/n Σi (Ti - Tavg)**2 )
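These five steps can be sketched in a few lines of Python (an illustration only; the SAS/IML implementation follows later in this article). A useful sanity check: for the sample mean, the jackknife standard error reduces exactly to the classical formula s/sqrt(n).

```python
import math
import statistics

def jackknife(data, stat=statistics.mean):
    """Leave-one-out jackknife estimates of bias and standard error."""
    n = len(data)
    T = stat(data)                              # 1. statistic on full sample
    reps = [stat(data[:i] + data[i+1:])         # 2. leave-one-out replicates
            for i in range(n)]
    T_avg = statistics.mean(reps)               # 3. mean of the replicates
    bias = (n - 1) * (T_avg - T)                # 4. jackknife bias estimate
    se = math.sqrt((n - 1) / n *                # 5. jackknife standard error
                   sum((t - T_avg) ** 2 for t in reps))
    return T, bias, se

T, bias, se = jackknife([58, 67, 74, 74, 80, 89, 95, 97, 98, 107])
# For the mean, the bias is 0 and se equals stdev/sqrt(n).
```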

Data for a jackknife example

Resampling methods are not hard, but the notation in some books can be confusing. To clarify the method, let's choose a particular statistic and look at example data. The following example is from Martinez and Martinez (2001, 1st Ed, p. 241), which is also the source for this article. The data are the LSAT scores and grade-point averages (GPAs) for 15 randomly chosen students who applied to law school.

data law;
input lsat gpa @@;
datalines;
653 3.12   576 3.39   635 3.30   661 3.43   605 3.13
578 3.03   572 2.88   545 2.76   651 3.36   555 3.00
580 3.07   594 2.96   666 3.44   558 2.81   575 2.74
;

The statistic of interest (T) will be the correlation coefficient between the LSAT and the GPA variables for the n=15 observations. The observed correlation is TData = 0.776. The standard error of T helps us understand how much T would change if we took a different random sample of 15 students. The next sections show how to implement the jackknife analysis in the SAS/IML language.

Construct a jackknife sample in SAS

The SAS/IML matrix language is the simplest way to compute general jackknife estimates. If X is an n x p data matrix, you can obtain the i_th jackknife sample by excluding the i_th row of X. The following two helper functions encapsulate some of the computations. The SeqExclude function returns the index vector {1, 2, ..., i-1, i+1, ..., n}. The JackSamp function returns the data matrix without the i_th row:

proc iml;
/* return the vector {1,2,...,i-1, i+1,...,n}, which excludes the scalar value i */ 
start SeqExclude(n,i);
   if i=1 then return 2:n;
   if i=n then return 1:n-1;
   return (1:i-1) || (i+1:n);
finish;
 
/* return the i_th jackknife sample for (n x p) matrix X */
start JackSamp(X,i);
   n = nrow(X);
   return X[ SeqExclude(n, i), ];  /* return data without i_th row */
finish;

The jackknife method for multivariate data in SAS

By using the helper functions, you can carry out each step of the jackknife method. To make the method easy to modify for other statistics, I've written a function called EvalStat which computes the correlation coefficient. This function is called on the original data and on each jackknife sample.

/* compute the statistic in this function */
start EvalStat(X);
   return corr(X)[2,1];  /* <== Example: return correlation between two variables */
finish;
 
/* read the data into a (n x 2) data matrix */
use law;  read all var {"gpa" "lsat"} into X;  close;
 
/* 1. compute statistic on observed data */
T = EvalStat(X);
 
/* 2. compute same statistic on each jackknife sample */
n = nrow(X);
T_LOO = j(n,1,.);             /* LOO = "Leave One Out" */
do i = 1 to n;
   Y = JackSamp(X,i);
   T_LOO[i] = EvalStat(Y); 
end;
 
/* 3. compute mean of the LOO statistics */
T_Avg = mean( T_LOO );  
 
/* 4 & 5. compute jackknife estimates of bias and standard error */
biasJack = (n-1)*(T_Avg - T);
stdErrJack = sqrt( (n-1)/n * ssq(T_LOO - T_Avg) );
result = T || T_Avg || biasJack || stdErrJack;
print result[c={"Estimate" "Mean Jackknife Estimate" "Bias" "Std Error"}];
Jackknife estimate of standard error and bias for a correlation coefficient of multivariate data

The output shows that the estimate of bias for the correlation coefficient is very small. The standard error of the correlation coefficient is estimated as 0.14, which is about 18% of the estimate.

To use this code yourself, simply modify the EvalStat function. The remainder of the program does not need to change.

The jackknife method in SAS/IML: Univariate data

When the data are univariate, you can sometimes eliminate the loop that computes jackknife samples and evaluates the jackknife replicates. If X is a column vector, you can compute the (n-1) x n matrix whose i_th column represents the i_th jackknife sample. (To prevent huge matrices, this method is best for n < 20000.) Because many statistical functions in SAS/IML operate on the columns of a matrix, you can often compute the jackknife replicates in a vectorized manner.

In the following program, the JackSampMat function returns the matrix of jackknife samples for univariate data. The function calls the REMOVE function in SAS/IML, which deletes specified elements of a matrix and returns the results in a row vector. The EvalStatMat function takes the matrix of jackknife samples and returns a row vector of statistics, one for each column. In this example, the function returns the sample standard deviation.

/* If x is univariate, you can construct a matrix where
   each column contains a jackknife sample.
   Use for univariate column vector x when n < 20000 */
start JackSampMat(x); 
   n = nrow(x);
   B = j(n-1, n,0);
   do i = 1 to n;
      B[,i] = remove(x, i)`;   /* transpose to column vector */
   end;
   return B;
finish;
 
/* Input: matrix where each column of X is a bootstrap sample. 
   Return a row vector of statistics, one for each column. */
start EvalStatMat(x); 
   return std(x);   /* <== Example: return std dev of each sample */
finish;

Let's use these functions to get a jackknife estimate of the standard error for the statistic (the standard deviation). The data (from Martinez and Martinez, p. 246) have been studied by many researchers and represent the weight gain in grams for 10 rats who were fed a low-protein diet of cereal:

x = {58,67,74,74,80,89,95,97,98,107}; /* Weight gain (g) for 10 rats */
n = nrow(x);                          /* sample size */
 
/* optional: visualize the matrix of jackknife samples */
*M = JackSampMat(x);
*print M[c=("S1":"S10") r=("1":"9")];
 
/* Jackknife method for univariate data */
/* 1. compute observed statistic */
T = EvalStatMat(x);    
/* 2. compute same statistic on each jackknife sample */
T_LOO = EvalStatMat( JackSampMat(x) ); /* LOO = "Leave One Out" */
/* 3. compute mean of the LOO statistics */
T_Avg = mean( T_LOO` );                /* transpose T_LOO */
/* 4 & 5. compute jackknife estimates of bias and standard error */
biasJack = (n-1)*(T_Avg - T);
stdErrJack = sqrt( (n-1)/n * ssq(T_LOO - T_Avg) );
result = T || T_Avg || biasJack || stdErrJack;
print result[c={"Estimate" "Mean Jackknife Estimate" "Bias" "Std Error"}];
Jackknife estimate of standard error and bias for the standard deviation of univariate data

The output shows that the standard deviation of these data is about 15.7 grams. The jackknife method estimates the standard error for this statistic as about 2.9 grams, which is about 18% of the estimate.

In summary, jackknife estimates are straightforward to implement in SAS/IML. This article shows a general implementation that works for all data and a specialized implementation that works for univariate data. In both cases, you can adapt the code for your use by modifying the function that computes the statistic on a data set. This approach is useful and efficient when the statistic is implemented in SAS/IML.

The post The jackknife method to estimate standard errors in SAS appeared first on The DO Loop.

August 17, 2016
 

Last week I showed how to use the simple bootstrap to randomly resample from the data to create B bootstrap samples, each containing N observations. The simple bootstrap is equivalent to sampling from the empirical cumulative distribution function (ECDF) of the data. An alternative bootstrap technique is called the smooth bootstrap. In the smooth bootstrap you add a small amount of random noise to each observation that is selected during the resampling process. This is equivalent to sampling from a kernel density estimate, rather than from the empirical density.

The example in this article is adapted from Chapter 15 of Wicklin (2013), Simulating Data with SAS.

The bootstrap method in SAS/IML

My previous article used the bootstrap method to investigate the sampling distribution of the skewness statistic for the SepalWidth variable in the Sashelp.Iris data. I used PROC SURVEYSELECT to resample the data and used PROC MEANS to analyze properties of the bootstrap distribution. You can also use SAS/IML to implement the bootstrap method.

The following SAS/IML statements create 5000 bootstrap samples of the SepalWidth data. However, instead of computing a bootstrap distribution for the skewness statistic, this program computes a bootstrap distribution for the median statistic. The SAMPLE function enables you to resample from the data.

data sample(keep=x);
   set Sashelp.Iris(where=(Species="Virginica") rename=(SepalWidth=x));
run;
 
/* Basic bootstrap confidence interval for median  */
%let NumSamples = 5000;                 /* number of bootstrap resamples */
proc iml;
use Sample;  read all var {x};  close Sample;        /* read data */
call randseed(12345);                   /* set random number seed */
obsStat = median(x);                    /* compute statistic on original data */
s = sample(x, &NumSamples // nrow(x));  /* bootstrap samples: 50 x NumSamples */
D = T( median(s) );                     /* bootstrap distribution for statistic */
call qntl(q, D, {0.025 0.975});         /* basic 95% bootstrap CI */
results = obsStat || q`;
print results[L="Bootstrap Median" c={"obsStat" "P025" "P975"}];
Simple bootstrap confidence interval for median

The SAS/IML program is very compact. The MEDIAN function computes the median for the original data. The SAMPLE function generates 5000 resamples; each bootstrap sample is a column of the s matrix. The MEDIAN function then computes the median of each column. The QNTL subroutine computes a 95% confidence interval for the median as [28, 30]. (Incidentally, you can use PROC UNIVARIATE to compute distribution-free confidence intervals for standard percentiles such as the median.)

The bootstrap distribution of the median

The following statements create a histogram of the bootstrap distribution of the median:

title "Bootstrap Distribution for Median";
call histogram(D) label="Median";           /* create histogram in SAS/IML */
Bootstrap distribution of the median. The data are rounded values.

I was surprised when I first saw a bootstrap distribution like this. The distribution contains discrete values. More than 80% of the bootstrap samples have a median value of 30. The remaining samples have values that are integers or half-integers.

This distribution is typical of the bootstrap distribution for a percentile. Three factors contribute to the shape:

  • The measurements are rounded to the nearest millimeter. Thus the data are discrete integers.
  • The sample median is always a data value or (for N even) the midpoint between two data values. In fact, this statement is true for all percentiles.
  • In the Sample data, the value 30 is not only the median, but also the mode. Consequently, many bootstrap samples will have 30 as the median value.



Smooth bootstrap

Although the bootstrap distribution for the median is correct, it is somewhat unsatisfying. Widths and lengths represent continuous quantities. Consequently, the true sampling distribution of the median statistic is continuous.

The bootstrap distribution would look more continuous if the data had been measured with more precision. Although you cannot change the data, you can change the way that you create bootstrap samples. Instead of drawing resamples from the (discrete) ECDF, you can randomly draw samples from a kernel density estimate (KDE) of the data. The resulting samples will not contain data values. Instead, they will contain values that are randomly drawn from a continuous KDE.

You have to make two choices for the KDE: the shape of the kernel and the bandwidth. This article explores two possible choices:

  • Uniform kernel: A recorded measurement of 30 mm means that the true value of the sepal was in the interval [29.5, 30.5). In general, a recorded value of x means that the true value is in [x-0.5, x+0.5). Assuming that any point in that interval is equally likely leads to a uniform kernel with bandwidth 0.5.
  • Normal kernel: You can assume that the true measurement is normally distributed, centered on the measured value, and is very likely to be within [x-0.5, x+0.5). For a normal distribution, 95% of the probability is contained in ±2σ of the mean, so you could choose σ=0.25 and assume that the true value for the measurement x is in the distribution N(x, 0.25).
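For intuition, here is a rough Python sketch of the uniform-kernel case (illustrative only; the data values below are made up and the SAS/IML version follows): each resampled value is jittered by U(-h, h) noise before the median is computed.

```python
import random
import statistics

def smooth_bootstrap_medians(data, B=5000, h=0.5):
    """Smooth bootstrap with a uniform kernel: resample with replacement,
    jitter each value by U(-h, h), and record the median of each resample."""
    medians = []
    for _ in range(B):
        resample = [x + random.uniform(-h, h)
                    for x in random.choices(data, k=len(data))]
        medians.append(statistics.median(resample))
    return medians

random.seed(12345)
rounded = [30, 30, 30, 28, 29, 31, 30, 32, 30, 28]  # hypothetical rounded data
D = smooth_bootstrap_medians(rounded, B=2000)
# Unlike the simple bootstrap, the smoothed medians are (almost surely)
# all distinct, so the bootstrap distribution looks continuous.
```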

For more about the smooth bootstrap, see Davison and Hinkley (1997) Bootstrap Methods and their Application.

Smooth bootstrap in SAS/IML

For the iris data, the uniform kernel seems intuitively appealing. The following SAS/IML program defines a function named SmoothUniform that randomly chooses B samples and adds a random U(-h, h) variate to each data point. The medians of the columns form the bootstrap distribution.

/* randomly draw a point from x. Add noise from U(-h, h) */
start SmoothUniform(x, B, h);
   N = nrow(x) * ncol(x);
   s = Sample(x, N // B);               /* B x N matrix */
   eps = j(B, N);                       /* allocate vector */
   call randgen(eps, "Uniform", -h, h); /* fill vector */
   return( s + eps );                   /* add random uniform noise */
finish;
 
s = SmoothUniform(x, &NumSamples, 0.5); /* columns are bootstrap samples from KDE */
D = T( median(s) );                     /* median of each col is bootstrap distrib */
BSEst = mean(D);                        /* bootstrap estimate of median */
call qntl(q, D, {0.025 0.975});         /* basic 95% bootstrap CI */
results = BSEst || q`;
print results[L="Smooth Bootstrap (Uniform Kernel)" c={"Est" "P025" "P975"}];
Smooth bootstrap estimate and confidence interval for the median

The smooth bootstrap distribution (not shown) is continuous. The mean of the distribution is the bootstrap estimate for the median. The estimate for this run is 29.8. The central 95% of the smooth bootstrap distribution is [29.77, 29.87]. The bootstrap estimate is close to the observed median, but the CI is much smaller than the earlier simple CI. Notice that the observed median (which is computed on the rounded data) is not in the 95% CI from the smooth bootstrap distribution.

Many researchers in density estimation state that the shape of the kernel function does not have a strong impact on the density estimate (Scott (1992), Multivariate Density Estimation, p. 141). Nevertheless, the following SAS/IML statements define a function called SmoothNormal that implements a smoothed bootstrap with a normal kernel:

/* Smooth bootstrap with normal kernel and sigma = h */
start SmoothNormal(x, B, h);
   N = nrow(x) * ncol(x);
   s = Sample(x, N // B);                /* B x N matrix */
   eps = j(B, N);                        /* allocate B x N matrix */
   call randgen(eps, "Normal", 0, h);    /* fill with N(0, h) noise */
   return( s + eps );                    /* add random normal variate */
finish;
 
s = SmoothNormal(x, &NumSamples, 0.25);  /* bootstrap samples from KDE */

The mean of this smooth bootstrap distribution is 29.89. The central 95% interval is [29.86, 29.91]. As expected, these values are similar to the values obtained by using the uniform kernel.

Summary of the smooth bootstrap

In summary, the SAS/IML language provides a compact and efficient way to implement the bootstrap method for a univariate statistic such as the skewness or median. A visualization of the bootstrap distribution of the median reveals that the distribution is discrete due to the rounded data values and the statistical properties of percentiles. If you choose, you can "undo" the rounding by implementing the smooth bootstrap method. The smooth bootstrap is equivalent to drawing bootstrap samples from a kernel density estimate of the data. The resulting bootstrap distribution is continuous and gives a smaller confidence interval for the median of the population.

tags: Bootstrap and Resampling

The post The smooth bootstrap method in SAS appeared first on The DO Loop.

8月 102016
 

A common question is "how do I compute a bootstrap confidence interval in SAS?" As a reminder, the bootstrap method consists of the following steps:

  • Compute the statistic of interest for the original data
  • Resample B times from the data to form B bootstrap samples. How you resample depends on the null hypothesis that you are testing.
  • Compute the statistic on each bootstrap sample. This creates the bootstrap distribution, which approximates the sampling distribution of the statistic under the null hypothesis.
  • Use the approximate sampling distribution to obtain bootstrap estimates such as standard errors, confidence intervals, and evidence for or against the null hypothesis.
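These four steps can be sketched compactly in Python (NumPy only; the data and statistic here are illustrative placeholders, not part of the SAS example):

```python
import numpy as np

def bootstrap_percentile_ci(x, stat, B=5000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    observed = stat(x)                                  # 1. statistic on original data
    idx = rng.integers(0, len(x), size=(B, len(x)))     # 2. B resamples with replacement
    boot = np.apply_along_axis(stat, 1, x[idx])         # 3. bootstrap distribution
    lo, hi = np.quantile(boot, [alpha/2, 1 - alpha/2])  # 4. percentile interval
    return observed, lo, hi

x = np.random.default_rng(42).normal(size=50)
obs, lo, hi = bootstrap_percentile_ci(x, np.mean)
```

The rest of this post carries out the same four steps in Base SAS, where the resampling and BY-group tricks described below replace the explicit index arithmetic.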

The papers by Cassell ("Don't be Loopy", 2007; "BootstrapMania!", 2010) describe ways to bootstrap efficiently in SAS. The basic idea is to use the DATA step, PROC SURVEYSELECT, or the SAS/IML SAMPLE function to generate the bootstrap samples in a single data set. Then use a BY statement to carry out the analysis on each sample. Using the BY-group approach is much faster than using macro loops.

To illustrate bootstrapping in Base SAS, this article shows how to compute a simple bootstrap confidence interval for the skewness statistic by using the bootstrap percentile method. The example is adapted from Chapter 15 of Simulating Data with SAS, which discusses resampling and bootstrap methods in SAS. SAS also supplies the %BOOT and %BOOTCI macros, which implement bootstrap methods and several kinds of confidence intervals.

The accuracy of a statistical point estimate

The following statements define a data set called Sample. The data are measurements of the sepal width for 50 randomly chosen iris flowers of the species iris Virginica. The call to PROC MEANS computes the skewness of the sample:

data sample(keep=x);
   set Sashelp.Iris(where=(Species="Virginica"));
   rename SepalWidth=x;
run;
 
/* 1. compute value of the statistic on original data: Skewness = 0.366 */
proc means data=sample nolabels Skew;  var x;  run;

The sample skewness for these data is 0.366. This value estimates the skewness of sepal widths in the population of all i. Virginica. You can ask two questions: (1) How accurate is the estimate, and (2) Do the data indicate that the distribution of the population is skewed? An answer to (1) is provided by the standard error of the skewness statistic. One way to answer question (2) is to compute a confidence interval for the skewness and see whether it contains 0.

PROC MEANS does not provide a standard error or a confidence interval for the skewness, but the next section shows how to use bootstrap methods to estimate these quantities.

Resampling with PROC SURVEYSELECT

For many resampling schemes, PROC SURVEYSELECT is the simplest way to generate bootstrap samples. The following statements generate 5000 bootstrap samples by repeatedly drawing 50 random observations (with replacement) from the original data:

%let NumSamples = 5000;       /* number of bootstrap resamples */
/* 2. Generate many bootstrap samples */
proc surveyselect data=sample NOPRINT seed=1
     out=BootSSFreq(rename=(Replicate=SampleID))
     method=urs              /* resample with replacement */
     samprate=1              /* each bootstrap sample has N observations */
     /* OUTHITS                 use this option to get one obs per selected unit */
     reps=&NumSamples;       /* generate NumSamples bootstrap resamples */
run;

The output data set represents 5000 samples of size 50, but it contains fewer than 250,000 observations. That is because the SURVEYSELECT procedure generates a variable named NumberHits that records the frequency of each observation in each sample. You can use this variable in the FREQ statement of many SAS procedures, including PROC MEANS. If the procedure that you are using does not support a frequency variable, you can use the OUTHITS option in the PROC SURVEYSELECT statement to obtain a data set that contains all 250,000 observations.
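The frequency-variable representation is just a compact encoding of the resample. A small Python sketch (hypothetical indices, not the iris data) shows the equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
draws = rng.integers(0, n, size=n)                 # one resample: n draws with replacement
ids, hits = np.unique(draws, return_counts=True)   # compact form: id plus frequency
expanded = np.repeat(ids, hits)                    # "OUTHITS" form: one row per draw
```

The (ids, hits) pairs carry exactly the same information as the full list of draws, which is why a frequency variable is sufficient for downstream procedures.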

BY group analysis of bootstrap samples

The following call to PROC MEANS computes 5000 skewness statistics, one for each of the bootstrap samples. The NOPRINT option is used to suppress the results from appearing on your monitor. (You can read about why it is important to suppress ODS during a bootstrap computation.) The 5000 skewness statistics are written to a data set called OutStats for subsequent analysis:

/* 3. Compute the statistic for each bootstrap sample */
proc means data=BootSSFreq noprint;
   by SampleID;
   freq NumberHits;
   var x;
   output out=OutStats skew=Skewness;  /* approx sampling distribution */
run;

Visualize the bootstrap distribution

The bootstrap distribution tells you how the statistic (in this case, the skewness) might vary due to random sampling variation. You can use a histogram to visualize the bootstrap distribution of the skewness statistic:

title "Bootstrap Distribution";
%let Est = 0.366;
proc sgplot data=OutStats;
   label Skewness= ;
   histogram Skewness;
   /* Optional: draw reference line at observed value and draw 95% CI */
   refline &Est / axis=x lineattrs=(color=red) 
                  name="Est" legendlabel="Observed Statistic = &Est";
   refline -0.44737 0.96934  / axis=x lineattrs=(color=blue) 
                  name="CI" legendlabel="95% CI";
   keylegend "Est" "CI";
run;
Bootstrap distribution with 95% bootstrap confidence interval in SAS

In this graph, the REFLINE statement is used to display (in red) the observed value of the statistic for the original data. A second REFLINE statement plots (in blue) an approximate 95% confidence interval for the skewness parameter, which is computed in the next section. The bootstrap confidence interval contains 0, thus you cannot conclude that the skewness parameter is significantly different from 0.

Compute a bootstrap confidence interval in SAS

The standard deviation of the bootstrap distribution is an estimate for the standard error of the statistic. If the sampling distribution is approximately normal, you can use this fact to construct the usual Wald confidence interval about the observed value of the statistic. That is, if T is the observed statistic, then the endpoints of the 95% two-sided confidence interval are T ± 1.96 SE. (Or use the so-called bootstrap t interval by replacing 1.96 with the critical value of the t distribution with n-1 degrees of freedom.) The following call to PROC MEANS produces the standard error (not shown):

proc means data=OutStats nolabels N StdDev;
   var Skewness;
run;
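A hedged Python sketch of the same normal-theory (Wald) computation, using a simulated stand-in for the OutStats skewness column:

```python
import numpy as np

rng = np.random.default_rng(1)
boot = rng.normal(0.3, 0.3, size=5000)   # stand-in for the bootstrap skewness values
T = 0.366                                 # observed statistic from the original data
se = boot.std(ddof=1)                     # bootstrap estimate of the standard error
wald = (T - 1.96 * se, T + 1.96 * se)     # 95% Wald interval about the observed value
```

The Wald interval is centered on the observed statistic, whereas the percentile interval in the next section is built entirely from the bootstrap distribution.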

However, since the bootstrap distribution is an approximate sampling distribution, you don't need to rely on a normality assumption. Instead, you can use percentiles of the bootstrap distribution to estimate a confidence interval. For example, the following call to PROC UNIVARIATE computes a two-sided 95% confidence interval by using the lower 2.5th percentile and the upper 97.5th percentile of the bootstrap distribution:

/* 4. Use approx sampling distribution to make statistical inferences */
proc univariate data=OutStats noprint;
   var Skewness;
   output out=Pctl pctlpre =CI95_
          pctlpts =2.5  97.5       /* compute 95% bootstrap confidence interval */
          pctlname=Lower Upper;
run;
 
proc print data=Pctl noobs; run;
95% bootstrap confidence interval

As mentioned previously, the 95% bootstrap confidence interval contains 0. Although the observed skewness value (0.366) might not seem very close to 0, the bootstrap distribution shows that there is substantial variation in the skewness statistic in small samples.

The bootstrap percentile method is a simple way to obtain a confidence interval for many statistics. There are several more sophisticated methods for computing a bootstrap confidence interval, but this simple method provides an easy way to use the bootstrap to assess the accuracy of a point estimate. For an overview of bootstrap methods, see Davison and Hinkley (1997) Bootstrap Methods and their Application.

tags: Bootstrap and Resampling

The post Compute a bootstrap confidence interval in SAS appeared first on The DO Loop.

2月 152016
 

Many simulation and resampling tasks use one of four sampling methods. When you draw a random sample from a population, you can sample with or without replacement. At the same time, all individuals in the population might have equal probability of being selected, or some individuals might be more likely than others. Consequently, the four common sampling methods are shown in the following 2 x 2 table.

Four sampling methods in SAS: Sampling with and without replacement, with equal and unequal probability

In SAS, the SURVEYSELECT procedure is a standard way to generate random samples. The previous table gives the names of the four sampling methods, summarizes how to generate samples by using the SURVEYSELECT procedure in SAS/STAT, and shows how to use the SAMPLE function in SAS/IML.

The documentation for the SURVEYSELECT procedure uses terms that might not be familiar to programmers who are not survey statisticians. To help eliminate any confusion, the following sections describe the four common sampling methods and the corresponding METHOD= option in PROC SURVEYSELECT.

Sampling without replacement

When you sample without replacement, the probability of choosing each item changes after each draw. The size of the sample cannot exceed the number of items.

Simple random sampling (SRS) means sampling without replacement and with equal probability. Dealing cards from a 52-card deck is an example of SRS. Use the METHOD=SRS option in PROC SURVEYSELECT to request simple random sampling.

Probability proportional to size (PPS) means sampling without replacement and with unequal probability. The classic example is counting the colors in a sample of colored marbles drawn from an urn that contains colors in different proportions. Use the METHOD=PPS option in PROC SURVEYSELECT to request PPS sampling and specify the relative sizes (or the probability vector) of items by using the SIZE statement.

Sampling with replacement

When you sample with replacement, the probability of choosing each item does not change. The size of the sample can be arbitrarily large.

Unrestricted random sampling (URS) means sampling with replacement and with equal probability. Rolling a six-sided die and recording the face that appears is an example of URS. Use the METHOD=URS option in PROC SURVEYSELECT to request unrestricted random sampling.

Probability proportional to size with replacement means sampling with replacement and with unequal probability. An example is tossing two dice and recording the sum of the faces. Use the METHOD=PPS_WR option in PROC SURVEYSELECT to request PPS sampling with replacement. Use the SIZE statement to specify the relative sizes or the probability vector for each item.

These four sampling methods are useful to the statistical programmer because they are often used in simulation studies. For more information about using the SAS DATA step and PROC SURVEYSELECT for basic sampling, see "Selecting Unrestricted and Simple Random with Replacement Samples Using Base SAS and PROC SURVEYSELECT" (Chapman 2012). PROC SURVEYSELECT contains many other useful sampling methods. For an overview of more advanced methods, see "PROC SURVEYSELECT as a Tool for Drawing Random Samples" (Lewis 2013).
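For comparison, all four cells of the 2 x 2 table can be sketched with NumPy's choice function. (This is a rough analogue for illustration: NumPy's weighted without-replacement draw uses a sequential scheme, which does not match every PPS design that PROC SURVEYSELECT supports.)

```python
import numpy as np

rng = np.random.default_rng(7)
items = np.array(["A", "B", "C", "D"])
p = np.array([0.4, 0.3, 0.2, 0.1])   # unequal selection probabilities

srs    = rng.choice(items, size=3,  replace=False)        # SRS
pps    = rng.choice(items, size=3,  replace=False, p=p)   # PPS (sequential draws)
urs    = rng.choice(items, size=10, replace=True)         # URS
pps_wr = rng.choice(items, size=10, replace=True,  p=p)   # PPS with replacement
```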

tags: Bootstrap and Resampling, Simulation

The post Four essential sampling methods in SAS appeared first on The DO Loop.

2月 102016
 

How do you sample with replacement in SAS when the probability of choosing each observation varies? I was asked this question recently. The programmer thought he could use PROC SURVEYSELECT to generate the samples, but he wasn't sure which sampling technique he should use to sample with unequal probability. This article describes how to sample with replacement and unequal probability in SAS.

Sample with replacement and unequal probability with PROC SURVEYSELECT

The programmer's confusion is completely understandable. The SURVEYSELECT documentation is written for survey statisticians, who use specialized language to describe sampling methods. To a survey statistician, sampling with unequal probability is known as sampling with probability proportional to size (PPS), which is often abbreviated as PPS sampling. The SURVEYSELECT procedure has many methods for PPS sampling. For PPS sampling with replacement, specify the METHOD=PPS_WR option.

Sampling in proportion to size has many applications. For example, if you want to survey people at your company, you could randomly select a certain number of people in each building, where the probability of selection is proportional to the number of people who work in each building. Or you could use PPS sampling to obtain a representative sample across departments.

The following example demonstrates sampling with replacement and with unequal probability. Suppose a small town has five busy intersections. The town planners believe that the probability of an accident at an intersection is proportional to the traffic volume. They want to simulate the locations of 500 accidents by using PPS sampling with replacement, where the relative traffic volumes determine the probability of an accident's location.

The following data show the annual average daily traffic for each intersection. The call to the SURVEYSELECT procedure uses METHOD=PPS_WR and N=500 to simulate 500 accidents for these intersections. The SIZE statement specifies the relative traffic volume, which determines the probability of an accident at each intersection.

data Traffic;
label VehiclesPerDay = "Average Annualized Daily Traffic";
input Intersection $21. VehiclesPerDay;
format VehiclesPerDay comma.;
datalines;
Crash Pkwy/Danger Rd  25000
Fast Dr/Danger Rd     20000
Crazy St/Smash Blvd   17000
Crazy St/Collision Dr 14000
Crash Pkwy/Dent Dr    10000
;
 
/* sample with replacement, probability proportional to size */
proc surveyselect noprint data=Traffic out=Sample
     method=PPS_WR 
     seed=123 N=500;       /* use OUTHITS option if you want 500 obs */
   size VehiclesPerDay;    /* specify the probability variable */
run;
 
proc print data=Sample noobs; 
   var Intersection VehiclesPerDay NumberHits ExpectedHits; 
run;
Sample with replacement and unequal probability

As you can see, the counts of crashes in the simulated sample are close to their expected values. Each time you run a simulation you will get slightly different values. You can use the REPS= option in the PROC SURVEYSELECT statement to generate multiple samples.

I have blogged about this technique before, but my previous focus was on simulating data from a multinomial distribution. The METHOD=PPS_WR option generates counts that follow a multinomial distribution with proportion equal to the standardized "size" variable (which is VehiclesPerDay / sum(VehiclesPerDay)).

If you want a data set that contains all 500 draws, rather than the counts, you can add the OUTHITS option to the PROC SURVEYSELECT statement.
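The multinomial connection is easy to check in Python (the traffic counts are copied from the example; the multinomial draw stands in for the PPS_WR counts):

```python
import numpy as np

rng = np.random.default_rng(123)
vehicles = np.array([25000, 20000, 17000, 14000, 10000])
p = vehicles / vehicles.sum()       # standardized "size" variable
counts = rng.multinomial(500, p)    # analogue of NumberHits for the five intersections
expected = 500 * p                  # analogue of ExpectedHits
```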

Sample with replacement and unequal probability with PROC IML

For SAS/IML programmers, it might be more convenient to simulate data within PROC IML. The SAS/IML language provides two routines for sampling with replacement and with unequal probability. If you want only the counts (as shown in the previous example) you can use the RANDMULTINOMIAL function and specify the proportion of the traffic that passes through each intersection, as follows:

proc iml;
call randseed(54321);
use Traffic;
   read all var {"Intersection" "VehiclesPerDay"};
close;
Proportion = VehiclesPerDay / sum(VehiclesPerDay);   
Counts = RandMultinomial(1, 500, Proportion);    /* 1 sample of 500 events */

If instead you want the full sample (all 500 values in a random ordering), you can use the SAMPLE function. The third argument to the SAMPLE function enables you to specify whether the sampling is done with or without replacement. The fourth argument enables you to specify the unequal sampling probabilities, as follows:

Sample = sample(Intersection, 500, "Replace", Proportion);

In summary, when you want to sample with replacement and with unequal probabilities, use the METHOD=PPS_WR option in PROC SURVEYSELECT or use the SAMPLE function in SAS/IML.

tags: Bootstrap and Resampling, Simulation

The post Sample with replacement and unequal probability in SAS appeared first on The DO Loop.

11月 212014
 

My colleagues at the SAS & R blog recently posted an example of how to program a permutation test in SAS and R. Their SAS implementation used Base SAS and was "relatively cumbersome" (their words) when compared with the R code. In today's post I implement the permutation test in SAS/IML. This provides an apples-to-apples comparison because both SAS/IML and R are matrix-vector languages.

This permutation test is a simple resampling exercise that could be assigned as a homework problem in a classroom. If you are at a college or university, remember that SAS/IML is available for free for all academic users through the SAS University Edition.

Permutation tests in SAS/IML

The analysis was motivated by a talk about using computational methods to illuminate statistical analyses. The data are the number of mosquitoes that were attracted to human volunteers in an experiment after each volunteer had consumed either a liter of beer (n=25) or water (n=18). The following statements assign the experimental data to two SAS/IML vectors and compute the observed difference between the means of the two groups:

proc iml;
G1 = {27, 19, 20, 20, 23, 17, 21, 24, 31, 26, 28, 20, 27, 19, 25, 31, 24, 28, 24, 29, 21, 21, 18, 27, 20};
G2 = {21, 19, 13, 22, 15, 22, 15, 22, 20, 12, 24, 24, 21, 19, 18, 16, 23, 20};
obsdiff = mean(G1) - mean(G2);
print obsdiff;

The experimenters observed that, on average, people who drank beer attracted 4.4 more mosquitoes than people who drank water. The statistical question is, "What is the probability of observing a difference of this magnitude (or bigger) by chance if the beverages have no effect?" You can answer this question by using a permutation test to perform a nonparametric version of the t test. The null hypothesis is that there is no difference between the mean number of mosquitoes that were attracted to each experimental group (beer or water).

The permutation test enables you to generate the null distribution. Draw 25 random observations from the data and assign them to Group 1; assign the other 18 observations to Group 2. Compute the difference between the means of each group. Repeat these two steps many times to approximate the null distribution. The following SAS/IML statements use the SAMPLE function in SAS/IML to permute the data. The permutation step is repeated 9,999 times so that (adding in the original data order) there are a total of 10,000 permutations of the data:

call randseed(12345);                             /* set random number seed */
alldata = G1 // G2;                        /* stack data in a single vector */
N1 = nrow(G1);  N = N1 + nrow(G2);
NRepl = 9999;                                     /* number of permutations */
nulldist = j(NRepl,1);                   /* allocate vector to hold results */
do k = 1 to NRepl;
   x = sample(alldata, N, "WOR");                       /* permute the data */
   nulldist[k] = mean(x[1:N1]) - mean(x[(N1+1):N]);  /* difference of means */
end;
 
title "Histogram of Null Distribution";
refline = "refline " + char(obsdiff) + " / axis=x lineattrs=(color=red);";
call Histogram(nulldist) other=refline;

The histogram shows the distribution of mean differences that were computed under the assumption of the null hypothesis. The observed difference between the beer and water groups (the vertical red line at 4.38) is way off in the tail. Since the null hypothesis is not a likely explanation for the observed difference, we reject it. We conclude that mosquitoes are attracted differently to the two groups (beer and water).

If you would like to compute the empirical p-value for the null distribution, that is easily accomplished:

pval = (1 + sum(abs(nulldist) >= abs(obsdiff))) / (NRepl+1);
print pval;

Vectorization for permutation tests

Regular readers of my blog know that I advocate vectorizing programs whenever possible. Matrix-vector languages such as SAS/IML, R, and MATLAB work more efficiently when computations inside loops are replaced by vector or matrix computations.

Because of the way that SAS/IML loops are compiled and optimized, using loops in the SAS/IML language is not as detrimental to performance as in some other languages. For example, the previous permutation test code runs in about 0.04 seconds on my PC from 2009. Still, I like to promote vectorization because it can be important to performance.

The following statements eliminate the DO loop and implement the resampling and permutation test in two lines of SAS/IML code. The vectorized computation runs in about one-fourth the time:

x = sample(alldata, N//NRepl, "WOR");                 /* create all resamples */
nulldist = x[,1:N1][,:] - x[,(N1+1):N][,:];   /* compute all mean differences */

The vectorized computation uses the colon (:) subscript reduction operator in SAS/IML to compute the mean of the first 25 and the last 18 elements for each set of permuted data.
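The same vectorized permutation test translates directly to Python (a re-implementation for illustration; Generator.permuted with axis=1 permutes each row independently and requires NumPy 1.20 or later):

```python
import numpy as np

g1 = np.array([27,19,20,20,23,17,21,24,31,26,28,20,27,19,25,
               31,24,28,24,29,21,21,18,27,20])
g2 = np.array([21,19,13,22,15,22,15,22,20,12,24,24,21,19,18,16,23,20])
data = np.concatenate([g1, g2])
n1 = len(g1)
obs = g1.mean() - g2.mean()                  # observed difference, about 4.38

rng = np.random.default_rng(12345)
B = 9999
perm = rng.permuted(np.tile(data, (B, 1)), axis=1)       # B row-wise permutations
null = perm[:, :n1].mean(axis=1) - perm[:, n1:].mean(axis=1)
pval = (1 + np.sum(np.abs(null) >= abs(obs))) / (B + 1)  # empirical p-value
```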

Additional references for resampling in SAS

To learn more about efficient ways to implement resampling methods such as bootstrapping and permutation tests, consult the following references:

  • For information about bootstrapping in SAS/IML, see pages 14–17 of Wicklin (2008).
  • For another permutation test example, see pages 11–14 of Wicklin (2012).
  • Chapter 15 of my book Simulating Data with SAS describes resampling methods in both Base SAS and SAS/IML. I include useful tips and techniques that make bootstrapping in Base SAS less cumbersome, including a mention of the %BOOT and %BOOTCI macros, which enable you to implement bootstrap methods in Base SAS by using only a few program statements.
  • For an excellent introduction to resampling methods in Base SAS, see Cassell (2007).
tags: Bootstrap and Resampling, Data Analysis, vectorization
5月 292014
 

Bootstrap methods and permutation tests are popular and powerful nonparametric methods for testing hypotheses and approximating the sampling distribution of a statistic. I have described a SAS/IML implementation of a bootstrap permutation test for matched pairs of data (an alternative to a matched-pair t test) in my paper "Modern Data Analysis for the Practicing Statistician" (Wicklin, 2010, pp 11–14).

The matched-pair permutation test enables you to determine whether the means of the two groups are significantly different. Recently, a SAS user asked how to create a permutation test that compares the means of k groups. An excellent overview of permutation tests in the ANOVA context is provided by M. J. Anderson (2001), who says (p. 628):

Suppose the null hypothesis is true and the groups are not really different (in terms of the measured variable). If this is the case, then the observations are exchangeable between the different groups. That is, the labels that are associated with particular values, identifying them as belonging to a particular group, could be randomly shuffled (permuted) and a new value of [a test statistic] could be obtained.

In a matrix language such as SAS/IML, data is often packed into a matrix with n rows and k columns. (That is, the data are stored in "wide form," as opposed to the "long form" that would be used by the ANOVA or GLM procedures.) One way to implement a permutation test for ANOVA is to apply a permutation to the k elements in each row. The purpose of this article is to provide an efficient way to permute elements within each row of a matrix.

How to permute elements within rows?

Let's start by defining some data and reading the data into a SAS/IML matrix:

data test;
input t1-t3;
datalines;
45 50 55
42 42 45
36 41 43
39 35 40
51 55 59
44 49 56
;
 
proc iml;
use test; read all into x; close test;

One approach to permute the elements of each row would be to loop over the rows and apply the RANPERM function to each row. That approach is fine for small data sets, but it is not a vectorized operation. To efficiently permute elements within each row, I will use three facts:

  • The RANPERM function can generate n independent random permutations of a set of k elements. Consequently, the RANPERM function can generate a permutation of the column subscripts {1 2 3} for each row.
  • The ROW function returns the row number for each element in a matrix. (If your version of SAS/IML is older than 12.3, you can use the ROW function that I defined in a previous blog post.)
  • A SAS/IML matrix is stored in row-major order. In an n x k matrix, the element with subscript (i,j) is stored in position k(i – 1) + j.

By using these three facts, you can construct a function that independently permutes elements of the rows of a matrix:

/* independently permute elements of each row of a matrix */
start PermuteWithinRows(m);
   n = nrow(m);  k = ncol(m); 
   j = ranperm(1:k, n);          /* each row is permutation of {1 2 ... k} */
   matIdx = k*(row(m) - 1)  + j;      /* matrix position; ROW fcn in 12.3  */
   return( shape(m[matIdx], n) );     /* permute elements of m and reshape */
finish;
 
/* call the function on example data */
call randseed(1234);
p = PermuteWithinRows(x);
print x p;

The PermuteWithinRows function is very efficient and can be used inside a loop as part of a permutation test. You can also use this technique to implement random permutations in an experimental design.
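The same flat-index trick carries over to NumPy, where arrays are also row-major. The following sketch uses the example data; the per-row permutations come from argsorting random keys (a standard vectorized substitute for RANPERM), and the index arithmetic is the 0-based version of k(i – 1) + j:

```python
import numpy as np

rng = np.random.default_rng(1234)
m = np.array([[45, 50, 55], [42, 42, 45], [36, 41, 43],
              [39, 35, 40], [51, 55, 59], [44, 49, 56]])
n, k = m.shape
j = np.argsort(rng.random((n, k)), axis=1)   # a permutation of 0..k-1 for each row
flat = k * np.arange(n)[:, None] + j         # row-major flat position (0-based)
p = m.ravel()[flat]                          # permute within rows, shape preserved
```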

tags: Bootstrap and Resampling, Tips and Techniques, vectorization