Bootstrap and Resampling

January 20, 2021
 

This is the third and last introductory article about how to bootstrap time series in SAS. In the first article, I presented the simple block bootstrap and discussed why bootstrapping a time series is more complicated than for regression models that assume independent errors. Briefly, when you perform residual resampling on a time series, you need to resample in blocks to simulate the autocorrelation in the original series. In the second article, I introduced the moving-block bootstrap. For both methods, all blocks are the same size, and the block size must evenly divide the length of the series (n).

In contrast, the stationary block bootstrap uses blocks of random lengths. This article describes the stationary block bootstrap and shows how to implement it in SAS.

The "stationary" bootstrap gets its name from Politis and Romano (1992), who developed it. When applied to stationary data, the resampled pseudo-time series is also stationary.

Choosing the length of a block

In the moving-block bootstrap, the starting location for a block is chosen randomly, but all blocks have the same length. For the stationary block bootstrap, both the starting location and the length of the block are chosen at random.

The block length is typically drawn from a geometric distribution with parameter p. If you want the average block length to be L, define p = 1/L because the expected value of a geometric distribution with parameter p is 1/p.

Why choose lengths from the geometric distribution? Because a geometrically distributed length has a natural interpretation in terms of a discrete process. Imagine performing the following process:

  • Choose an initial location, t, at random.
  • With probability p, choose a different location (at random) for the next residual and start a new block.
  • With probability (1-p), choose the location t+1 for the next residual and add that residual to the current block.
By definition, this process generates blocks whose lengths are geometrically distributed: L ~ Geom(p).
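
To see this process in action, the following short simulation (an addition to the post; the seed and the number of blocks are arbitrary) implements the steps above and confirms that the average block length is close to 1/p:

/* Sketch: simulate the block-length process and verify that the mean length is 1/p */
proc iml;
call randseed(321);
p = 1/12;
numBlocks = 10000;
len = j(numBlocks, 1, 0);
do i = 1 to numBlocks;
   L = 1;                                  /* a block contains at least one residual */
   extend = randfun(1, "Bernoulli", 1-p);  /* w/ prob (1-p), extend the block */
   do while( extend );
      L = L + 1;
      extend = randfun(1, "Bernoulli", 1-p);
   end;
   len[i] = L;
end;
meanLen = mean(len);
print meanLen;                             /* should be close to 1/p = 12 */
QUIT;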

Generate a random block

There are several ways to implement the stationary block bootstrap in SAS. A straightforward method is to generate a starting integer, t, uniformly at random in the range [1, n], where n is the length of the series. Then, choose a length, L ~ Geom(p). If t + L exceeds n, truncate the block. (Some practitioners "wrap" the block to include residuals at the beginning of the series. This alternative method is called the circular block bootstrap.) Continue this process until the total length of the blocks is n. You will usually need to truncate the length of the last block so that the sum of the lengths is exactly n. This process is implemented in the following SAS/IML user-defined functions:

/* STATIONARY BLOCK BOOTSTRAP */
/* Use a random starting point, t, and a random block length, L ~ Geom(p), 
   - probability p of starting a new block at a new position
   - probability (1-p) of using the next residual from the current block
*/
%let p = (1/12);                /* expected block length is 1/p */ 
proc iml;
/* generate indices for a block of random length L~Geom(p)
   chosen from the sequence 1:n */
start RandBlockIdx(n, p);
   beginIdx = randfun(1, "Integer", n); /* random start: 1:n */
   L = randfun(1, "Geom", p);           /* random length */
   endIdx = min(beginIdx+L-1, n);       /* truncate into 1:n */
   return( beginIdx:endIdx );           /* return the block of indices, a subset of 1:n */
finish;
 
/* generate indices for one stationary bootstrap resample */
start StationaryIdx(n, p);
   bootIdx = j(n,1,.);
   s = 1;                       /* starting index for first block */
   do while( s <= n );
      idx = RandBlockIdx(n, p); /* get block of random length */
      if ncol(idx) > n-s then   /* truncate if block too long */
         idx = idx[, 1:n-s+1];
      L = ncol(idx);            /* length of this block */
      jdx = s:s+L-1;            /* all indices for this block */
      bootIdx[jdx] = idx;       /* assign this block */
      s = s + L;                /* starting index for next block */
   end;
   return bootIdx;
finish;
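
Before applying these functions to real residuals, a quick sanity check (not part of the original post) displays the indices for a short series; runs of consecutive integers are the blocks:

/* quick check: one resample of indices for n=10 with mean block length 2 */
idx0 = StationaryIdx(10, 0.5);
print (idx0`)[label="idx0"];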

Let's run these functions on some sample data. The first article in this series shows how to fit a model to the Sashelp.Air data and write the predicted and residual values. The following statements use the stationary block bootstrap to generate one bootstrap sample of length n from the residuals. When you add the bootstrapped residuals to the predicted values, you get a "new" time series, which is graphed:

use OutReg;                     /* data to bootstrap: Y = Pred + Resid */
   read all var {'Time' 'Pred' 'Resid'};
close;
 
call randseed(1234);
n = nrow(Pred);                 /* length of series */
prob = &p;                      /* probability of jumping to a new block */
 
/*---- Visualization Example: generate one bootstrap resample ----*/
idx = StationaryIdx(n, prob);
YBoot = Pred + Resid[idx];
 
/* visualize the places where a sequence is not ascending:
https://blogs.sas.com/content/iml/2013/08/28/finite-diff-estimate-maxi.html */
blocksIdx = loc(dif(idx) <= 0); /* where is the sequence not ascending? */
b = Time[blocksIdx];
 
title "One Bootstrap Resample";
title2 "Stationary Block Bootstrap";
refs = "refline " + char(b) + " / axis=x;";
call series(Time, YBoot) other=refs;
/*---- End of visualization example ----*/

The graph shows a single bootstrap sample. I have added vertical reference lines to show the blocks of residuals. For this random sample, there are nine blocks whose lengths vary between 2 (the last block) and 28. If you choose another random sample, you will get a different set of blocks that have different lengths.

The stationary block bootstrap

In practice, the bootstrap method requires many bootstrap resamples. You can call the StationaryIdx function repeatedly in a loop to generate B=1000 bootstrap samples, as follows:

/* A complete stationary block bootstrap generates B resamples
   and usually writes the resamples to a SAS data set. */
B = 1000;
SampleID = j(n,1,.);
create BootOut var {'SampleID' 'Time' 'YBoot'};  /* create outside of loop */
do i = 1 to B;
   SampleId[,] = i;   /* fill array. See https://blogs.sas.com/content/iml/2013/02/18/empty-subscript.html */
   idx = StationaryIdx(n, prob);
   YBoot = Pred + Resid[idx];
   append;
end;
close BootOut;

The BootOut data set contains all the bootstrap samples. You can use a BY-group analysis to obtain the bootstrap estimates for almost any statistic. For example, the following scatter plot shows the distribution of 1000 bootstrap estimates of the slope and intercept parameters for an autoregressive model of the series. Reference lines indicate the estimates for the original data.
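
The pattern is the same as in the companion articles in this series; the following sketch (assembled from the code shown in those posts) fits the AR model to each sample and plots the estimates. The REFLINE values are the parameter estimates for the original data:

/* BY-group analysis of the bootstrap samples */
proc autoreg data=BootOut plots=none outest=BootEst noprint;
   by SampleID;
   AR12: model YBoot = Time / nlag=12;
run;
 
title "Distribution of Parameter Estimates";
proc sgplot data=BootEst;
   scatter x=Intercept y=Time;
   xaxis grid; yaxis grid;
   refline 77.5402 / axis=x;   /* estimates for the original data */
   refline 2.7956  / axis=y;
run;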

You can download the complete SAS program that performs all computations and creates all graphs in this article.

Summary

This article is the third of a series. It shows how to perform a stationary block bootstrap by randomly sampling blocks of residuals. Whereas the simple-block and the moving-block bootstrap methods use blocks that have a constant length, the stationary block bootstrap chooses blocks of random lengths.

In researching these blog posts, I read several articles. I recommend Bates (2019) for an introduction and Politis and Romano (1994) for details.

The post The stationary block bootstrap in SAS appeared first on The DO Loop.

January 13, 2021
 

As I discussed in a previous article, the simple block bootstrap is a way to perform a bootstrap analysis on a time series. The first step is to decompose the series into additive components: Y = Predicted + Residuals. You then choose a block length (L) that divides the total length of the series (n). Each bootstrap resample is generated by randomly choosing from among the non-overlapping n/L blocks of residuals, which are added to the predicted model.

The simple block bootstrap is not often used in practice. One reason is that the total number of blocks (k=n/L) is often small. If so, the bootstrap resamples do not capture enough variation for the bootstrap method to make correct inferences. This article describes a better alternative: the moving block bootstrap. In the moving block bootstrap, every block has the same block length but the blocks overlap. The following figure illustrates the overlapping blocks when L=3. The indices 1:L define the first block of residuals, the indices 2:L+1 define the second block, and so forth until the last block, which contains the residuals n-L+1:n.

To form a bootstrap resample, you randomly choose k=n/L blocks (with replacement) and concatenate them. You then add these residuals to the predicted values to create a "new" time series. Repeat the process many times and you have constructed a batch of bootstrap resamples. The process of forming one bootstrap sample is illustrated in the following figure. In the figure, the time series has been reshaped into a k x L matrix, where each row is a block.

The moving block bootstrap in SAS

To demonstrate the moving block bootstrap in SAS, let's use the same data that I analyzed in the previous article about the simple block bootstrap. The previous article extracted 132 observations from the Sashelp.Air data set and used PROC AUTOREG to form an additive model Predicted + Residuals. The OutReg data set contains three variables of interest: Time, Pred, and Resid.

As before, I will choose the block size to be L=12. The following SAS/IML program reads the data and defines a matrix (R) such that the i_th row contains the residuals with indices i:i+L-1. In total, the matrix R has n-L+1 rows.

/* MOVING BLOCK BOOTSTRAP */
%let L = 12;
proc iml;
call randseed(12345);
 
use OutReg;
   read all var {'Time' 'Pred' 'Resid'};
close;
 
/* Restriction for Simple Block Bootstrap: 
   The length of the series (n) must be divisible by the number of blocks (k)
   so that all blocks have the same length (L) */
n = nrow(Pred);          /* length of series */
L = &L;                  /* length of each block */
k = n / L;               /* number of random blocks to use */
if k ^= int(k) then do;
   print "The series length is not divisible by the block length";
   abort;   /* stop: the blocks must tile the series exactly */
end;
 
/* Trick: Reshape data into k x L matrix. Each row is block of length L */
P = shape(Pred, k, L);   /* there are k rows for Pred */
J = n - L + 1;           /* total number of overlapping blocks to choose from */
R = j(J, L, .);          /* there are n-L+1 blocks of residuals */
Resid = rowvec(Resid);   /* make Resid into row vector so we don't need to transpose each row */
do i = 1 to J;
   R[i,] = Resid[ , i:i+L-1]; /* fill each row with a block of residuals */
end;

With this setup, the formation of bootstrap resamples is almost identical to the program in the previous article. The only difference is that the matrix R for the moving block bootstrap has more rows. Nevertheless, each resample is formed by randomly choosing k rows from R and adding them to a block of predicted values. The following statements generate B=1000 bootstrap resamples, which are written to a SAS data set (BootOut). The program writes the Time variable, the resampled series (YBoot), and an ID variable that identifies each bootstrap sample.

/* The moving block bootstrap repeats this process B times
   and usually writes the resamples to a SAS data set. */
B = 1000;
SampleID = j(n,1,.);
create BootOut var {'SampleID' 'Time' 'YBoot'};  /* create outside of loop */
do i = 1 to B;
   SampleId[,] = i;
   idx = sample(1:J, k);  /* sample of size k from the set 1:J */
   YBoot = P + R[idx,];
   append;
end;
close BootOut;
QUIT;

The BootOut data set contains B=1000 bootstrap samples. The rest of the bootstrap analysis is exactly the same as in the previous article.

Summary

This article shows how to perform a moving block bootstrap on a time series in SAS. First, you need to decompose the series into additive components: Y = Predicted + Residuals. You then choose a block length (L), which must divide the total length of the series (n), and form the n-L+1 overlapping blocks of residuals. Each bootstrap resample is generated by randomly choosing blocks of residuals and adding them to the predicted model. This article uses the SAS/IML language to perform the simple block bootstrap in SAS.

You can download the SAS program that shows the complete analysis for both the simple block and moving block bootstrap.

The post The moving block bootstrap for time series appeared first on The DO Loop.

January 6, 2021
 

For ordinary least squares (OLS) regression, you can use a basic bootstrap of the residuals (called residual resampling) to perform a bootstrap analysis of the parameter estimates. This is possible because an assumption of OLS regression is that the residuals are independent. Therefore, you can reshuffle the residuals to get each bootstrap sample.

For a time series, the residuals are not independent. Rather, if you fit a model to the data, the residual at time t+i is often close to the residual at time t for small values of i. This is known as autocorrelation in the error component. Accordingly, if you want to bootstrap the residuals of a time series, it is not correct to randomly shuffle the residuals, because shuffling would destroy the autocorrelation. Instead, you need to randomly choose a block of residuals (for example, at times t, t+1, ..., t+L-1) and use those blocks of residuals to create bootstrap resamples. You repeatedly choose random blocks until you have enough residuals to create a bootstrap resample.

There are several ways to choose blocks:

  • The simplest way is to choose from non-overlapping blocks of a fixed length, L. This is called the simple block bootstrap and is described in this article.
  • Only slightly more complicated is to allow the blocks of length L to overlap. This is called the moving block bootstrap and is described in a separate article.
  • A more complicated scheme (but which has superior statistical properties) is to allow the length of the blocks to vary. This is called the stationary block bootstrap and is described in a third article.

Create the residuals

There are many ways to fit a model to a time series and to obtain the model residuals. Trovero and Leonard (2018) discuss several modern methods to fit trends, cycles, and seasonality by using SAS 9.4 or SAS Viya. To get the residuals, you will want to fit an additive model. In this article, I will use the Sashelp.Air data and will fit a simple additive model (trend plus noise) by using the AUTOREG procedure in SAS/ETS software.

The Sashelp.Air data set has 144 months of data. The following SAS DATA step drops the first year of data, which leaves 11 years of 12 months. I am doing this because I am going to use blocks of size L=12, and I think the example will be clearer if there are 11 blocks of size 12 (rather than 12 blocks).

data Air;
set Sashelp.Air;
if Date >= '01JAN1950'd;   /* exclude first year of data */
Time = _N_;                /* the observation number */
run;
 
title "Original Series: Air Travel";
proc sgplot data=Air;
   series x=Time y=Air;
   xaxis grid; yaxis grid;
run;

The graph suggests that the time series has a linear trend. The following call to PROC AUTOREG fits a linear model to the data. The predicted mean and residuals are output to the OutReg data set as the PRED and RESID variables, respectively. The call to PROC SGPLOT overlays a graph of the trend and a graph of the residuals.

/* Similar to Getting Started example in PROC AUTOREG */
proc autoreg data=Air plots=none outest=RegEst;
   AR12: model Air = Time / nlag=12;
   output out=OutReg pm=Pred rm=Resid;  /* mean prediction and residuals */
   ods select FinalModel.ParameterEstimates ARParameterEstimates;
run;
 
title "Mean Prediction and Residuals from AR Model";
proc sgplot data=OutReg;
   series x=Time y=Pred;
   series x=Time y=Resid;
   refline 0 / axis=y;
   xaxis values=(24 to 144 by 12) grid valueshint;
run;

The parameter estimates are shown for the linear model. On average, airlines carried an additional 2.8 thousand passengers per month during this time period. The graph shows the decomposition of the series into a linear trend and residuals. I added vertical lines to indicate the blocks of residuals that are used in the next section. The first block contains the residuals for times 13-24. The second block contains the residuals for times 25-36, and so forth until the 11th block, which contains the residuals for times 133-144.

Set up the simple block bootstrap

For the simple bootstrap, the length of the blocks (L) must evenly divide the length of the series (n), which means that k = n / L is an integer. Because I dropped the first year of observations from Sashelp.Air, there are n=132 observations. I will choose the block size to be L=12, which means that there are k=11 non-overlapping blocks.

Each bootstrap resample is formed by randomly choosing k blocks (with replacement) and adding those residuals to the predicted values. Think about putting the n predicted values and residuals into a matrix in row-wise order. The first L observations are in the first row, the next L are in the second row, and so forth. Thus, the matrix has k rows and L columns. The original series is of the form Predicted + Residuals, where the plus sign represents matrix addition. For the simple block bootstrap, each bootstrap resample is obtained by resampling the rows of the residual matrix and adding them to the rows of the predicted matrix, which produces a new series of the form Predicted + (Random Residuals). This process is shown schematically in the following figure.

You can use the SAS/IML language to implement the simple block bootstrap. The following call to PROC IML reads in the original predicted and residual values and reshapes the vectors into k x L matrices (P and R, respectively). The SAMPLE function generates a sample (with replacement) of the vector 1:k, which is used to randomly select rows of the R matrix. To make sure that the process is working as expected, you can create one bootstrap resample and graph it. It should resemble the original series:

/* SIMPLE BLOCK BOOTSTRAP */
%let L = 12;
proc iml;
call randseed(12345);
/* the original series is Y = Pred + Resid */
use OutReg;
   read all var {'Time' 'Pred' 'Resid'};
close;
 
/* For the Simple Block Bootstrap, the length of the series (n) 
   must be divisible by the block length (L). */
n = nrow(Pred);          /* length of series */
L = &L;                  /* length of each block */
k = n / L;               /* number of non-overlapping blocks */
if k ^= int(k) then do;
   print "The series length is not divisible by the block length";
   abort;   /* stop: the blocks must tile the series exactly */
end;
 
/* Trick: reshape data into k x L matrix. Each row is block of length L */
P = shape(Pred, k, L);
R = shape(Resid, k, L); /* non-overlapping residuals (also k x L) */
 
/* Example: Generate one bootstrap resample by randomly selecting from the residual blocks */
idx = sample(1:nrow(R), k);     /* sample (w/ replacement) of size k from the set 1:k */
YBoot = P + R[idx,];
title "One Bootstrap Resample";
title2 "Simple Block Bootstrap";
refs = "refline " + char(do(12,nrow(Pred),12)) + " / axis=x;";
call series(Time, YBoot) other=refs;

The graph shows one bootstrap resample. The residuals from arbitrary blocks are concatenated until there are n residuals. These are added to the predicted value to create a "new" series, which is a bootstrap resample. You can generate a large number of bootstrap resamples and use them to perform inferences for time series statistics.

Implement the simple block bootstrap in SAS

You can repeat the process in a loop to generate more resamples. The following statements generate B=1000 bootstrap resamples. These are written to a SAS data set (BootOut). The program uses a technique in which the results of each computation are immediately written to a SAS data set, which is very efficient. The program writes the Time variable, the resampled series (YBoot), and an ID variable that identifies each bootstrap sample.

/* The simple block bootstrap repeats this process B times
   and usually writes the resamples to a SAS data set. */
B = 1000;
J = nrow(R);              /* J=k for non-overlapping blocks, but prepare for moving blocks */
SampleID = j(n,1,.);
create BootOut var {'SampleID' 'Time' 'YBoot'};  /* open data set outside of loop */
do i = 1 to B;
   SampleId[,] = i;       /* fill array: https://blogs.sas.com/content/iml/2013/02/18/empty-subscript.html */
   idx = sample(1:J, k);  /* sample of size k from the set 1:k */
   YBoot = P + R[idx,];
   append;                /* append each bootstrap sample */
end;
close BootOut;
QUIT;

The BootOut data set contains B=1000 bootstrap samples. You can efficiently analyze the samples by using a BY statement. For example, suppose that you want to investigate how the parameter estimates for the trend line vary among the bootstrap samples. You can run PROC AUTOREG on each bootstrap sample by using BY-group processing. Be sure to suppress ODS output during the BY-group analysis, and write the desired statistics to an output data set (BootEst), as follows:

/* Analyze the bootstrap samples by using a BY statement. See
   https://blogs.sas.com/content/iml/2012/07/18/simulation-in-sas-the-slow-way-or-the-by-way.html
*/
proc autoreg data=BootOut plots=none outest=BootEst noprint;
   by SampleID;
   AR12: model YBoot = Time / nlag=12;
run;
 
/* OPTIONAL: Use PROC MEANS or PROC UNIVARIATE to estimate standard errors and CIs */
proc means data=BootEst mean stddev P5 P95;
   var Intercept Time _A:;
run;
 
title "Distribution of Parameter Estimates";
proc sgplot data=BootEst;
   scatter x=Intercept y=Time;
   xaxis grid; yaxis grid; 
   refline 77.5402 / axis=x;
   refline 2.7956  / axis=y;
run;

The scatter plot shows the bootstrap distribution of the parameter estimates of the linear trend. The reference lines indicate the parameter estimates for the original data. You can use the bootstrap distribution for inferential statistics such as estimation of standard errors, confidence intervals, the covariance of estimates, and more.

You can perform a similar bootstrap analysis for any other statistic that is generated by any time series analysis. The important thing is that the block bootstrap is performed on some sort of residual or "noise" component, so be sure to remove the trend, seasonality, cycles, and so forth and then bootstrap the remainder.

Summary

This article shows how to perform a simple block bootstrap on a time series in SAS. First, you need to decompose the series into additive components: Y = Predicted + Residuals. You then choose a block length (L), which (for the simple block bootstrap) must divide the total length of the series (n). Each bootstrap resample is generated by randomly choosing blocks of residuals and adding them to the predicted model. This article uses the SAS/IML language to perform the simple block bootstrap in SAS.

In practice, the simple block bootstrap is rarely used. However, it illustrates the basic ideas for bootstrapping a time series, and it provides a foundation for more sophisticated bootstrap methods.

The post The simple block bootstrap for time series in SAS appeared first on The DO Loop.


October 26, 2020
 

A fundamental principle of data analysis is that a statistic is an estimate of a parameter for the population. A statistic is calculated from a random sample. This leads to uncertainty in the estimate: a different random sample would have produced a different statistic. To quantify the uncertainty, SAS procedures often support options that estimate standard errors for statistics and confidence intervals for parameters.

Of course, if statistics have uncertainty, so, too, do functions of the statistics. For complicated functions of the statistics, the bootstrap method might be the only viable technique for quantifying the uncertainty.

Graphical comparison of two methods for estimating confidence intervals of eigenvalues of a correlation matrix

This article shows how to obtain confidence intervals for the eigenvalues of a correlation matrix. The eigenvalues are complicated functions of the correlation estimates. The eigenvalues are used in a principal component analysis (PCA) to decide how many components to keep in a dimensionality reduction. There are two main methods to estimate confidence intervals for eigenvalues: an asymptotic (large sample) method, which assumes that the eigenvalues are multivariate normal, and a bootstrap method, which makes minimal distributional assumptions.

The following sections show how to compute each method. A graph of the results is shown to the right. For the data in this article, the bootstrap method generates confidence intervals that are more accurate than the asymptotic method.

This article was inspired by Larsen and Warne (2010), "Estimating confidence intervals for eigenvalues in exploratory factor analysis." Larsen and Warne discuss why confidence intervals can be useful when deciding how many principal components to keep.

Eigenvalues in principal component analysis

To demonstrate the techniques, let's perform a principal component analysis (PCA) on the four continuous variables in the Fisher Iris data. In SAS, you can use PROC PRINCOMP to perform a PCA, as follows:

%let DSName = Sashelp.Iris;
%let VarList = SepalLength SepalWidth PetalLength PetalWidth;
 
/* 1. compute value of the statistic on original data */
proc princomp data=&DSName STD plots=none;     /* stdize PC scores to unit variance */
   var &VarList;
   ods select ScreePlot Eigenvalues NObsNVar;
   ods output Eigenvalues=EV0 NObsNVar=NObs(where=(Description="Observations"));
run;
 
proc sql noprint;                              
 select nValue1 into :N from NObs;   /* put the number of obs into macro variable, N */
quit;
%put &=N;
Output from PROC PRINCOMP that shows eigenvalues of the correlation matrix

The first table shows that there are 150 observations. The second table displays the eigenvalues for the sample, which are 2.9, 0.9, 0.15, and 0.02. If you want a graph of the eigenvalues, you can use the PLOTS(ONLY)=SCREE option on the PROC PRINCOMP statement. The ODS OUTPUT statement creates SAS data sets from the tables. The PROC SQL call creates a macro variable, N, that contains the number of observations.
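
For example, a minimal call (not in the original post) that requests only the scree plot:

/* display only the scree plot of the eigenvalues */
proc princomp data=&DSName STD plots(only)=scree;
   var &VarList;
run;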

Asymptotic confidence intervals

If the sample size, n, is large enough, the sampling distribution of the eigenvalues is approximately multivariate normal (Larsen and Warne (2010, p. 873)). If g is an eigenvalue for a correlation matrix, then an asymptotic confidence interval is
     g ± z* sqrt( 2 g^2 / n )
where z* is the standard normal quantile, as computed in the following program:

/* Asymptotic CIs for eigenvalues (Larsen and Warne (2010, p. 873) */
data AsympCI;
set EV0;
alpha = 0.05;
z = quantile("Normal", 1 - alpha/2);  /* = 1.96 */
SE = sqrt(2*Eigenvalue**2 / &N);
Normal_LCL = Eigenvalue - SE;         /* g +/- z* sqrt(2 g^2 / n) */
Normal_UCL = Eigenvalue + SE;
drop alpha z SE;
run;
 
proc print data=AsympCI noobs;
   var Number Eigenvalue Normal_LCL Normal_UCL;
run;
Asymptotic (normal) confidence intervals for eigenvalues

The lower and upper confidence limits are shown for each eigenvalue. The advantage of this method is its simplicity. The intervals assume that the distribution of the eigenvalues is multivariate normal, which will occur when the sample size is very large. Since N=150 does not seem "very large," it is not clear whether these confidence intervals are valid. Therefore, let's estimate the confidence intervals by using the bootstrap method and compare the bootstrap intervals to the asymptotic intervals.

Bootstrap confidence intervals for eigenvalues

The bootstrap computations in this section follow the strategy outlined in the article "Compute a bootstrap confidence interval in SAS." (For additional bootstrap tips, see "The essential guide to bootstrapping in SAS.") The main steps are:

  1. Compute a statistic for the original data. This was done in the previous section.
  2. Use PROC SURVEYSELECT to resample (with replacement) B times from the data. For efficiency, put all B random bootstrap samples into a single data set.
  3. Use BY-group processing to compute the statistic of interest on each bootstrap sample. The union of the statistics is the bootstrap distribution, which approximates the sampling distribution of the statistic. Remember to turn off ODS when you run BY-group processing!
  4. Use the bootstrap distribution to obtain confidence intervals for the parameters.

The steps are implemented in the following SAS program:

/* 2. Generate many bootstrap samples */
%let NumSamples = 5000;       /* number of bootstrap resamples */
proc surveyselect data=&DSName NOPRINT seed=12345
     out=BootSamp(rename=(Replicate=SampleID))
     method=urs              /* resample with replacement */
     samprate=1              /* each bootstrap sample has N observations */
     /* OUTHITS                 output each hit as a separate row (then no FREQ variable is needed) */
     reps=&NumSamples;       /* generate NumSamples bootstrap resamples */
run;
 
/* 3. Compute the statistic for each bootstrap sample */
/* Suppress output during this step:
   https://blogs.sas.com/content/iml/2013/05/24/turn-off-ods-for-simulations.html
*/
%macro ODSOff();   ods graphics off; ods exclude all;  ods noresults;   %mend;
%macro ODSOn();    ods graphics on;  ods exclude none; ods results;     %mend;
 
%ODSOff;
proc princomp data=BootSamp STD plots=none;
   by SampleID;
   freq NumberHits;
   var &VarList;
   ods output Eigenvalues=EV(keep=SampleID Number Eigenvalue);
run;
%ODSOn;
 
/* 4. Estimate 95% confidence interval as the 2.5th through 97.5th percentiles of the bootstrap distribution */
proc univariate data=EV noprint;
   class Number;
   var EigenValue;
   output out=BootCI pctlpre=Boot_ pctlpts=2.5  97.5  pctlname=LCL UCL;
run;
 
/* merge the bootstrap CIs with the normal CIs for comparison */
data AllCI;
   merge AsympCI(keep=Number Eigenvalue Normal:) BootCI(keep=Number Boot:);
   by Number;
run;
 
proc print data=AllCI noobs;
   format Eigenvalue Normal: Boot: 5.3;
run;
Comparison of two methods for estimating confidence intervals of eigenvalues of a correlation matrix

The table displays the bootstrap confidence intervals (columns 5 and 6) next to the asymptotic confidence intervals (columns 3 and 4). It is easier to compare the intervals if you visualize them graphically, as follows:

/* convert from wide to long */
data CIPlot;
   set AllCI;
   Method = "Normal   "; LCL = Normal_LCL; UCL = Normal_UCL; output;
   Method = "Bootstrap"; LCL = Boot_LCL; UCL = Boot_UCL; output;
   keep Method Eigenvalue Number LCL UCL;
run;
 
title "Comparison of Normal and Bootstrap Confidence Intervals";
title2 "Eigenvalues of the Correlation Matrix for the Iris Data";
ods graphics / width=480px height=360px;
proc sgplot data=CIPlot;
   scatter x=Eigenvalue y=Number / group=Method clusterwidth=0.4
           xerrorlower=LCL xerrorupper=UCL groupdisplay=cluster;
   yaxis grid type=discrete colorbands=even colorbandsattrs=(color=gray transparency=0.9);
   xaxis grid;
run;

The graph is shown at the top of this article. The graph nicely summarizes the comparison. For the first (largest) eigenvalue, the bootstrap confidence interval is about half as wide as the normal confidence interval. Thus, the asymptotic result seems too wide for these data. For the other eigenvalues, the normal confidence intervals appear to be too narrow.

If you graph the bootstrap distribution, you can see that the bootstrap distribution does not appear to be multivariate normal. This presumably explains why the asymptotic intervals are so different from the bootstrap intervals. For completeness, the following graph shows a matrix of scatter plots and marginal histograms for the bootstrap distribution. The histograms indicate skewness in the bootstrap distribution.
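
The code for this graph is in the downloadable program; the following sketch (an assumption-based addition) reshapes the bootstrap eigenvalues from long to wide form and calls PROC SGSCATTER:

/* Sketch: reshape the bootstrap eigenvalues from long to wide, then plot */
proc transpose data=EV out=EVWide(drop=_NAME_) prefix=Eigen;
   by SampleID;
   id Number;
   var Eigenvalue;
run;
 
proc sgscatter data=EVWide;
   matrix Eigen1-Eigen4 / diagonal=(histogram);
run;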

Bootstrap distribution for eigenvalues (5000 samples)

Summary

This article shows how to compute confidence intervals for the eigenvalues of an estimated correlation matrix. The first method uses a formula that is valid when the sampling distribution of the eigenvalues is multivariate normal. The second method uses bootstrapping to approximate the distribution of the eigenvalues, then uses percentiles of the distribution to estimate the confidence intervals. For the Iris data, the bootstrap confidence intervals are substantially different from the asymptotic formula.

The post Confidence intervals for eigenvalues of a correlation matrix appeared first on The DO Loop.

June 3, 2020
 

I recently read an article that describes ways to compute confidence intervals for the difference in a percentile between two groups. In Eaton, Moore, and MacKenzie (2019), the authors describe a problem in hydrology. The data are the sizes of pebbles (grains) in rivers at two different sites. The authors ask whether the p_th percentile of size is significantly different between the two sites. They also want a confidence interval for the difference. The median (50th percentile) and 84th percentile were the main focus of their paper.

For those of us who are not hydrologists, consider a sample from group 'A' and a sample from group 'B'. The problem is to test whether the p_th percentile is different in the two groups and to construct a confidence interval for the difference.

The authors show two methods: a binomial-based probability model (Section 2) and a bootstrap confidence interval (Section 3). However, there is another option: use quantile regression to estimate the difference in the quantiles and to construct a confidence interval. In this article, I show the computation for the 50th and 84th percentiles and compare the results to a bootstrap estimate. You can download the SAS program that creates the analyses and graphs in this article.

If you are not familiar with quantile regression, see my previous article about how to use quantile regression to estimate the difference between two medians.

A data distribution of particle sizes

For convenience, I simulated samples for the sizes of pebbles in two stream beds:

  • For Site 'A', the sizes are lognormally distributed with μ=0.64 and σ=1. For the LN(0.64, 1) distribution, the median pebble size is 1.9 mm and the size for the 84th percentile is 5.1 mm.
  • For Site 'B', the sizes are lognormally distributed with μ=0.53 and σ=1. For the LN(0.53, 1) distribution, the median pebble size is 1.7 mm and the size for the 84th percentile is 4.6 mm.

I assume that sizes are measured to the nearest 0.1 mm. I simulated 500 pebbles from Site 'A' and 300 pebbles from Site 'B'. You can use PROC UNIVARIATE in SAS to compute a comparative histogram and to compute the percentiles. The sample distributions are shown below:
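
The simulation code is in the downloadable program; a minimal sketch, assuming the parameters above and the variable names (Site, Diameter) that PROC QUANTREG uses later, might look like the following. The seed is arbitrary:

/* Sketch: simulate lognormal pebble diameters, rounded to the nearest 0.1 mm */
data Grains;
call streaminit(123);
length Site $1;
Site = 'A';
do i = 1 to 500;
   Diameter = round( exp(rand("Normal", 0.64, 1)), 0.1 );  /* LN(0.64, 1) */
   output;
end;
Site = 'B';
do i = 1 to 300;
   Diameter = round( exp(rand("Normal", 0.53, 1)), 0.1 );  /* LN(0.53, 1) */
   output;
end;
drop i;
run;
 
/* comparative histogram and percentiles, by site */
proc univariate data=Grains;
   class Site;
   var Diameter;
   histogram Diameter;
run;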

For Site 'A', the estimates for the sample median and 84th percentile are 2.05 mm and 5.1 mm, respectively. For Site 'B', the estimates are 1.60 mm and 4.7 mm, respectively. These estimates are close to their respective parameter values.

Estimate difference in percentiles

The QUANTREG procedure in SAS makes it easy to estimate the difference in the 50th and 84th percentiles between the two groups. The syntax for the QUANTREG procedure is similar to other SAS regression procedures such as PROC GLM:

proc quantreg data=Grains;
   class Site;
   model Diameter = Site / quantile=0.5 0.84;
   estimate 'Diff in Pctl' Site 1 -1 / CL;
run;

The output from PROC QUANTREG shows estimates for the difference between the percentiles of the two groups. For the 50th percentile, an estimate for the difference is 0.4 with a 95% confidence of [0.09, 0.71]. Because 0 is not in the confidence interval, you can conclude that the median pebble size at Site 'A' is significantly different from the median pebble size at Site 'B'. For the 84th percentile, an estimate for the difference is 0.3 with a 95% confidence of [-0.57, 1.17]. Because 0 is in the interval, the difference between the 84th percentiles is not significantly different from 0.

Methods for estimating confidence intervals

The QUANTREG procedure supports several methods for estimating a confidence interval: sparsity, rank, and resampling. The estimates in the previous section use the RANK method, which is the default for smaller data sets. You can use the CI= option on the PROC QUANTREG statement to choose among these methods and to specify options for them. The following graph summarizes the results for four combinations of methods and options. The conclusions do not change: the medians are significantly different, but the 84th percentiles are not.
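
For example, the following sketch (an addition; the NREP value is arbitrary) requests resampling-based confidence intervals:

/* request resampling-based confidence intervals */
proc quantreg data=Grains ci=resampling(nrep=1000);
   class Site;
   model Diameter = Site / quantile=0.5 0.84;
   estimate 'Diff in Pctl' Site 1 -1 / CL;
run;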

A comparison with bootstrap estimates

When you use a resampling-based estimate for the confidence interval, the interval depends on the random number seed, the algorithm used to generate random numbers, and the number of bootstrap iterations. This can make it hard to compare Monte Carlo results to the results from deterministic statistical methods. Nevertheless, I will present one result from a bootstrap analysis of the simulated data. The following table shows the bootstrap percentile estimates (used by Eaton, Moore, and MacKenzie (2019)) for the difference between percentiles.

The confidence intervals (the Lower and Upper columns of the table) are comparable to the intervals produced by PROC QUANTREG. The confidence interval for the difference between medians seems equivalent. The bootstrap interval for the 84th percentile is shifted to the right relative to the QUANTREG intervals. However, the inferences are the same: the medians are different but there is no significant difference between the 84th percentiles.

The following histogram shows the difference between the 84th percentiles for 5,000 bootstrap samples. The confidence interval is determined by the 2.5th and 97.5th percentiles of the bootstrap distribution. The computation requires only a short SAS/IML program, which runs very quickly.
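
The bootstrap program is in the downloadable file; a minimal SAS/IML sketch (an assumption-based addition; the seed is arbitrary) resamples within each group and computes the percentile-based interval:

/* Sketch: bootstrap the difference between the 84th percentiles */
proc iml;
use Grains;
   read all var {"Site" "Diameter"};
close;
call randseed(54321);
A = Diameter[ loc(Site="A") ];
B = Diameter[ loc(Site="B") ];
nBoot = 5000;
diff = j(nBoot, 1, .);
do i = 1 to nBoot;
   XA = colvec( sample(A, nrow(A)) );   /* resample with replacement within group */
   XB = colvec( sample(B, nrow(B)) );
   call qntl(qA, XA, 0.84);
   call qntl(qB, XB, 0.84);
   diff[i] = qA - qB;
end;
call qntl(CI, diff, {0.025 0.975});     /* percentile-based 95% CI */
print (CI`)[colname={"LCL" "UCL"}];
QUIT;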

Summary

Data analysts might struggle to find an easy way to compute the difference between percentiles for two (or more) groups. A recent paper by Eaton, Moore, and MacKenzie (2019) proposes one solution, which is to use resampling methods. An alternative is to use quantile regression to estimate the difference between percentiles, as shown in this article. I demonstrate the method by using simulated data of the form studied by Eaton, Moore, and MacKenzie.

You can download the SAS program that creates the analyses and graphs in this article.

The post How to estimate the difference between percentiles appeared first on The DO Loop.

May 8, 2019
 

At SAS Global Forum 2019, Daymond Ling presented an interesting discussion of binary classifiers in the financial industry. The discussion is motivated by a practical question: If you deploy a predictive model, how can you assess whether the model is no longer working well and needs to be replaced?

Daymond discussed the following three criteria for choosing a model:

  1. Discrimination: The ability of the binary classifier to predict the class of a labeled observation. The area under an ROC curve is one measure of a binary model's discrimination power. In SAS, you can compute the ROC curve for any predictive model.
  2. Accuracy: The ability of the model to estimate the probability of an event. The calibration curve is a graphical indication of a model's accuracy. In SAS, you can compute a calibration curve manually, or you can use PROC LOGISTIC in SAS/STAT 15.1 to automatically compute a calibration curve.
  3. Stability: Point estimates are often used to choose a model, but you should be aware of the variability of the estimates. This is a basic concept in statistics: When choosing between two unbiased estimators, you should usually choose the one that has smaller variance. SAS procedures provide (asymptotic) standard errors for many statistics such as the area under an ROC curve. If you have reason to doubt the accuracy of an asymptotic estimate, you can use bootstrap methods in SAS to estimate the sampling distribution of the statistic.

Estimates of model stability

My article about comparing the ROC curves for predictive models contains two competing models: A model from using PROC LOGISTIC and an "Expert model" that was constructed by asking domain experts for their opinions. (The source of the models is irrelevant; you can use any binary classifier.) You can download the SAS program that produces the following table, which estimates the area under each ROC curve, the standard error, and 90% confidence intervals:

The "Expert" model has a larger Area statistic and a smaller standard error, so you might choose to deploy it as a "champion model."

In his presentation, Daymond asked an important question. Suppose one month later you run the model on a new batch of labeled data and discover that the area under the ROC curve for the new data is only 0.73. Should you be concerned? Does this indicate that the model has degraded and is no longer suitable? Should you cast out this model, re-train all the models (at considerable time and expense), and deploy a new "champion"?

The answer depends on whether you think Area = 0.73 represents a degraded model or whether it can be attributed to sampling variability. The statistic 0.73 is barely more than 1 standard error away from the point estimate, and you will recall that 68% of a normal distribution is within one standard deviation of the mean. From that point of view, the value 0.73 is not surprising. Furthermore, the 90% confidence interval indicates that if you run this model every day for 100 days, you will probably encounter statistics lower than 0.68 merely due to sampling variability. In other words, a solitary low score might not indicate that the model is no longer valid.

Bootstrap estimates of model stability

If "asymptotic normality" makes you nervous, you can use the bootstrap method to obtain estimates of the standard error and the distribution of the Area statistic. The following table summarizes the results of 5,000 bootstrap replications. The results are very close to the asymptotic results in the previous table. In particular, the standard error of the Area statistic is estimated as 0.08 and in 90% of the bootstrap samples, the Area was in the interval [0.676, 0.983]. The conclusion from the bootstrap computation is the same as for the asymptotic estimates: you should expect the Area statistic to bounce around. A value such as 0.73 is not unusual and does not necessarily indicate that the model has degraded.

You can use the bootstrap computations to graphically reveal the stability of the two models. The following comparative histogram shows the bootstrap distributions of the Area statistic for the "Expert" and "Logistic" models. You can see that not only is the upper distribution shifted to the right, but it has less variance and therefore greater stability.

I think Daymond's main points are important to remember. Namely, discrimination and accuracy are important for choosing a model, but understanding the stability of the model (the variation of the estimates) is essential for determining when a model is no longer working well and should be replaced. There is no need to replace a model for a "bad score" if that score is within the range of typical statistical variation.

References

Ling, D. (2019), "Measuring Model Stability", Proceedings of the SAS Global Forum 2019 Conference.

Download the complete SAS program that creates the analyses and graphs in this article.

The post Discrimination, accuracy, and stability in binary classifiers appeared first on The DO Loop.

5月 082019
 

At SAS Global Forum 2019, Daymond Ling presented an interesting discussion of binary classifiers in the financial industry. The discussion is motivated by a practical question: If you deploy a predictive model, how can you assess whether the model is no longer working well and needs to be replaced?

Daymond discussed the following three criteria for choosing a model:

  1. Discrimination: The ability of the binary classifier to predict the class of a labeled observation. The area under an ROC curve is one measure of a binary model's discrimination power. In SAS, you can compute the ROC curve for any predictive model.
  2. Accuracy: The ability of the model to estimate the probability of an event. The calibration curve is a graphical indication of a model's accuracy. In SAS, you can compute a calibration curve manually, or you can use PROC LOGISTIC in SAS/STAT 15.1 to automatically compute a calibration curve.
  3. Stability: Point estimates are often used to choose a model, but you should be aware of the variability of the estimates. This is a basic concept in statistics: When choosing between two unbiased estimators, you should usually choose the one that has smaller variance. SAS procedures provide (asymptotic) standard errors for many statistics such as the area under an ROC curve. If you have reason to doubt the accuracy of an asymptotic estimate, you can use bootstrap methods in SAS to estimate the sampling distribution of the statistic.

Estimates of model stability

My article about comparing the ROC curves for predictive models contains two competing models: A model from using PROC LOGISTIC and an "Expert model" that was constructed by asking domain experts for their opinions. (The source of the models is irrelevant; you can use any binary classifier.) You can download the SAS program that produces the following table, which estimates the area under each ROC curve, the standard error, and 90% confidence intervals:

The "Expert" model has a larger Area statistic and a smaller standard error, so you might choose to deploy it as a "champion model."

In his presentation, Daymond asked an important question. Suppose one month later you run the model on a new batch of labeled data and discover that the area under the ROC curve for the new data is only 0.73. Should you be concerned? Does this indicate that the model has degraded and is no longer suitable? Should you cast out this model, re-train all the models (at considerable time and expense), and deploy a new "champion"?

The answer depends on whether you think Area = 0.73 represents a degraded model or whether it can be attributed to sampling variability. The statistic 0.73 is barely more than 1 standard error away from the point estimate, and you will recall that 68% of a normal distribution is within one standard deviation of the mean. From that point of view, the value 0.73 is not surprising. Furthermore, the 90% confidence interval indicates that if you run this model every day for 100 days, you will probably encounter statistics lower than 0.68 merely due to sampling variability. In other words, a solitary low score might not indicate that the model is no longer valid.

Bootstrap estimates of model stability

If "asymptotic normality" makes you nervous, you can use the bootstrap method to obtain estimates of the standard error and the distribution of the Area statistic. The following table summarizes the results of 5,000 bootstrap replications. The results are very close to the asymptotic results in the previous table. In particular, the standard error of the Area statistic is estimated as 0.08 and in 90% of the bootstrap samples, the Area was in the interval [0.676, 0.983]. The conclusion from the bootstrap computation is the same as for the asymptotic estimates: you should expect the Area statistic to bounce around. A value such as 0.73 is not unusual and does not necessarily indicate that the model has degraded.

You can use the bootstrap computations to graphically reveal the stability of the two models. The following comparative histogram shows the bootstrap distributions of the Area statistic for the "Expert" and "Logistic" models. You can see that not only is the upper distribution shifted to the right, but it also has less variance and therefore greater stability.

I think Daymond's main points are worth remembering. Namely, discrimination and accuracy are important for choosing a model, but understanding the stability of the model (the variation of the estimates) is essential for determining when a model is no longer working well and should be replaced. There is no need to replace a model for a "bad score" if that score is within the range of typical statistical variation.

References

Ling, D. (2019), "Measuring Model Stability", Proceedings of the SAS Global Forum 2019 Conference.

Download the complete SAS program that creates the analyses and graphs in this article.

The post Discrimination, accuracy, and stability in binary classifiers appeared first on The DO Loop.

April 1, 2019
 

Many SAS procedures support the BY statement, which enables you to perform an analysis for subgroups of the data set. Although the SAS/IML language does not have a built-in "BY statement," there are various techniques that enable you to perform a BY-group analysis. The two I use most often are the UNIQUE-LOC technique and the UNIQUEBY technique. The first is more intuitive; the second is more efficient. This article shows how to use SAS/IML to read and process BY-group data from a data set.
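As a preview, here is a minimal sketch of the UNIQUE-LOC idiom on a toy example (the data values are made up). The UNIQUE function returns the sorted distinct group values; the LOC function finds the rows that belong to each group:

proc iml;
Group = {'A','A','B','B','B'};         /* grouping variable */
X = {1 2,  3 4,  5 6,  7 8,  9 10};    /* data matrix; rows are grouped */
u = unique(Group);                     /* sorted unique values: {'A' 'B'} */
do i = 1 to ncol(u);
   idx = loc(Group = u[i]);            /* rows that belong to the i_th group */
   Xi = X[idx, ];                      /* extract those rows */
   groupMean = mean(Xi);               /* do something with the subgroup */
   print (u[i])[label="Group"] groupMean;
end;
quit;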

I previously showed that you can perform BY-group processing in SAS/IML by using the UNIQUEBY technique, so this article uses the UNIQUE-LOC technique. The statistical application is simulating clusters of data. If you have a SAS data set that contains the centers and covariance matrices for several groups of observations, you can then read that information into SAS/IML and simulate new observations for each group by using a multivariate normal distribution.

Matrix operations and BY groups

A BY-group analysis in SAS/IML usually starts with a SAS data set that contains a bunch of covariance or correlation matrices. A simple example is a correlation analysis of each species of flower in Fisher's iris data set. The BY-group variable is the species of iris: Setosa, Versicolor, or Virginica. The variables are measurements (in mm) of the sepals and petals of 150 flowers, 50 from each species. A panel of scatter plots for the iris data is shown to the right. You can see that the three species appear to be clustered. From the shapes of the clusters, you might decide to model each cluster by using a multivariate normal distribution.

You can use the OUTP= and COV options in PROC CORR to output mean and covariance statistics for each group, as follows:

proc corr data=sashelp.iris outp=CorrOut COV noprint;
   by Species;
   var Petal: Sepal:;
run;
 
proc print data=CorrOut(where=(_TYPE_ in ("N", "MEAN", "COV"))) noobs;
   where Species="Setosa";  /* just view information for one group */
   by Species  _Type_ notsorted;
   var _NAME_ Petal: Sepal:;
run;

The statistics for one of the groups (Species='Setosa') are shown. The number of observations in the group (N) is actually a scalar value, but it was replicated to fit into a rectangular data set.

Reading BY-group information into SAS/IML

This section reads the sample size, mean vector, and covariance matrix for all groups. A WHERE clause selects only the observations of interest:

/* Read in N, Mean, and Cov for each species. Use to create a parametric bootstrap 
   by simulating N[i] observations from a MVN(Mean[i], Cov[i]) distribution */
proc iml;
varNames = {'PetalLength' 'PetalWidth' 'SepalLength' 'SepalWidth'};
use CorrOut where (_TYPE_="N" & Species^=" ");       /* N */
read all var varNames into mN[rowname=Species];      /* read for all groups */
print mN[c=varNames];
 
use CorrOut where (_TYPE_="MEAN" & Species^=" ");    /* mean */
read all var varNames into mMean[rowname=Species];   /* read for all groups */
print mMean[c=varNames];
 
use CorrOut where (_TYPE_="COV" & Species^=" ");     /* covariance */
read all var varNames into mCov[rowname=Species];    /* read for all groups */
close;
print mCov[c=varNames];

The output (not shown) shows that the matrices mN, mMean, and mCov contain the vertical concatenation (for all groups) of the sample size, mean vectors, and covariance matrices, respectively.

The grouping variable is Species. You can use the UNIQUE function to get the unique (sorted) values of the groups. You can then iterate over the unique values and use the LOC function to extract only the rows of the matrices that correspond to the ith group. What you do with that information depends on your application. In the following program, the information for each group is used to create a random sample from a multivariate normal distribution that has the same size, mean, and covariance as the ith group:

/* Goal: Write random MVN values in X to data set */
X = j(1, ncol(varNames), .);         /* X variables will be numeric */
Spec = BlankStr(nleng(Species));     /* and Spec will be character */
create SimOut from X[rowname=Spec colname=varNames]; /* open for writing */
 
/* The BY-group variable is Species */
call randseed(12345);
u = unique(Species);                 /* get unique values for groups */
do i = 1 to ncol(u);                 /* iterate over unique values */
   idx = loc(Species=u[i]);          /* rows for the i_th group */
   N = mN[i, 1];                     /* extract scalar from i_th group */
   mu = mMean[i,];                   /* extract vector from i_th group */
   Sigma = mCov[idx,];               /* extract matrix from i_th group */
   /* The parameters for this group are now in N, mu, and Sigma.
      Do something with these values. */
   X = RandNormal(N, mu, Sigma);     /* simulate obs for the i_th group */
   X = round(X);                     /* measure to nearest mm */
   Spec = j(N, 1, u[i]);             /* ID for this sample */
   append from X[rowname=Spec];
end;
close SimOut;
quit;
 
ods graphics / attrpriority=none;
title "Parametric Bootstrap Simulation of Iris Data"; 
proc sgscatter data=SimOut(rename=(Spec=Species));
   matrix Petal: Sepal: / group=Species;
run;

The simulation has generated a new set of clustered data. If you compare the simulated data with the original, you will notice many statistical similarities.

Although the main purpose of this article is to discuss BY-group processing in SAS/IML, I want to point out that the simulation in this article is an example of the parametric bootstrap. Simulating Data with SAS (Wicklin, 2013) states that "the parametric bootstrap is nothing more than the process of fitting a model distribution to the data and simulating data from the fitted model." That is what happens in this program. The sample means and covariance matrices are used as parameters to generate new synthetic observations. Thus, the parametric bootstrap technique is really a form of simulation where the parameters for the simulation are obtained from the data.

In conclusion, sometimes you have many matrices in a SAS data set, each identified by a categorical variable. You can perform "BY-group processing" in SAS/IML by reading all the matrices into one large matrix and then using the UNIQUE-LOC technique to iterate over the rows for each group.

The post Matrix operations and BY groups appeared first on The DO Loop.

February 25, 2019
 

When I run a bootstrap analysis, I create graphs to visualize the distribution of the bootstrap statistics. For example, in my article about how to bootstrap the difference of means in a two-sample t test, I included a histogram of the bootstrap distribution and added reference lines to indicate a confidence interval for the difference of means.

The TTEST procedure supports the BOOTSTRAP statement, which automates the bootstrap process for standard one- and two-sample t tests. New in SAS/STAT 15.1 (SAS 9.4M6), the procedure also supports the PLOTS=BOOTSTRAP option, which automatically creates histograms, Q-Q plots, and scatter plots of various bootstrap distributions.

To demonstrate the new PLOTS=BOOTSTRAP option, I will use the same example that I used to demonstrate the BOOTSTRAP statement. The data are the sedans and SUVs in the Sashelp.Cars data. The goal is to estimate the difference in fuel efficiency between the two vehicle types, as measured by miles per gallon during city driving. The bootstrap analysis enables you to visualize the approximate sampling distribution of the difference-of-means statistic and its standard deviation. The following statements create the data and run PROC TTEST. The PLOTS=BOOTSTRAP option generates the bootstrap graphs. The BOOTSTRAP statement requests 5,000 bootstrap resamples and 95% confidence intervals, based on the percentiles of the bootstrap distribution:

/* create data set that has two categories: 'Sedan' and 'SUV' */
data Sample;
set Sashelp.Cars(keep=Type MPG_City);
if Type in ('Sedan' 'SUV');
run;
 
title "Bootstrap Estimates with Percentile CI";
proc ttest data=Sample plots=bootstrap;
   class Type;
   var MPG_City;
   bootstrap / seed=123 nsamples=5000 bootci=percentile;
run;

The TTEST procedure creates seven tables and eight graphs. The previous article displayed and discussed several tables, so here I display only two of the graphs.

Graph of the bootstrap distribution for the difference of means. The green region indicates the 95% percentile confidence interval, based on the bootstrap samples. Computed by using the PLOTS=BOOTSTRAP option in PROC TTEST in SAS/STAT 15.1.

The first graph is a histogram of the bootstrap distribution of the difference between the sample means for each vehicle type. The distribution appears to be symmetric and approximately normal. The middle of the distribution is close to -5, which is the point estimate for the difference in MPG_City between the SUVs and the sedans in the data. How much should we trust that point estimate? The green region indicates that 95% of the bootstrap samples had differences that were in the green area underneath the histogram, which is approximately [-5.9, -4.1]. This is a 95% confidence interval for the difference.
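If you ever need to construct a percentile interval like this one yourself, the computation is a simple percentile calculation on the bootstrap statistics. Here is a minimal sketch, assuming a hypothetical data set BootDiff that contains the bootstrap differences in a variable named Diff:

proc univariate data=BootDiff noprint;
   var Diff;
   output out=PctlCI pctlpts=2.5 97.5 pctlpre=CI_;  /* 2.5th and 97.5th percentiles */
run;

proc print data=PctlCI noobs; run;   /* CI_2_5 and CI_97_5 bracket the 95% percentile CI */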

Similar histograms (not shown) are displayed for two other statistics: the pooled standard deviation (of the difference between means) and the Satterthwaite standard deviation. The procedure also creates Q-Q plots of the bootstrap distributions.

Scatter plot of the joint bootstrap distribution for the difference of means and the standard deviation of the difference of means. Computed by using the PLOTS=BOOTSTRAP option in PROC TTEST in SAS/STAT 15.1.

The second graph is a scatter plot that shows the joint bootstrap distribution of the mean difference and the pooled standard deviation of the difference, based on 5,000 bootstrap samples. You can see that the two statistics are slightly correlated: a sample that has a large (absolute) mean difference also tends to have a relatively large standard deviation. The 95% prediction region for the joint distribution indicates how these two statistics co-vary among random samples.
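If you capture the bootstrap estimates in a data set, you can quantify the correlation that the scatter plot displays. The following sketch assumes a hypothetical data set BootEst that contains the variables DiffMean and PooledStd for each bootstrap sample:

proc corr data=BootEst pearson;      /* Pearson correlation of the two bootstrap statistics */
   var DiffMean PooledStd;
run;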

In summary, the TTEST procedure in SAS/STAT 15.1 supports the new PLOTS=BOOTSTRAP option, which automatically creates many graphs that help you visualize the bootstrap distributions of the statistics. If you are conducting a bootstrap analysis for a t test, I highly recommend using these plots to visualize the bootstrap distributions and results.

The post Graphs of bootstrap statistics in PROC TTEST appeared first on The DO Loop.