Bootstrap and Resampling

5月 252022
 

Many modern statistical techniques incorporate randomness: simulation, bootstrapping, random forests, and so forth. To use the technique, you need to specify a seed value, which determines pseudorandom numbers that are used in the algorithm. Consequently, the seed value also determines the results of the algorithm. In theory, if you know the seed value and the internal details of the pseudorandom algorithm, then the stream is completely determined, and the results of an algorithm are reproducible. For example, if I publish code for a simulation or bootstrap method in SAS, you can reproduce my computations as long as my program specifies the seed value for every part of the program that uses random numbers.

Occasionally, programmers post questions on discussion forums about how to compare the results that were obtained by using different software. A typical question might be, "My colleague used R to run a simulation. He gets a confidence interval of [14.585, 21.42]. When I run the simulation in SAS, I get [14.52, 21.19]. How can I get the same results as my colleague?" In most cases, you can't get the same answer. More importantly, it doesn't matter: these two results are equally valid estimates. When you run an algorithm that uses pseudorandom numbers, you don't get THE answer, you get AN answer. There are many possible answers you can get. Each time you change the random number stream, you will get a different answer, but all answers are equally valid.

This article examines how the random number seed affects the results of a bootstrap analysis by using an example of a small data set that has N=10 observations. I will run the same bootstrap analysis for 50 different seeds and compare the results of a confidence interval. The same ideas apply to simulation studies, machine learning algorithms, and other algorithms that use randomness to obtain an answer.

A bootstrap estimate of a confidence interval

The following data were used by Davison, Hinkley, and Schechtman (1986). They ran an analysis to examine the bootstrap distribution of the sample mean in a small nonnormal set of data. The goal of the following analysis is to obtain a bootstrap estimate of a 90% confidence interval for the population mean. Because we are going to run the same analysis many times with different random number seeds, I will put the bootstrap steps in a SAS macro. The following statements run two bootstrap analyses on the same data. The only difference is the value of the random-number seeds.

data Sample;
input x @@;
datalines;
9.6 10.4 13.0 15.0 16.6 17.2 17.3 21.8 24.0 33.8
;
 
/* macro to perform one bootstrap analysis that uses NumSamples resamples
   and uses a particular seed. Results are in the data set BootEst */
%macro Do1Boot(seed, NumSamples);
   /* 1. generate NumSamples bootstrap resamples from SEED value */
   proc surveyselect data=Sample NOPRINT seed=&SEED
        method=urs              /* URS = resample with replacement */
        samprate=1              /* each bootstrap sample has N obs */
        OUTHITS                 /* do not use a frequency var */
        reps=&NumSamples(repname=SampleID)        /* generate NumSamples samples */
        out=BootSamp;
   run;
 
   /* 2. compute sample mean for each resample; write to data set */
   proc means data=BootSamp NOPRINT;
      by SampleID;
      var x;
      output out=BootStats mean=Mean;  /* bootstrap distribution */
   run;
   /* 3. find boostrap estimates of mean, stdErr, and 90% CI */
   proc means data=BootStats NOPRINT;
      var Mean;
      output n=N mean=Mean stddev=StdErr p5=LCL90 p95=UCL90
             out=BootEst(drop=_TYPE_ _FREQ_);
   run;
   /* add the seed value to these statistics */
   data BootEst;
      Seed = &seed;
      set BootEst;
   run;
%mend;
 
title "A Boostrap Analysis for Seed=1";
%Do1Boot(1, 1000);
proc print data=BootEst noobs; run;
 
title "A Boostrap Analysis for Seed=1234";
%Do1Boot(1234, 1000);
proc print data=BootEst noobs; run;

The output shows that the bootstrap estimates are similar, but not exactly the same. This is expected. Each analysis uses a different seed, which means that the bootstrap resamples are different between the two runs. Accordingly, the means of the resamples are different, and so the bootstrap distributions are different. Consequently, the descriptive statistics that summarize the bootstrap distributions will be different. To make the differences easier to see, I used only 1,000 samples in each bootstrap analysis. If you increase the number of resamples to, say, 10,000, the differences between the two analyses will be smaller.

The fact that the random number seed "changes" the result makes some people nervous. Which value should you report to a journal, to a customer, or to your boss? Which is "the right answer"?

Both answers are correct. A bootstrap analysis relies on randomness. The results from the analysis depend on the pseudorandom numbers that are used.

Do the results depend on the seed value?

You should think of each bootstrap estimate as being accurate to some number of digits, where the number of digits depends on the number of resamples, B. For some estimates, such as the bootstrap standard error, you can calculate that the accuracy is proportional to 1/sqrt(B). Thus, larger values for B tend to shrink the difference between two bootstrap analyses that use different random number streams.

One way to visualize the randomness in a bootstrap estimate is to run many bootstrap analyses that are identical except for the random-number seed. The following SAS macro runs 50 bootstrap analyses and uses the seed values 1,2,3,..., 50:

/* Run nTrials bootstrap analyses. The i_th trials uses SEED=i. The estimates are written 
   to the data set AllBoot by concatenating the results from each analysis. */
%macro DoNBoot(NumSamples, nTrials);
   options nonotes;
   proc datasets noprint nowarn;   /* delete data set */
      delete AllBoot;
   quit;
   %do seed = 1 %to &nTrials;
       %Do1Boot(&seed, &numSamples);   /* run with seed=1,2,3,...,nTrials */
       proc append base=AllBoot data=BootEst force; 
       run;
   %end;
   options notes;
%mend;
 
/* Run 50 bootstrap analyses: numSamples=1000, nTrials=50, seeds=1,2,3,...,50 */
%DoNBoot(1000, 50); 
 
/* find boostrap estimates of mean, stdErr, and 90% CI */
title "Variation in 50 Traditional Boostrap Estimates";
proc means data=AllBoot N Mean StdDev Min Max;
   var Mean StdErr LCL90 UCL90;
run;

The table shows how four estimates vary across the 50 bootstrap analyses:

  • The bootstrap estimate of the mean varies in the interval [17.76, 18.00]. The average of the 50 estimates is 17.88.
  • The bootstrap estimate of the standard error varies in the interval [2.06, 2.25]. The average of the 50 estimates is 17.88.
  • The bootstrap estimate of the lower 90% confidence interval varies in the interval [14.41, 14.80]. The average of the 50 estimates is 14.56.
  • The bootstrap estimate of the upper 90% confidence interval varies in the interval [21.36, 22.09]. The average of the 50 estimates is 21.63.

The table shows that the variation among the 50 estimates is small (and can be made smaller by increasing B). If you and your colleague run an identical bootstrap analysis but use different seed values, you are likely to obtain results that are similar to each other. Even more comforting is that the inferences that you make from the analyses should be the same. For example, suppose you are testing the null hypotheses that these data came from a population in which the mean is μ=20. Let's look at the 90% confidence intervals for the 50 analyses:

title "Traditional Bootstrap Estimates";
title2 "Varying Seed Values";
proc sgplot data=AllBoot;
   refline &ObsStat / axis=Y label="Sample Mean";
   highlow x=Seed low=LCL90 high=UCL90 / legendlabel="90% CL";
   scatter x=Seed y=Mean / legendlabel="Bootstrap Estimate";
   yaxis label="x" ;
run;

The graph shows that the value μ=20 is inside all 50 of the confidence intervals. Therefore, the decision to accept or reject the null hypothesis does not depend on the seed value when μ=20.

On the other hand, if you are testing whether the population mean is μ=21.5, then some analyses would reject the null hypothesis whereas others would fail to reject it. What to do in that situation? I would acknowledge that there is uncertainty in the inference and conclude that it cannot be reliably decided based on these data and these analyses. It would be easy to generate additional bootstrap samples (increase B). You could also advocate for collecting additional data, if possible.

Summary

In summary, the result of a bootstrap analysis (and ANY statistical technique that uses randomness) depends on the stream of pseudorandom numbers that are used. If you change the seed or use different statistical software, you are likely to obtain a different result, but the two results should be similar in many cases. You can repeat the analysis several times with different seeds to understand how the results vary according to the random-number seed.

If some seed values enable you to reject a hypothesis and others do not, you should acknowledge that fact. It is not ethical to use the seed value that gives the result that you want to see! (See "How to lie with a simulation" for an example.)

If you are nervous about the result of an algorithm that uses randomness, I suggest that you rerun the analysis several times with different seed values. If the results are consistent across seeds, then you can have confidence in the analysis. If you get vastly different results, then you are justified in being nervous.

The post How much does a bootstrap estimate depend on the random number stream? appeared first on The DO Loop.

5月 232022
 

I have previously blogged about ways to perform balanced bootstrap resampling in SAS. I recently learned about an easier way: Since SAS/STAT 14.2 (SAS 9.4M4), the SURVEYSELECT procedure has supported balanced bootstrap sampling. This article reviews balanced bootstrap sampling and shows how to use the METHOD=BALBOOT option in PROC SURVEYSELECT to run a balanced bootstrap in SAS. It also addresses the question: Should the balanced bootstrap method be preferred over the traditional bootstrap?

What is the balanced bootstrap?

The traditional bootstrap method creates bootstrap samples (called resamples) by sampling with replacement from the original data, which has N observations. This is called "uniform" resampling because each observation has a uniform probability of 1/N of being selected at each step of the resampling process. Within the union of the B bootstrap samples, each observation has an expected value of appearing B times, but the actual distribution of the frequencies is multinomial.

Balanced bootstrap resampling (Davison, Hinkley, and Schechtman, 1986) is an alternative process in which each observation appears exactly B times in the union of the bootstrap samples. In SAS, you can use the METHOD=BALBOOT option in PROC SURVEYSELECT to generate balanced bootstrap samples.

How many times is each observation selected in a traditional bootstrap?

To demonstrate the difference between a traditional set of bootstrap resamples and a set of balanced bootstrap samples, consider the following 10 observations from Davison, Hinkley, and Schechtman (1986). DHS used these data to bootstrap an estimate of the sampling distribution of the sample mean. The sample size is small and the data are not normal, therefore we do not expect the sampling distribution of the mean to be normally distributed.

data Sample;
input x @@;
datalines;
9.6 10.4 13.0 15.0 16.6 17.2 17.3 21.8 24.0 33.8
;

The graph shows the distribution of the data. You can use PROC MEANS to estimate the mean, standard error, and 90% confidence interval (CI) for the population mean:

title "Statistics on Observed Sample";
proc means data=Sample N Mean StdErr alpha=0.1 CLM;
   var x;
run;

The standard error and the CI estimates are based on a formula that applies when the sample is large or is normally distributed. Neither situation applies to these data. However, you can use the bootstrap method to obtain bootstrap estimates of the standard error and CI.

Traditional bootstrap samples are generated by sampling from the data with replacement. You can use PROC SURVEYSELECT and the METHOD=URS option to generate bootstrap samples. The following statements generate one data set that contains B=10,000 resamples of size N=10. Each sample is identified by a unique value of the SampleID variable. After generating the B=10,000 samples, you can use PROC FREQ to determine how often each observation was selected.

/* use the traditional bootstrap sampling with METHOD=URS */
%let NumSamples = 1000;      /* B = number of bootstrap resamples */
proc surveyselect data=Sample NOPRINT seed=1  /* SEED for RNG */
     method=urs              /* URS = resample with replacement */
     samprate=1              /* each bootstrap sample has N obs */
     OUTHITS                 /* do not use a frequency var */
     reps=&NumSamples(repname=SampleID)        /* generate NumSamples samples */
     out=BootSamp;
run;
 
/* overall, how often was each observation selected? */
proc freq data=BootSamp;
   tables x;
run;

The output from PROC FREQ shows that, on average, each observation is selected about B times. However, some observations are selected more often than others. One observation (9.6) was selected only 9,899 times whereas another (17.3) was selected 10,175 times. This can affect the statistics on the resamples. Overall, the value 9,6 appears less often than 17.3, so it has less influence on the bootstrap statistics. If an outlier is selected more or less often than expected, this might affect the inferences that you are trying to make.

How to run a balanced bootstrap analysis in SAS

An alternative way to generate bootstrap samples is by using a method called balanced bootstrap sampling. In balanced bootstrap sampling, each observation is selected equally often over the course of the B resamples. You can use the METHOD=BALBOOT option (and omit the SAMPRATE= option) to generate B=10,000 bootstrap resamples, as follows:

proc surveyselect data=Sample NOPRINT seed=1  /* SEED for RNG */
     method=BalBoot          /* balanced bootstrap sampling */
     OUTHITS                 /* do not use a frequency var */
     reps=&NumSamples(repname=SampleID) /* generate NumSamples samples */
     out=BalBootSamp;
run;
 
/* overall, how often was each observation selected? */
proc freq data=BalBootSamp;
   tables x;
run;

The output from PROC FREQ shows that each observation was selected a total of B times over the course of generating the B bootstrap samples. Of course, for any one observation, some resamples do not contain it, others contain it once, and others contain it multiple times. So, within each sample, there is randomness regarding which observations are included and excluded from the sample.

The remainder of the balanced bootstrap analysis is identical to the analysis on a traditional set of bootstrap samples: Use a BY-group analysis to analyze all resamples and generate the bootstrap distribution of the statistic. The following statements compute the samples means for each of the B resamples. The statistics in the BalBootStats data set form the bootstrap distribution, which approximates the true sampling distribution of the sample mean.

/* compute sample mean for each resample; write the bootstrap distribution */
proc means data=BalBootSamp NOPRINT;
   by SampleID;
   var x;
   output out=BalBootStats mean=Mean;  /* bootstrap distribution */
run;
/* find boostrap estimates of mean, stdErr, and 90% CI */
proc means data=BalBootStats N Mean StdDev P5 P95;
   var Mean;
run;

The bootstrap estimate of the standard error is smaller than the asymptotic estimate. Similarly, the bootstrap confidence interval is smaller than the asymptotic CI.

Notice that the bootstrap estimate of the mean is exactly the same as the sample mean! This is not a coincidence: Since the balanced bootstrap contains B copies of each of the original observations, the "mean of the means" is identical to the original sample mean. Recall that the sample mean is an unbiased estimate of the population mean. Consequently, when you use balanced bootstrap sampling, the bootstrap estimate is also an unbiased estimate of the mean.

However, this nice property of the balanced bootstrap (the bootstrap estimate equals the observed estimate) applies only to the sample mean and does not extend to other statistics. If you bootstrap a different statistic (for example, the skewness or a correlation coefficient) the balanced bootstrap estimate will typically be different from the observed statistic.

For completeness, the following graph shows the bootstrap distribution for the balanced samples. The red line indicates the bootstrap estimate for the mean, which is the same as the sample mean on the original data. The blue lines represent the 5th and 95th percentiles of the distribution and are therefore estimates for a 90% confidence interval for the population mean.

Should you use the balanced bootstrap?

Should you use the balanced bootstrap method instead of the traditional bootstrap method? In practice, I don't think it matters much which method you use: The inferences that you make should not depend strongly on the method. For example, if you are trying to obtain a confidence interval for a parameter, the estimates should be similar.

On the other hand, there are no major drawbacks to the balanced bootstrap. It provides estimates that are the same as or, on occasion, better than the traditional bootstrap sampling method. So, if it is easy to get a balanced bootstrap sample (as it is in SAS), you might as well use it.

I think the balanced bootstrap is preferred if the data includes an extreme outlier. Look back at the section that used PROC FREQ to examine the frequency that each observation was selected in the traditional bootstrap analysis. You can see that some observations are chosen more than others. These frequencies change if you change the random number seed. Accordingly, some seeds select an outlier more often than other seeds. If you use a seed that selects the outlier many times, the bootstrap distribution will include many extreme statistics. If you change the seed, the new bootstrap distribution might include fewer extreme statistics. The balanced bootstrap removes that uncertainty from your analysis: you know that the outlier appears exactly B times among the set of all B bootstrap samples.

Summary

The METHOD=BALBOOT method in PROC SURVEYSELECT enables you to generate balanced bootstrap resamples. You can analyze these resamples in the same way that you analyze traditional bootstrap resamples. A balanced bootstrap removes one element of uncertainty from your bootstrap estimates. You know that each observation was selected the same number of times over the set of all resamples. This might be comforting if your data contain some extreme outliers, and you want to be certain that the outliers are not overrepresented in the resamples. In this situation, I expect the balanced bootstrap to reduce some of the seed-to-seed variability that you might see in the traditional bootstrap.

Do you have experience with using the balanced bootstrap for real data analysis? Share your experience and thoughts by leaving a comment.

The post The balanced bootstrap in SAS appeared first on The DO Loop.

5月 042022
 

In The Essential Guide to Bootstrapping in SAS, I note that there are many SAS procedures that support bootstrap estimates without requiring the analyst to write a program. I have previously written about using bootstrap options in the TTEST procedure. This article discusses the NLIN procedure, which can fit nonlinear models to data by using a least-squares method. The NLIN procedure supports the BOOTSTRAP statement, which enables you to compute bootstrap estimates of the model parameters. In nonlinear models, the sampling distribution of a regression estimate is often non-normal, so bootstrap confidence intervals can be more useful than the traditional Wald-type confidence intervals that assume the estimates are normally distributed.

Data for a dose-response curve

The documentation for PROC NLIN contains an example of fitting a parametric dose-response curve to data. The example only has seven observations, but I have created an additional seven (fake) observations to make the example richer. The graph to the right visualizes the data and a parametric dose-response curve that is fit by using PROC NLIN. The curve represents the expected value of the log-logistic model
\(f(x) = \delta + \frac{\alpha - \delta }{1+\gamma \exp \left\{ \beta \ln (x)\right\} }\)
where x is the dose. The response is assumed to be bounded in the interval [δ, α], where δ is the response for x=0, and α is the limit of the response as x → ∞. The parameters β and γ determine the shape of the dose-response curve. In the following example, δ=0, and the data suggest that α ≈ 100.

The graph to the right demonstrates the data and the dose-response curve that best fits this data in a least-squares sense.

The following DATA step defines the data. The call to PROC NLIN specifies the model and produces the graph as well as a table of parameter estimates and the correlation between the regression parameters:

data DoseResponse;
   input dose y @@;
   logdose = log(dose);
datalines;
0.009   96.56   0.035   94.12   0.07    89.76
0.15    60.21   0.20    39.95   0.28    21.88
0.50     7.46   0.01    98.1    0.05    90.2
0.10    83.6    0.15    55.1    0.30    32.5
0.40    12.8    0.45     9.6
;
 
proc nlin data=DoseResponse plots(stats=none)=(fitplot); /* request the fit plot */
   parameters alpha=100 beta=3 gamma=300;                /* three parameter model */
   delta = 0;                                            /* a constant; not a parameter */
   Switch = 1/(1+gamma*exp(beta*log(dose)));             /* log-logistic function */
   model y = delta + (alpha - delta)*Switch;
run;

Look at the statistics for the standard errors, the confidence intervals (CIs), and the correlation of the parameters. Notice that these columns contain the word "Approximate" (or "Approx") in the column headers. That is because these statistics are computed by using the formulas from linear regression models. Under the assumptions of ordinary linear regression, the estimates for the regression coefficients are (asymptotically) distributed according to a multivariate normal distribution. For nonlinear regression models, you do not know the sampling distribution of the parameter estimates in advance. However, you can use bootstrap methods to explore the sampling distribution and to improve the estimates of standard error, confidence intervals, and the correlation of the parameters.

The BOOTSTRAP statement in PROC NLIN

You could carry out the bootstrap manually, but PROC NLIN supports the BOOTSTRAP statement, which automatically runs a bootstrap analysis for the regression coefficients. The BOOTSTRAP statement supports the following options:

  • Use the SEED= option to specify the random-number stream.
  • Use the NSAMPLES= option to specify the number of bootstrap samples that you want to use to make inferences. I suggest at least 1000, and 5000 or 10000 are better values.
  • Use the BOOTCI option to request bootstrap confidence intervals for the regression parameters. You can use the BOOTCI(BC) suboption to request the bias-corrected and adjusted method. Use the BOOTCI(PERC) option for the traditional percentile-based confidence intervals. I recommend the bias-corrected option, which is the default.
  • Use the BOOTCOV option to display a table that shows the estimated covariance between parameters.
  • Use the BOOTPLOTS option to produce histograms of the bootstrap statistics for each regression parameter.

The following call to PROC NLIN repeats the previous analysis, but this time requests a bootstrap analysis to improve the estimates of standard error, CIs, and covariance between parameters:

proc nlin data=DoseResponse plots(stats=none)=(fitplot); /* request the fit plot */
   parameters alpha=100 beta=3 gamma=300;                /* three parameter model */
   delta = 0;                                            /* a constant; not a parameter */
   Switch = 1/(1+gamma*exp(beta*log(dose)));             /* log-logistic function */
   model y = delta + (alpha - delta)*Switch;
   BOOTSTRAP / seed=12345 nsamples=5000 bootci(/*BC*/) bootcov bootplots;
run;
run;

The Parameter Estimates table is the most important output because it shows the bootstrap estimates for the standard error and CIs. For these data, you can see that the estimates for the alpha and beta parameters are not very different from the approximate estimates. However, the estimates for the gamma parameter are quite different. The bootstrap standard error is almost 2 units wider. The bootstrap bias-corrected confidence interval is about 5 units shorter and is shifted to the right as compared to the traditional CI.

You can study the bootstrap distribution of the gamma parameter to understand the differences. The following histogram is produced automatically by PROC NLIN because we used the BOOTPLOTS option. You can see that the distribution of the gamma estimates is not normal. It shows a moderate amount of positive skewness and kurtosis. Thus, the bootstrap estimates are noticeably different from the traditional estimates.

The distributions for the alpha and beta statistics are not shown, but they display only small deviations from normality. Thus, the bootstrap estimates for the alpha and beta parameters are not very different from the traditional estimates.

I do not show the bootstrap estimates of the covariance of the parameters. However, if you convert the bootstrap estimates of covariance to a correlation matrix, you will find that the bootstrap estimates are close to the approximate estimates that are shown earlier.

Summary

The BOOTSTRAP statement in PROC NLIN makes it easy to perform a bootstrap analysis of the regression estimates for a nonlinear regression model. The BOOTSTRAP statement will automatically produce bootstrap estimates of the standard error, confidence intervals, and covariance of parameters. In addition, you can use the BOOTPLOTS option to visualize the bootstrap distribution of the estimates. As shown in this example, sometimes the distribution of a parameter estimate is non-normal, so a bootstrap analysis produces better inferential statistics for the regression analysis.

The post Bootstrap estimates for nonlinear regression models in SAS appeared first on The DO Loop.

1月 032022
 

Last year, I wrote almost 100 posts for The DO Loop blog. My most popular articles were about data visualization, statistics and data analysis, and simulation and bootstrapping. If you missed any of these gems when they were first published, here are some of the most popular articles from 2021:

SAS programming and data visualization

Panel of Regression Diagnostic Plots in SAS

SAS programming and data visualization

  • Display the first or last observations in data: Whether your data are in a SAS data set or a SAS/IML matrix, this article describes how to display to print the top rows (and bottom rows!) of a rectangular data set.
  • Customize titles in a visualization of BY groups: Have you ever use the BY statement to graph data across groups, such as Males/Females or Control/Experimental groups? If so, you might want to learn how to use the #BYVAR and #BYVAL keywords to customize the titles that appear on each graph.
  • Horizontal Bar Chart in SAS

  • Reasons to prefer a horizontal bar chart: Bar charts are used to visualize the counts of categorical data. Vertical charts are used more often, but there are advantages to using a horizontal bar chart, especially if you are displaying many categories or categories that have long labels. This article shows how to create a horizontal bar chart in SAS and gives examples for which a horizontal chart is preferable.
  • Why you should visualize distributions: It is common to report the means (of difference between means) for different groups in a study. However, means and standard deviations only tell part of the story. This article shows four examples where the mean difference between group scores is five points. However, when you plot the data for each example, you discover additional information about how the groups differ.

Simulation and Bootstrapping

Since I am the guy who wrote the book on statistical simulation in SAS, it is not surprising that my simulation articles are popular. Simulation helps analysts understand expected values, sampling variation, and standard errors for statistics.

Bootstrapping Residuals for a Time Series Regression Model

Did you resolve to learn something new in the New Year? Reading these articles requires some effort, but they provide tips and techniques that make the effort worthwhile. So, read (or re-read!) these popular articles from 2021. To ensure you don't miss a single article in 2022, consider subscribing to The DO Loop.

The post Top 10 posts from <em>The DO Loop</em> in 2021 appeared first on The DO Loop.

10月 042021
 

A reader asked whether it is possible to find a bootstrap sample that has some desirable properties. I am using the term "bootstrap sample" to refer to the result of randomly resampling with replacement from a data set. Specifically, he wanted to find a bootstrap sample that has a specific value of the mean and standard deviation. If not, can we find a bootstrap sample whose mean and standard deviation are close to the target value? It's an interesting question. Each bootstrap sample is composed of the original observations, but because it is a sample with replacement, some observations can appear more often than others.

For very small values of N, you can solve the reader's problem by generating all bootstrap samples. In a previous article, I used the data {-1, -0.1, -0.6, 0, 0.5, 0.8, 4}, which has the sample mean 0.51 for N=7. I mentioned that one possible bootstrap sample is {4, 4, 4, 4, 0.5, 4, 4}, which has the mean 3.5. This bootstrap sample contains only large values in the original data. The mean of that sample is much greater than the mean of the original data, and the standard deviation is much less.

Of course, generating all bootstrap samples is impractical for even moderate values of N. Furthermore, bootstrap samples like this are improbable if the probability of choosing each observation is 1/N. But what if you assign a higher probability to select some observations and a lesser probability to others? For example, the sample that has six 4s is not improbable if the probability of choosing the 4 is 0.9. If you sample with unequal probability (and still with replacement) the statistics in the samples might be very different from the statistics in the original data.

This kind of sampling is no longer classical bootstrap sampling. It could be considered a kind of importance sampling. Whatever you call it, each sample you generate is a resampling from the original data, but you can control the kinds of samples that you generate by adjusting the probability of selecting each observation.

Sampling with non-uniform probability

Recall that there are four main ways to sample. You can sample with or without replacement; you can sample with equal or unequal probability. In this article, we use sampling with replacement with unequal probability. SAS supports several ways to sample with replacement and with unequal probability. This article uses PROC SURVEYSELECT with the METHOD=PPS_WR option. The acronym PPS_WR stands for "probability proportional to size (PPS), with replacement."

To illustrate sampling with unequal probability, consider the following data with N=1000 that are simulated from a uniform distribution on [0, 20]:

%let N = 1000;    /* sample size */
data Have;
call streaminit(54321);
do i = 1 to &N;
   x = round( rand("uniform", 0, 20), 0.01);  /* round to nearest 0.01 */
   output;
end;
run;
 
/* compute samples mean and std dev of original data */
proc means data=Have mean std min max;
  var x;
run;

The mean and standard deviation of the data are 10.04 and 5.80, respectively. If you were to perform the usual bootstrap sampling, which selects each observation with equal probability, then most samples would have a mean in the interval [9.5, 10.5] and a standard deviation in the interval [5.6, 6.0].

Suppose you are seeking a resample that has unusual properties, such as Mean=9 and StdDev=3. It would be highly unusual to obtain these statistics by using the usual method of resampling observations with equal probability. However, you can assign a high probability to observations whose values are near the value 9. The following DATA step uses the normal density function for the N(9,3) distribution to assign the probability of selecting an observation:

/* Target: mean=9; SD = 3 */
%let mean = 9;
%let StdDev = 3;
/* Use the normal distribution to assign the probability of choosing an observation */
data Have2;
set Have;
Prob = pdf("Normal", x, &mean, &StdDev);
run;

You can use the METHOD=PPS_WR option to resample from the data by using this probability of selecting each observation. The following call to PROC SURVEYSELECT generates a resample of size N from the data but gives preference to observations that are closer to x=9.

/* Sample with replacement from the data, where the probability 
   of choosing an obs is given by the lProb variable */
proc surveyselect noprint data=Have2 out=Resample
     method=PPS_WR 
     seed=12345 N=&N;       /* use OUTHITS option if you want N obs */
   size Prob;               /* specify the probability variable */
run;
 
proc means data=Resample mean std;  /* a different seed will give a different result */
  var x;
  freq NumberHits;
run;

The output from PROC MEANS shows that for these uniformly distributed data, you can obtain a resample whose mean and standard deviations are close to the target values. Of course, using a different random number seed will result in a different resample and different statistics.

You can create a graph to visualize the observations that were chosen for this resample. The following graph plots the NumberHits variable, which records the number of times that each observation appears in the sample:

title "Number of Times Each Observation Appears in Resample";
proc sgplot data=Resample;
   scatter x=x y=NumberHits / markerattrs=(symbol=CircleFilled) transparency=0.75;
   xaxis values=(0 to 20 by 2);
   yaxis values=(1 to 8) valueshint;
run;

You can see that no observations for which x ≥ 18 were selected. Observations for which x ≈ 9 are selected more often than observations for which x is far from 9.

Random samples whose statistics are close to target values

If you change the random number seed, you will get a different resample that has different statistics. If you repeat this process many times, you can choose the resample whose statistics are closest to the target values. You can add the REPS= option to PROC SURVEYSELECT to generate additional resamples. This creates a variable (Replicate) that you can use as a BY-group variable in PROC MEANS. For each (Mean, StdDev) pair, you can compute a distance to the target value. You can then sort the statistics and identify the resamples whose statistics are closest to the target, as follows:

proc surveyselect noprint data=Have2 out=Resample100
     method=PPS_WR 
     Reps=100
     seed=12345 N=&N;       /* use OUTHITS option if you want N obs */
   size Prob;               /* specify the probability variable */
run;
 
/* Compute mean and StdDev of each resample */
proc means data=Resample100 mean std noprint;
  by Replicate;
  var x;
  freq NumberHits;
  output out=BootMeans mean=Mean stddev=StdDev;
run;
 
/* compute distance from statistics to target values */
data Diff;
set BootMeans;     
DiffMean = &Mean - Mean;
DiffStd  = &StdDev - StdDev;
EuclidDist = euclid(DiffMean, DiffStd);  /* distance between observation and a target value */
run;
 
proc sort data=Diff;
   by EuclidDist;    /* sort by how close statistics are to target values */
run;
 
proc print data=Diff(obs=5);
   var Replicate Mean StdDev DiffMean DiffStd EuclidDist;
run;

The top five resamples (from 100) are shown. For these resamples, the mean is close to 9 and the standard deviation is close to 3, which are the target values. If you measure "closeness" by using a Euclidean distance, then Replicate=45 is the resample whose statistics are closest to the target values.

Why the example works

It is possible to turn this example into a general-purpose method for finding a resample that has statistics that are close to target values. However, it requires some additional work. I will outline why this example works and what is required to generalize the example.

Let f(x) be the probability density function of the data. If you sample uniformly from the distribution, then \(E(X) = \int x f(x)\, dx\) is the expected value of the distribution. The classical bootstrap method generates samples for which the expected sample mean is E(X). However, if you sample according to some other probability distribution, g(x), then the expected value of the sample is \(\int x f(x) g(x)\, dx\). By strategically choosing g(x), you can control the expected value. The same is true for higher-order moments, such as the variance.

For this example, the data has a uniform density, so it was easy to choose a value for g(x) so that the expected mean and variance matches target values. For data that has a non-uniform density, the process is more complicated because you must consider the convolution of f(x) and g(x).

Even for this example, you might not be able to match every set of target values. For example, you probably can't find a resample that has a large mean and also a large variance.

Summary

This article considers whether it is possible to find a resample that has a desired set of statistics. A classical bootstrap sample selects observations with replacement and with uniform probability. With high probability, the bootstrap sample has statistical properties that are somewhat similar to the original data. However, if you sample with nonuniform probability, you can obtain resamples whose statistical properties are quite different from the original data. The article uses the METHOD=PPS_WR option in PROC SURVEYSELECT to create an example in the special case where the data are uniformly distributed.

The post Choose samples with specified statistical properties appeared first on The DO Loop.

9月 012021
 

The number of possible bootstrap samples for a sample of size N is big. Really big.

Recall that the bootstrap method is a powerful way to analyze the variation in a statistic. To implement the standard bootstrap method, you generate B random bootstrap samples. A bootstrap sample is a sample with replacement from the data. The phrase "with replacement" is important. In fact, a bootstrap sample is sometimes called a "resample" because it is generated by sampling the data with REplacement.

This article compares the number of samples that you can draw "with replacement" to the number of samples that you can draw "without replacement." It demonstrates, for very small samples, how to generate the complete set of bootstrap samples. It also explains why generating a complete set is impractical for even moderate-sized samples.

The number of samples with and without replacement

For a sample of size N, there are N! samples if you sample WITHOUT replacement (WOR). The quantity N! grows very quickly as a function of N, so the number of permutations is big for even moderate values of N. In fact, I have shown that if N≥171, then you cannot represent N! by using a double-precision floating-point value because 171! is greater than MACBIG (=1.79769e+308), which is the largest value that can be represented in eight bytes.

However, if you sample WITH replacement (WR), then the number of possible samples is NN, which grows even faster than the factorial function! If the number of permutations is "big," then the number of WR samples is "really big"!

For comparison, what do you think is the smallest value of N such that NN exceeds the value of MACBIG? The answer is N=144, which is considerably smaller than 171. The computation is shown at the end of this article.

Another way to compare the relative sizes of N! and NN is to print a few values of both functions for small values of N. The following table shows the values for N≤10:

proc iml;
N = 1:10;
nPerm = fact(N);     /* N!  */
nResamp = N##N;      /* N^N */
both = nPerm // nResamp;
print both[r={"Permutations" "Samples (WR)"} c=N format=comma15. L=""];

Clearly, the number of bootstrap samples (samples WR) grows very big very fast. This is why the general bootstrap method uses random bootstrap samples rather than attempting some sort of "exact" computation that uses all possible bootstrap samples.

An exact bootstrap computation that uses all possible samples

If you have a VERY small data set, you could, in fact, perform an exact bootstrap computation. For the exact computation, you would generate all possible bootstrap samples, compute the distribution on each sample, and thereby obtain the exact bootstrap distribution of the statistic.

Let's carry out this scheme for a data set that has N=7 observations. From the table, we know that there are exactly 77 = 823,543 samples with replacement.

For small N, it's not hard to construct the complete set of bootstrap samples: just form the Cartesian product of the set with itself N times. In the SAS/IML language, you can use the ExpandGrid function to define the Cartesian product. If the data has 7 values, the Cartesian product will be a matrix that has 823,543 rows and 7 columns.

The following SAS/IML program performs a complete bootstrap analysis of the sample mean. The sample, S, has 7 observations. The minimum value is -1; the maximum value is 4. The sample mean is approximately 0.51. What is the bootstrap distribution of the sample mean? You can form the set of all possible subsamples of size 7, where the elements are resampled from S. You can then compute the mean of every resample and plot the distribution of the means:

proc iml;
S = {-1  -0.1  -0.6  0  0.5  0.8  4};
/* 1. compute original statistic */
m = S[:];          /* = mean(S`) */
print m;
 
/* 2. Generate ALL resamples! */
/* Cartesian product of the elements in S is
   S x S x ... x S
   \------v------/
         N times                              */
z = ExpandGrid(S, S, S, S, S, S, S);
 
/* 3. Compute statistic on each resample */
means = z[,:];
 
/* 4. Analyze the bootstrap distribution */
title "Complete Bootstrap Distribution of the Mean";
title2 "N = 7";
call histogram(means) rebin={-1 0.1} xvalues=-1:4
     other="refline 0.51/axis=x lineattrs=(color=red);";

The graph shows the mean for every possible resample of size 7. The mean of the original sample is shown as a red vertical line. Here are some of the resamples in the complete set of bootstrap samples:

  • The resamples where a datum is repeated seven times. For example, one resample is {-1, -1, -1, -1, -1, -1, -1}, which has a mean of -1. Another resample is {4, 4, 4, 4, 4, 4, 4}, which has a mean of 4. These two resamples define the minimum and maximum values, respectively, of the bootstrap distribution. There are only seven of these resamples.
  • The resamples where a datum is repeated six times. For example, one of the resamples is {-1, -1, -1, -1, -1, -1, 4}. Another is {-1, -1, 0, -1, -1, -1, -1}. A third is {4, 4, 4, 4, 4, 4, 4}, which has a mean of 4. These two resamples define the minimum and maximum values. Another resample is {4, 4, 4, 4, 0.5, 4, 4}. There are 73 resamples of this type.
  • The resamples where each datum is present exactly one time. These sets are all permutations of the sample itself, {-1 -0.1 -0.6 0 0.5 0.8 4}. There are 7! = 5,040 resamples of this type. Each of these permutations has the same mean, which is 0.51. This helps to explain why there is a visible peak in the distribution at the value of the sample mean.

The bootstrap distribution does not make any assumptions about the distribution of the data, nor does it use any asymptotic (large sample) assumptions. It uses only the observed data values.

Comparison with the usual bootstrap method

For larger samples, it is impractical to consider ALL possible samples of size N. Therefore, the usual bootstrap method generates B random samples (with replacement) for a large value of B. The resulting distribution is an approximate bootstrap distribution. The following SAS/IML statements perform the usual bootstrap analysis for B=10,000.

/* Generate B random resamples. Compute and analyze bootstrap distribution */
call randseed(123);
w = sample(S, {7, 10000});   /* 10,000 samples of size 7 (with replacement) */
means = w[,:];
title "Bootstrap Distribution of the Mean";
title2 "B = 10,000; N = 7";
call histogram(means) rebin={-1 0.1} xvalues=-1:4
     other="refline 0.51/axis=x lineattrs=(color=red);";

The resulting bootstrap distribution is similar in shape to the complete distribution, but only uses 10,0000 random samples instead of all possible samples. You can see that the properties of the second distribution (such as quantiles) will be similar to the quantiles of the complete distribution, even though the second distribution is based on many fewer bootstrap samples.

Summary

For a sample of size N, there are NN possible resamples of size N when you sample with replacement. The bootstrap distribution is based on a random sample of these resamples. When N is small, you can compute the exact bootstrap distribution by forming the Cartesian product of the sample with itself N times. This article computes the exact bootstrap distribution for the mean of seven observations and compares the exact distribution to an approximate distribution that is based on B random resamples. When B is large, the approximate distribution looks similar to the complete distribution.

For more about the bootstrap method and how to perform bootstrap analyses in SAS, see "The essential guide to bootstrapping in SAS."

Appendix: Solving the equation NN = y

This article asked, "what is the smallest value of N such that NN exceeds the value of MACBIG?" The claim is that N=144, but how can you prove it?

For any value of y > 1, you can use the Lambert W function to find the value of N that satisfies the equation NNy. Here's how to solve the equation:

  1. Take the natural log of both sides to get N*log(N) ≤ log(y).
  2. Define w = log(N) and x = log(y). Then the equation becomes w exp(w) ≤ x. The solution to this equation is found by using the Lambert W function: w = W(x).
  3. The solution to the original equation is N = exp(w). Since N is supposed to be an integer, round up or down, according to the application.

The following SAS/IML program solves for the smallest value of N such that NN exceeds the value of MACBIG in double-precision.

proc iml;
x = constant('LogBig');   /* x = log(MACBIG) */
w = LambertW(x);          /* solve for w such that w*exp(w) = x */
N1 = exp(w);              /* because N = log(w) */
N = ceil(N1);             /* round up so that N^N exceeds MACBIG */
print N1 N;
quit;

So, when N=144, NN is larger than the maximum representable double-precision value (MACBIG).

The post On the number of bootstrap samples appeared first on The DO Loop.

8月 302021
 

You can use the bootstrap method to estimate confidence intervals. Unlike formulas, which assume that the data are drawn from a specified distribution (usually the normal distribution), the bootstrap method does not assume a distribution for the data. There are many articles about how to use SAS to bootstrap statistics for univariate data and for data from a regression model. This article shows how to bootstrap the correlation coefficients in multivariate data. The main challenge is that the correlation coefficients are typically output in a symmetric matrix. To analyze the bootstrap distribution, you must convert the statistics from a "wide form" (a matrix) to a "long form" (a column), as described in a previous article.

To motivate the bootstrap method for correlations, recall that PROC CORR in SAS supports the FISHER option, which estimates confidence intervals for the correlation coefficients. However, the interval estimates are parametric: they assume that the pairs of variables are bivariate normal. How good are the estimates for real data? You can generate the bootstrap distribution for the correlation coefficients and compare the Fisher confidence intervals to the bootstrap estimates. As a preview of things to come, the graph to the right (click to expand) shows histograms of bootstrap estimates and the Fisher confidence intervals.

This article assumes that the reader is familiar with basic bootstrapping in SAS. If not, see the article, "Compute a bootstrap confidence interval in SAS."

Bootstrap multivariate data in SAS

The basic steps of the bootstrap algorithm are as follows:

  1. Compute a statistic for the original data.
  2. Use PROC SURVEYSELECT to resample (with replacement) B times from the data.
  3. Use BY-group processing to compute the statistic of interest on each bootstrap sample.
  4. Use the bootstrap distribution to obtain estimates of bias and uncertainty, such as confidence intervals.

Let's use a subset of Fisher's Iris data for this example. There are four variables to analyze, so you have to estimate confidence intervals for 4*3/2 = 6 correlation coefficients. For this example, let's use the FISHER option to estimate the confidence intervals. We can then compute the proportion of bootstrap estimates that fall within those confidence intervals, thereby estimating the coverage probability of the Fisher intervals? Does a 95% Fisher interval have 95% coverage for these data? Let's find out!

Step 1: Compute statistics on the original data

The following SAS statements subset the Sashelp.Iris data and call PROC CORR to estimate Fisher's confidence intervals for the six correlation coefficients:

data sample;   /* N = 50 */
   set Sashelp.Iris(where=(Species="Versicolor"));
   label PetalLength= PetalWidth= SepalLength= SepalWidth=;
run;
 
/* STEP 1: Compute statistics on the original data */
%let varNames = PetalLength PetalWidth SepalLength SepalWidth;
%let numVars=4;
proc corr data=sample fisher(biasadj=no) noprob plots=matrix;
   var &varNames;
   ods output FisherPearsonCorr=FPCorr(keep=Var WithVar NObs Corr Lcl Ucl);
run;
 
proc print data=FPCorr noobs; run;

The output shows the estimates for the 95% Fisher confidence intervals for the correlation coefficients. Remember, these estimates assume bivariate normality. Let's generate a bootstrap distribution for the correlation statistics and examine how well the Fisher intervals capture the sampling variation among the bootstrap estimates.

Steps 2 and 3: Bootstrap samples and the bootstrap distribution

As shown in other articles, you can use PROC SURVEYSELECT in SAS to generate B bootstrap samples of the data. Each "bootstrap sample" is a sample (with replacement) from the original data. You can then use the BY statement in PROC CORR to obtain B correlation matrices, each estimated from one of the bootstrap samples:

/* STEP 2: Create B=10,000 bootstrap samples */
/* Bootstrap samples */
%let NumSamples = 10000;       /* number of bootstrap resamples */
proc surveyselect data=sample NOPRINT seed=12345
     out=BootSamples
     method=urs              /* resample with replacement */
     samprate=1              /* each bootstrap sample has N observations */
     /* OUTHITS                 option to suppress the frequency var */
     reps=&NumSamples;       /* generate NumSamples bootstrap resamples */
run;
 
/* STEP 3. Compute the statistic for each bootstrap sample */
proc corr data=BootSamples noprint outp=CorrOut;
   by Replicate;
   freq NumberHits;
   var &varNames;
run;

For many bootstrap analyses, you can proceed directly to the next step, which is to summarize the bootstrap distribution. However, in this case, the statistics in the CorrOut data set are not in a convenient format. The statistics are stored in matrices, as shown by the following call to PROC PRINT, which prints the correlation estimates for the first two bootstrap samples:

/* look at estimates for first two bootstrap samples */
proc print data=CorrOut noobs;
   where _Type_="CORR" & Replicate<=2;
run;

To analyze the distribution of the correlation estimates, it is useful to reshape the data, as described in a previous article. The article shows how to obtain only the values in the cells that are highlighted in yellow. The article also shows how to add labels that identify the pairs of variables. The following DATA step reshapes the data and adds labels such as "Corr(X1,X2)":

/* Convert correlations in the OUTP= data set from wide to long form. See
   https://blogs.sas.com/content/iml/2021/08/25/symmetric-matrix-wide-to-long.html */
data BootLong;
set CorrOut(where=(_Type_="CORR") rename=(_NAME_=Var1));
length Var2 $32 Label $13;
array X[*] &varNames;
Row = 1 + mod(_N_-1, &NumVars);
do Col = Row+1 to &NumVars;       /* names for columns */
   Var2 = vname(X[Col]);
   Corr = X[Col];
   Label = cats("Corr(X",Row,",X",Col,")");
   output;
end;
drop _TYPE_ &varNames;
run;

The bootstrap estimates are now in a "long format," which makes it easier to analyze and visualize the bootstrap distribution of the estimates.

The next section combines the original estimates and the bootstrap estimates. To facilitate combining the data sets, you need to add the Label column to the data set that contains the Fisher intervals. This was also described in the previous article. Having a common Label variable enables you to merge the two data sets.

/* add labels to the Fisher statistics */
data FisherCorr;
length Label $13;
set FPCorr;
retain row col 1;
if col>=&NumVars then do; row+1; col=row; end;
col+1;
Label = cats("Corr(X",row,",X",col,")");
run;

Step 4: Analyze the bootstrap distribution

You can graph the bootstrap distribution for each pair of variables by using the SGPANEL procedure. The following statements create a basic panel:

proc sgpanel data=BootLong;
   panelby Label;
   histogram Corr;
run;

However, a goal of this analysis is to compare the Fisher confidence intervals (CIs) to the bootstrap distribution of the correlation estimates. Therefore, it would be nice to overlay the original correlation estimates and Fisher intervals on the histograms of the bootstrap distributions. To do that, you can concatenate the BootLong and the FisherCorr data sets, as follows:

/* STEP 4: Analyze the bootstrap distribution */
/* overlay the original sample estimates and the Fisher confidence intervals */
data BootLongCI;
set BootLong FisherCorr(rename=(Var=Var1 WithVar=Var2 Corr=Estimate));
label Corr="Correlation Estimate";
run;
 
title "Bootstrap Correlation Estimates (B=&NumSamples)";
title2 "Overlay Fisher Confidence Intervals";
proc sgpanel data=BootLongCI;
   panelby Label / novarname;
   histogram Corr;
   refline Estimate / axis=X lineattrs=(color=red);
   refline LCL / axis=X ;
   refline UCL / axis=X ;
run;

The graph is shown at the top of this article. In the graph, the red lines indicate the original estimates of correlation from the data. The gray vertical lines show the Fisher CIs for the data. The histograms show the distribution of the correlation estimates on 10,000 bootstrap re-samples of the data.

Visually, it looks like the Fisher confidence intervals do a good job of estimating the variation in the correlation estimates. It looks like most of the bootstrap estimates are contained in the 95% CIs.

The coverage of Fisher's confidence intervals

You can use the bootstrap estimates to compute the empirical coverage probability of the Fisher CIs. That is, for each pair of variables, what proportion of the bootstrap estimates (out of 10,000) are within the Fisher intervals? If all pairs of variables were bivariate normal, the expected proportion would be 95%.

You can use the logical expression (Corr>=LCL & Corr<=UCL) to create a binary indicator variable. The proportion of 1s for this variable equals the proportion of bootstrap estimates that are inside a Fisher interval. The following DATA step creates the indicator variable and the call to PROC FREQ counts the proportion of bootstrap estimates that are in the Fisher interval.

/* how many bootstrap estimates are inside a Fisher interval? */
proc sort data=BootLong; by Label; run;
 
data Pctl;
merge BootLong FisherCorr(rename=(Corr=Estimate));
by Label;
label Corr="Correlation Estimate";
InLimits = (Corr>=LCL & Corr<=UCL);
run;
 
proc freq data=Pctl noprint;
   by Label;
   tables InLimits / nocum binomial(level='1');
   output out=Est binomial;
run;
proc print data=Est noobs;
   var Label N _BIN_ L_BIN U_BIN;
run;

The table shows empirical estimates for the coverage probability of the Fisher intervals. The _BIN_ column contains estimates of the binary proportion. The L_BIN and U_BIN columns contain confidence intervals for the proportions. You can see that all intervals include more than 90% of the bootstrap estimates, and some are close to 95% coverage. The following graph visualizes the empirical coverage for each pair of variables:

ods graphics / width=400px height=240px;
title "Proportion of Bootstrap Estimates in Fisher Intervals"; 
proc sgplot data=Est noautolegend;
   scatter x=_BIN_ y=Label / xerrorlower=L_Bin xerrorupper=U_Bin;
   refline 0.95 / axis=x;
   xaxis  max=1 grid;
   yaxis grid display=(nolabel) offsetmin=0.1 offsetmax=0.1;
run;

The graph indicates that the many Fisher intervals are too wide (too conservative). Their empirical coverage is greater than the 95% coverage that you would expect for bivariate normal data. On the other hand, the Fisher interval for Corr(X2,X4) has lower coverage than expected.

Summary

This article shows how to perform a bootstrap analysis for correlation among multiple variables. The process of resampling the data (PROC SURVEYSELECT) and generating the bootstrap estimates (PROC CORR with a BY statement) is straightforward. However, to analyze the bootstrap distributions, you need to restructure the statistics by converting the output data from a wide form to a long form. You can use this technique to perform your own bootstrap analysis of correlation coefficients.

Part of the analysis also used the bootstrap distributions to estimate the coverage probability of the Fisher confidence intervals for these data. The Fisher intervals have 95% coverage for normally distributed data. For the iris data, most of the intervals have coverage that is greater than expected, although one interval has lower-than-expected coverage. These results are strongly dependent on the data.

The post Bootstrap correlation coefficients in SAS appeared first on The DO Loop.

6月 072021
 

For many univariate statistics (mean, median, standard deviation, etc.), the order of the data is unimportant. If you sort univariate data, the mean and standard deviation do not change. However, you cannot sort an individual variable (independently) if you want to preserve its relationship with other variables. This statement is both obvious and profound. For example, the correlation between two variables X and Y depends on the (X,Y) pairs. If you sort Y independently and call the vector of sorted observations Z, then corr(X,Z) is typically not equal to corr(X,Y), even though Y and Z contain exactly the same data values.

This observation is the basis behind permutation tests for paired data values. In a permutation test, you repeatedly change the order of one of the variables independently of the other variable(s). The goal of a permutation test is to determine whether the original statistic is likely to be observed in a random pairing of the data values.

This article looks at how sorting one variable (independently of another) changes the correlation between two variables. You can use this technique to construct a new vector that has the same values (marginal distribution) but almost any correlation you want with the other vector.

A simple permutation: A sorted vector of values

The Sashelp.Class data set in SAS contains observations about the height, weight, age, and gender of 19 students. The Height and Weight variables are highly correlated (R = 0.88), as shown by the following call to PROC CORR in Base SAS:

data class;
set Sashelp.Class;
keep Height Weight;
run;
 
proc corr data=Class nosimple;
   var Height Weight;
run;

The output from PROC CORR computes the sample correlation between the Height and Weight variables as 0.88. The p-value is highly significant (p < .0001), which indicates that it would be rare to observe a statistic this large if the heights and weights were sampled randomly from an uncorrelated bivariate distribution.

Consider the following experiment: What happens if you sort the second variable and look at the correlation between the first variable and the sorted copy of the second variable? The following program runs the experiment twice, first with Weight sorted in increasing order, then again with Weight sorted in decreasing order:

/* sort only the second variable: First increasing order, then decreasing order */
proc sort data=Class(keep=Weight) out=Weight1(rename=(Weight=WeightIncr));
   by Weight;
run;
proc sort data=Class(keep=Weight) out=Weight2(rename=(Weight=WeightDecr));
   by descending Weight;
run;
 
/* add the sorted variables to the original data */
data Class2;
   merge Class Weight1 Weight2;
run;
/* what is the correlation between the first variable and permutations of the second variable? */
proc corr data=Class2 nosimple;
   var Weight WeightIncr WeightDecr;
   with Height;
run;

The output shows that the original variables were highly correlated. However, if you sort the weight variable in increasing order (WeightIncr), the sorted values have 0.06 correlation with the height values. This correlation is not significantly different from 0. Similarly, if you sort the Weight variable in decreasing order (WeightDecr), the sorted values have -0.09 correlation with the Height variable. This correlation is also not significantly different from 0.

It is not a surprise that permuting the values of the second variable changes the correlation with the first variable. After all, correlation measures how the variables co-vary. For the original data, large Height values tend to be paired with high Weight values. Same with the small values. Changing the order of the Weight variable destroys that relationship.

A permutation test for correlation

The experiment in the previous section demonstrates how a permutation test works for two variables. You start with pairs of observations and a statistic computed for those pairs (such as correlation). You want to know whether the statistic that you observe is significantly different than you would observe in random pairings of those same variables. To run a permutation test, you permute the values of the second variable, which results in a random pairing. You look at the distribution of values of the statistic when the values are paired randomly. You ask whether the original statistic is likely to have resulted from a random pairing of the values. If not, you reject the null hypothesis that the values are paired randomly.

You can use SAS/IML software and the RANPERM function to create a permutation test for correlation, as follows:

proc iml;
use Class2;
   read all var "Height" into X;
   read all var "Weight" into Y;
close;
 
start CorrCoeff(X, Y, method="Pearson");
   return ( corr(X||Y, method)[1,2] );   /* CORR returns a matrix; extract the correlation of the pair */
finish;
 
/* permutation test for correlation between two variables */
call randseed(12345);        /* set a random number seed */
B = 10000;                   /* total number of permutations */
PCorr = j(B, 1, .);          /* allocate vector for Pearson correlations */
PCorrObs = CorrCoeff(X, Y);  /* correlation for original data */
PCorr[1] = PCorrObs; 
do i = 2 to B;
   permY = ranperm(Y)`;      /* permute order of Y */
   PCorr[i] = CorrCoeff(X, permY); /* correlation between X and permuted Y */
end;

The previous program computes a vector that has B=10,000 correlation coefficients. The first element in the vector is the observed correlation between X and Y. The remaining elements are the correlations between X and randomly permuted copies of Y. The following statement estimates the p-value for the null hypothesis that X and Y have zero correlation. The two-sided p-value is computed by counting the proportion of the permuted data vectors whose correlation with X is at least as great (in magnitude) as the observed statistic. You can also use a histogram to visualize the distribution of the correlation coefficient under the null hypothesis:

pVal = sum( abs(PCorr)>= abs(PCorrObs) ) / B;   /* proportion of permutations with extreme corr */
print pVal;
 
title "Permutation Test with H0:corr(X,Y)=0";
title2 "Distribution of Pearson Correlation (B=1000)";
refStmt = "refline " + char(PCorrObs) + " / axis=x lineattrs=(color=red);";
call histogram(PCorr) other=refStmt label="Pearson Correlation";

The graph and p-value give the same conclusion: It is unlikely that the observed statistic (0.88) would arise from a random pairing of the X and Y data values. The conclusion is that the observed correlation between X and Y is significantly different from zero. This conclusion agrees with the original PROC CORR analysis.

Find a vector that has a specified correlation with X

Every correlation coefficient in the previous histogram is the result of a permutation of Y. Each permutation has the same data values (marginal distribution) as Y, but a different correlation with X. An interesting side-effect of the permutation method is that you can specify a value of a correlation coefficient that is in the histogram (such as -0.3) and find a vector Z such that Z has the same marginal distribution as Y but corr(X,Z) is the specified value (approximately). The following SAS/IML statements use random permutations to find a data vector with those properties:

/* Note: We can find a vector Z that has the same
   values as Y but has (almost) any correlation we want! 
   Choose the target correlation from the histogram, such as r=-0.3.
   Then find a vector Z that has the same values as Y but has corr(X,Z) = -0.3 */
targetR = -0.3;  
isMatch = 0;
do i = 1 to B while(isMatch=0);   /* to avoid an infinite loop, limit search to B permutations */
   Z = ranperm(Y)`;          /* permute order of Y */
   r = CorrCoeff(X, Z);      /* correlation between X and permuted Y */
   isMatch = ( abs(r - targetR) < 0.01 );  /* is correlation close to target? */
end;
 
print r;
title "X versus Permutation of Y";
call scatter(X,Z);

The scatter plot shows the bivariate distribution of X and Z, where Z has the same data values as Y, but the correlation with X is close to -0.3. You can choose a different target correlation (such as +0.25) and repeat the experiment. If you do not specify a target correlation that is too extreme (such as greater than 0.6), this method should find a data vector whose correlation with X is close to the specified value.

Notice that not all correlations are possible. There are at most n! permutations of Y, where n is the number of observations. Thus, a permutation test involves at most n! distinct correlation coefficients.

Summary and further reading

This article shows that permuting the data for one variable (Y) but not another (X) destroys the multivariate relationship between X and Y. This is the basis for permutations tests. In a permutation test, the null hypothesis is that the variables are not related. You test that hypothesis by using random permutations to generate fake data for Y that are unrelated to X. By comparing the observed statistic to the distribution of the statistic for the fake data, you can reject the null hypothesis.

An interesting side-effect of this process is that you can create a new vector of data values that has the same marginal distribution as the original data but has a different relationship with X. In a future article, I will show how to exploit this fact to generate simulated data that have a specified correlation.

For more information about permutation tests in SAS/IML, see

The post Permutation tests and independent sorting of data appeared first on The DO Loop.

1月 202021
 

This is the third and last introductory article about how to bootstrap time series in SAS. In the first article, I presented the simple block bootstrap and discussed why bootstrapping a time series is more complicated than for regression models that assume independent errors. Briefly, when you perform residual resampling on a time series, you need to resample in blocks to simulate the autocorrelation in the original series. In the second article, I introduced the moving-block bootstrap. For both methods, all blocks are the same size, and the block size must evenly divide the length of the series (n).

In contrast, the stationary block bootstrap uses blocks of random lengths. This article describes the stationary block bootstrap and shows how to implement it in SAS.

The "stationary" bootstrap gets its name from Politis and Romano (1992), who developed it. When applied to stationary data, the resampled pseudo-time series is also stationary.

Choosing the length of a block

In the moving-block bootstrap, the starting location for a block is chosen randomly, but all blocks have the same length. For the stationary block bootstrap, both the starting location and the length of the block are chosen at random.

The distribution for the block length is typically chosen from a geometric distribution with parameter p. If you want the average block length to be L, define p = 1/L because the expected value of a geometric distribution with parameter p is 1/p.

Why do we choose lengths from the geometric distribution? Well, because a length from the geometric distribution is easy to interpret in terms of a discrete process. Imagine performing the following process:

  • Choose an initial location, t, at random.
  • With probability p, choose a different location (at random) for the next residual and start a new block.
  • With probability (1-p), choose the location t+1 for the next residual and add that residual to the current block.
By definition, this process generates blocks whose lengths are geometrically distributed: L ~ Geom(p).

Generate a random block

There are several ways to implement the stationary block bootstrap in SAS. A straightforward method is to generate a starting integer, t, uniformly at random in the range [1, n], where n is the length of the series. Then, choose a length, L ~ Geom(p). If t + L exceeds n, truncate the block. (Some practitioners "wrap" the block to include residuals at the beginning of the series. This alternative method is called the circular block bootstrap.) Continue this process until the total length of the blocks is n. You will usually need to truncate the length of the last block so that the sum of the lengths is exactly n. This process is implemented in the following SAS/IML user-defined functions:

/* STATIONARY BLOCK BOOTSTRAP */
/* Use a random starting point, t, and a random block length, L ~ Geom(p), 
   - probability p of starting a new block at a new position
   - probability (1-p) of using the next residual from the current block
*/
%let p = (1/12);                /* expected block length is 1/p */ 
proc iml;
/* generate indices for a block of random length L~Geom(p)
   chosen from the sequence 1:n */
start RandBlockIdx(n, p);
   beginIdx = randfun(1, "Integer", n); /* random start: 1:n */
   L = randfun(1, "Geom", p);           /* random length */
   endIdx = min(beginIdx+L-1, n);       /* truncate into 1:n */
   return( beginIdx:endIdx );           /* return sequence of indices in i:n */
finish;
 
/* generate indices for one stationary bootstrap resample */
start StationaryIdx(n, p);
   bootIdx = j(n,1,.);
   s = 1;                       /* starting index for first block */
   do while( s <= n );
      idx = RandBlockIdx(n, p); /* get block of random length */
      if ncol(idx) > n-s then   /* truncate if block too long */
         idx = idx[, 1:n-s+1];
      L = ncol(idx);            /* length of this block */
      jdx = s:s+L-1;            /* all indices for this block */
      bootIdx[jdx] = idx;       /* assign this block */
      s = s + L;                /* starting index for next block */
   end;
   return bootIdx;
finish;

Let's run these functions on some sample data. The first article in this series shows how to fit a model to the Sashelp.Air data and write the predicted and residual values. The following statements use the stationary block bootstrap to generate one bootstrap sample of length n from the residuals. When you add the bootstrapped residuals to the predicted values, you get a "new" time series, which is graphed:

use OutReg;                     /* data to bootstrap: Y = Pred + Resid */
   read all var {'Time' 'Pred' 'Resid'};
close;
 
call randseed(1234);
n = nrow(Pred);                 /* length of series */
prob = &p;                      /* probability of jumping to a new block */
 
/*---- Visualization Example: generate one bootstrap resample ----*/
idx = StationaryIdx(n, prob);
YBoot = Pred + Resid[idx];
 
/* visualize the places where a sequence is not ascending:
https://blogs.sas.com/content/iml/2013/08/28/finite-diff-estimate-maxi.html */
blocksIdx = loc(dif(idx) <= 0); /* where is the sequence not ascending? */
b = Time[blocksIdx];
 
title "One Bootstrap Resample";
title2 "Stationary Block Bootstrap";
refs = "refline " + char(b) + " / axis=x;";
call series(Time, YBoot) other=refs;
/*---- End of visualization example ----*/

The graph shows a single bootstrap sample. I have added vertical reference lines to show the blocks of residuals. For this random sample, there are nine blocks whose lengths vary between 2 (the last block) and 28. If you choose another random sample, you will get a different set of blocks that have different lengths.

The stationary block bootstrap

In practice, the bootstrap method requires many bootstrap resamples. You can call the StationaryIdx function repeatedly in a loop to generate B=1000 bootstrap samples, as follows:

/* A complete stationary block bootstrap generates B resamples
   and usually writes the resamples to a SAS data set. */
B = 1000;
SampleID = j(n,1,.);
create BootOut var {'SampleID' 'Time' 'YBoot'};  /* create outside of loop */
do i = 1 to B;
   SampleId[,] = i;   /* fill array. See https://blogs.sas.com/content/iml/2013/02/18/empty-subscript.html */
   idx = StationaryIdx(n, prob);
   YBoot = Pred + Resid[idx];
   append;
end;
close BootOut;

The BootOut data set contains all the bootstrap samples. You can use a BY-group analysis to obtain the bootstrap estimates for almost any statistic. For example, the following scatter plot shows the distribution of 1000 bootstrap estimates of the slope and intercept parameters for an autoregressive model of the series. Reference lines indicate the estimates for the original data.

You can download the complete SAS program that performs all computations and creates all graphs in this article.

Summary

This article is the third of a series. It shows how to perform a stationary block bootstrap by randomly sampling blocks of residuals. Whereas the simple-block and the moving-block bootstrap methods use blocks that have a constant length, the stationary block bootstrap chooses blocks of random lengths.

In researching these blog posts, I read several articles. I recommend Bates (2019) for an introduction and Politis and Romano (1994) for details.

The post The stationary block bootstrap in SAS appeared first on The DO Loop.

1月 132021
 

As I discussed in a previous article, the simple block bootstrap is a way to perform a bootstrap analysis on a time series. The first step is to decompose the series into additive components: Y = Predicted + Residuals. You then choose a block length (L) that divides the total length of the series (n). Each bootstrap resample is generated by randomly choosing from among the non-overlapping n/L blocks of residuals, which are added to the predicted model.

The simple block bootstrap is not often used in practice. One reason is that the total number of blocks (k=n/L) is often small. If so, the bootstrap resamples do not capture enough variation for the bootstrap method to make correct inferences. This article describes a better alternative: the moving block bootstrap. In the moving block bootstrap, every block has the same block length but the blocks overlap. The following figure illustrates the overlapping blocks when L=3. The indices 1:L define the first block of residuals, the indices 2:L+1 define the second block, and so forth until the last block, which contains the residuals n-L+1:n.

To form a bootstrap resample, you randomly choose k=n/L blocks (with replacement) and concatenate them. You then add these residuals to the predicted values to create a "new" time series. Repeat the process many times and you have constructed a batch of bootstrap resamples. The process of forming one bootstrap sample is illustrated in the following figure. In the figure, the time series has been reshaped into a k x L matrix, where each row is a block.

The moving block bootstrap in SAS

To demonstrate the moving block bootstrap in SAS, let's use the same data that I analyzed in the previous article about the simple block bootstrap. The previous article extracted 132 observations from the Sashelp.Air data set and used PROC AUTOREG to form an additive model Predicted + Residuals. The OutReg data set contains three variables of interest: Time, Pred, and Resid.

As before, I will choose the block size to be L=12. The following SAS/IML program reads the data and defines a matrix (R) such that the i_th row contains the residuals with indices i:i+L-1. In total, the matrix R has n-L+1 rows.

/* MOVING BLOCK BOOTSTRAP */
%let L = 12;
proc iml;
call randseed(12345);
 
use OutReg;
   read all var {'Time' 'Pred' 'Resid'};
close;
 
/* Restriction for Simple Block Bootstrap: 
   The length of the series (n) must be divisible by the number of blocks (k)
   so that all blocks have the same length (L) */
n = nrow(Pred);          /* length of series */
L = &L;                  /* length of each block */
k = n / L;               /* number of random blocks to use */
if k ^= int(k) then 
   ABORT "The series length is not divisible by the block length";
 
/* Trick: Reshape data into k x L matrix. Each row is block of length L */
P = shape(Pred, k, L);   /* there are k rows for Pred */
J = n - L + 1;           /* total number of overlapping blocks to choose from */
R = j(J, L, .);          /* there are n-L+1 blocks of residuals */
Resid = rowvec(Resid);   /* make Resid into row vector so we don't need to transpose each row */
do i = 1 to J;
   R[i,] = Resid[ , i:i+L-1]; /* fill each row with a block of residuals */
end;

With this setup, the formation of bootstrap resamples is almost identical to the program in the previous article. The only difference is that the matrix R for the moving block bootstrap has more rows. Nevertheless, each resample is formed by randomly choosing k rows from R and adding them to a block of predicted values. The following statements generate B=1000 bootstrap resamples, which are written to a SAS data set (BootOut). The program writes the Time variable, the resampled series (YBoot), and an ID variable that identifies each bootstrap sample.

/* The moving block bootstrap repeats this process B times
   and usually writes the resamples to a SAS data set. */
B = 1000;
SampleID = j(n,1,.);
create BootOut var {'SampleID' 'Time' 'YBoot'};  /* create outside of loop */
do i = 1 to B;
   SampleId[,] = i;
   idx = sample(1:J, k);  /* sample of size k from the set 1:J */
   YBoot = P + R[idx,];
   append;
end;
close BootOut;
QUIT;

The BootOut data set contains B=1000 bootstrap samples. The rest of the bootstrap analysis is exactly the same as in the previous article.

Summary

This article shows how to perform a moving block bootstrap on a time series in SAS. First, you need to decompose the series into additive components: Y = Predicted + Residuals. You then choose a block length (L), which must divide the total length of the series (n), and form the n-L+1 overlapping blocks of residuals. Each bootstrap resample is generated by randomly choosing blocks of residuals and adding them to the predicted model. This article uses the SAS/IML language to perform the simple block bootstrap in SAS.

You can download the SAS program that shows the complete analysis for both the simple block and moving block bootstrap.

The post The moving block bootstrap for time series appeared first on The DO Loop.