1月 232021
 

It’s been a full eight years since HBR declared that data scientists hold the sexiest job of the 21st century. Yet, here we are in 2021, and the work of data scientists is still found outside the spotlight in many settings. Younger data scientists, in particular, can find it hard [...]

5 data scientists offer advice on improving your visibility was published on SAS Voices by Alison Bolen

1月 202021
 

Editor's note: This blog post is the first in a series of posts, originally published here by our partner News Literacy Project, exploring the role of data in understanding our world. Like infographics, social media and other forms of user-generated content pose unique challenges regarding data. Many news outlets and journalists have checks and balances [...]

Beware of data shared via social media - get the facts was published on SAS Voices by Jen Sabourin

1月 202021
 

This is the third and last introductory article about how to bootstrap time series in SAS. In the first article, I presented the simple block bootstrap and discussed why bootstrapping a time series is more complicated than for regression models that assume independent errors. Briefly, when you perform residual resampling on a time series, you need to resample in blocks to simulate the autocorrelation in the original series. In the second article, I introduced the moving-block bootstrap. For both methods, all blocks are the same size, and the block size must evenly divide the length of the series (n).

In contrast, the stationary block bootstrap uses blocks of random lengths. This article describes the stationary block bootstrap and shows how to implement it in SAS.

The "stationary" bootstrap gets its name from Politis and Romano (1992), who developed it. When applied to stationary data, the resampled pseudo-time series is also stationary.

Choosing the length of a block

In the moving-block bootstrap, the starting location for a block is chosen randomly, but all blocks have the same length. For the stationary block bootstrap, both the starting location and the length of the block are chosen at random.

The distribution for the block length is typically chosen from a geometric distribution with parameter p. If you want the average block length to be L, define p = 1/L because the expected value of a geometric distribution with parameter p is 1/p.

Why do we choose lengths from the geometric distribution? Well, because a length from the geometric distribution is easy to interpret in terms of a discrete process. Imagine performing the following process:

  • Choose an initial location, t, at random.
  • With probability p, choose a different location (at random) for the next residual and start a new block.
  • With probability (1-p), choose the location t+1 for the next residual and add that residual to the current block.
By definition, this process generates blocks whose lengths are geometrically distributed: L ~ Geom(p).

Generate a random block

There are several ways to implement the stationary block bootstrap in SAS. A straightforward method is to generate a starting integer, t, uniformly at random in the range [1, n], where n is the length of the series. Then, choose a length, L ~ Geom(p). If t + L exceeds n, truncate the block. (Some practitioners "wrap" the block to include residuals at the beginning of the series. This alternative method is called the circular block bootstrap.) Continue this process until the total length of the blocks is n. You will usually need to truncate the length of the last block so that the sum of the lengths is exactly n. This process is implemented in the following SAS/IML user-defined functions:

/* STATIONARY BLOCK BOOTSTRAP */
/* Use a random starting point, t, and a random block length, L ~ Geom(p), 
   - probability p of starting a new block at a new position
   - probability (1-p) of using the next residual from the current block
*/
%let p = (1/12);                /* expected block length is 1/p */ 
proc iml;
/* generate indices for a block of random length L~Geom(p)
   chosen from the sequence 1:n */
start RandBlockIdx(n, p);
   beginIdx = randfun(1, "Integer", n); /* random start: 1:n */
   L = randfun(1, "Geom", p);           /* random length */
   endIdx = min(beginIdx+L-1, n);       /* truncate into 1:n */
   return( beginIdx:endIdx );           /* return sequence of indices in i:n */
finish;
 
/* generate indices for one stationary bootstrap resample */
start StationaryIdx(n, p);
   bootIdx = j(n,1,.);
   s = 1;                       /* starting index for first block */
   do while( s <= n );
      idx = RandBlockIdx(n, p); /* get block of random length */
      if ncol(idx) > n-s then   /* truncate if block too long */
         idx = idx[, 1:n-s+1];
      L = ncol(idx);            /* length of this block */
      jdx = s:s+L-1;            /* all indices for this block */
      bootIdx[jdx] = idx;       /* assign this block */
      s = s + L;                /* starting index for next block */
   end;
   return bootIdx;
finish;

Let's run these functions on some sample data. The first article in this series shows how to fit a model to the Sashelp.Air data and write the predicted and residual values. The following statements use the stationary block bootstrap to generate one bootstrap sample of length n from the residuals. When you add the bootstrapped residuals to the predicted values, you get a "new" time series, which is graphed:

use OutReg;                     /* data to bootstrap: Y = Pred + Resid */
   read all var {'Time' 'Pred' 'Resid'};
close;
 
call randseed(1234);
n = nrow(Pred);                 /* length of series */
prob = &p;                      /* probability of jumping to a new block */
 
/*---- Visualization Example: generate one bootstrap resample ----*/
idx = StationaryIdx(n, prob);
YBoot = Pred + Resid[idx];
 
/* visualize the places where a sequence is not ascending:
https://blogs.sas.com/content/iml/2013/08/28/finite-diff-estimate-maxi.html */
blocksIdx = loc(dif(idx) <= 0); /* where is the sequence not ascending? */
b = Time[blocksIdx];
 
title "One Bootstrap Resample";
title2 "Stationary Block Bootstrap";
refs = "refline " + char(b) + " / axis=x;";
call series(Time, YBoot) other=refs;
/*---- End of visualization example ----*/

The graph shows a single bootstrap sample. I have added vertical reference lines to show the blocks of residuals. For this random sample, there are nine blocks whose lengths vary between 2 (the last block) and 28. If you choose another random sample, you will get a different set of blocks that have different lengths.

The stationary block bootstrap

In practice, the bootstrap method requires many bootstrap resamples. You can call the StationaryIdx function repeatedly in a loop to generate B=1000 bootstrap samples, as follows:

/* A complete stationary block bootstrap generates B resamples
   and usually writes the resamples to a SAS data set. */
B = 1000;
SampleID = j(n,1,.);
create BootOut var {'SampleID' 'Time' 'YBoot'};  /* create outside of loop */
do i = 1 to B;
   SampleId[,] = i;   /* fill array. See https://blogs.sas.com/content/iml/2013/02/18/empty-subscript.html */
   idx = StationaryIdx(n, prob);
   YBoot = Pred + Resid[idx];
   append;
end;
close BootOut;

The BootOut data set contains all the bootstrap samples. You can use a BY-group analysis to obtain the bootstrap estimates for almost any statistic. For example, the following scatter plot shows the distribution of 1000 bootstrap estimates of the slope and intercept parameters for an autoregressive model of the series. Reference lines indicate the estimates for the original data.

You can download the complete SAS program that performs all computations and creates all graphs in this article.

Summary

This article is the third of a series. It shows how to perform a stationary block bootstrap by randomly sampling blocks of residuals. Whereas the simple-block and the moving-block bootstrap methods use blocks that have a constant length, the stationary block bootstrap chooses blocks of random lengths.

In researching these blog posts, I read several articles. I recommend Bates (2019) for an introduction and Politis and Romano (1994) for details.

The post The stationary block bootstrap in SAS appeared first on The DO Loop.

1月 182021
 

Recommended soundtrack for this blog post: Netflix Trip by AJR.

This week's news confirms what I already knew: The Office was the most-streamed television show of 2020. According to reports that I've seen, the show was streamed for 57 billion minutes during this extraordinary year. I'm guessing that's in part because we've all been shut in and working from home; we crave our missing office interactions. We lived vicariously (and perhaps dysfunctionally) through watching Dunder Mifflin staff. But another major factor was the looming deadline of the departure of The Office from Netflix as of January 1, 2021. It was a well-publicized event, so Netflix viewers had to get their binge on while they could.

People in my house are fans of the show, and they account for nearly 6,000 of those 57 billion streaming minutes. I can be this precise (nerd alert!) because I'm in the habit of analyzing our Netflix activity by using SAS. In fact, I can tell you that since late 2017, we've streamed 576 episodes of The Office. We streamed 297 episodes in 2020. (Since the show has only 201 episodes we clearly we have a few repeats in there.)

I built a heatmap that shows the frequency and intensity of our streaming of this popular show. In this graph each row is a month, each square is a day. White squares are Office-free. A square with any red indicates at least one virtual visit with the Scranton crew; the darker the shade, the more episodes streamed during that day. You can see that Sept 15, 2020 was a particular big binge with 17 episodes. (Each episode is about 20-21 minutes, so it's definitely achievable.)

netflix trip through The Office

Heatmap of our household streaming of The Office

How to build the heatmap

To build this heatmap, I started with my Netflix viewing history (downloaded from my Netflix account as CSV files). I filtered to just "The Office (U.S.)" titles, and then merged with a complete "calendar" of dates between late 2017 and the start of 2021. Summarized and merged, the data looks something like this:


With all of the data summarized in this way such that there is only one observation per X and Y value, I can use the HEATMAPPARM statement in PROC SGPLOT to visualize it. (If I needed the procedure to summarize/bin the data for me, I would use the HEATMAP statement. Thanks to Rick Wicklin for this tip!)

proc sgplot data=ofc_viewing;
 title height=2.5 "The Office - a Netflix Journey";
 title2 height=2 "&episodes. episodes streamed on &days. days, over 3 years";
 label Episodes="Episodes per day";
 format monyear monyy7.;
 heatmapparm x=day y=monyear 
   colorresponse=episodes / x2axis
    outline
   colormodel=(white  CXfcae91 CXfb6a4a CXde2d26 CXa50f15) ;
 yaxis  minor reverse display=(nolabel) 
  values=(&amp;allmon.)
  ;
 x2axis values=(1 to 31 by 1) 
   display=(nolabel)  ;
run;

You can see the full code -- with all of the data prep -- on my GitHub repository here. You may even run the code in your own SAS environment -- it will fetch my Netflix viewing data from another GitHub location where I've stashed it.

Distribution of Seasons (not "seasonal distribution")

If you examine the heatmap I produced, you can almost see our Office enthusiasm in three different bursts. These relate directly to our 3 children and the moments they discovered the show. First was early 2018 (middle child), then late 2019 (youngest child), then late 2020 (oldest child, now 22 years old, striving to catch up).

The Office ran for 9 seasons, and our kids have their favorite seasons and episodes -- hence the repeated viewings. I used PROC FREQ to show the distribution of episode views across the seasons:


Season 1 is remarkably low for two reasons. First and most importantly, it contains the fewest episodes. Second, many viewers agree that Season 1 is the "cringiest" content, and can be uncomfortable to watch. (This Reddit user leaned into the cringe with his data visualization of "that's what she said" jokes.)

From the data (and from listening to my kids), I know that Season 2 is a favorite. Of the 60 episodes we streamed at least 4 times, 19 of them were in Season 2.

More than streaming, it's an Office lifestyle

Office fandom goes beyond just watching the show. Our kids continue to embrace The Office in other mediums as well. We have t-shirts depicting the memes for "FALSE." and "Schrute Farms." We listen to The Office Ladies podcast, hosted by two stars of the show. In 2018 our daughter's Odyssey of the Mind team created a parody skit based on The Office (a weather-based office named Thunder Mifflin) -- and advanced to world finals.

Rarely does a day go by without some reference to an iconic phrase or life lesson that we gleaned from The Office. We're grateful for the shared experience, and we'll miss our friends from the Dunder Mifflin Paper Company.

The post Visualizing our Netflix Trip through <em>The Office</em> appeared first on The SAS Dummy.

1月 182021
 

What if you had a technology solution that creates a real-time link between the customer demand signal and what's happening on the ground? What if plans that are being steered centrally could  finally be connected to every shipping lane, while simultaneously, creating cost saving carrier adjustments? The first-of-its kind integration [...]

SAS and C.H. Robinson are rewriting the rules of transportation planning and management was published on SAS Voices by Charlie Chase

1月 182021
 

Have you ever heard of the DOLIST syntax? You might know the syntax even if you are not familiar with the name. The DOLIST syntax is a way to specify a list of numerical values to an option in a SAS procedure. Applications include:

  • Specify the end points for bins of a histogram
  • Specify percentiles to be output to a data set
  • Specify tick marks for a custom axis on a graph
  • Specify the location of reference lines on a graph
  • Specify a list of parameters for an algorithm. Examples include smoothing parameters (the SMOOTH= option in PROC LOESS), sample sizes (the NTOTAL= option in PROC POWER), and initial guess for parameters in an optimization (the PARMS statement in PROC NLMIXED and PROC NLIN)

This article demonstrates how to use the DOLIST syntax to specify a list of values in SAS procedures. It shows how to use a single statement to specify individual values and also a sequence of values.

The DOLIST syntax enables you to write a single statement that specifies individual values and one or more sequences of values. The DOLIST syntax should be in the toolbox of every intermediate-level SAS programmer!

The DOLIST syntax in the SAS DATA step

According to the documentation of PROC POWER, the syntax described in this article is sometimes called the DOLIST syntax because it is based on the syntax for the iterative DO loop in the DATA step.

The most common syntax for a DO loop is DO x = start TO stop BY increment. For example, DO x = 10 TO 90 BY 20; iterates over the sequence of values 10, 30, 50, 70, and 90. If the increment is 1, you can omit the BY increment portion of the statement. However, you can also specify values as a common-separated list, such as DO x = 10, 30, 50, 70, 90;, which generates the same values. What you might not know is that you can combine these two methods. For example, in the following DATA step, the values are specified by using two comma-separated lists and three sequences. For clarity, I have placed each list on a separate line, but that is not necessary:

/* the DOLIST syntax for a DO loop in the DATA step */
data A;
do pctl = 5,                  /* individual value(s)  */
          10 to 50 by 20,     /* a sequence of values */
          54.3, 69.1,         /* individual value(s)  */
          80 to 90 by 5,      /* another sequence     */
          60 to 40 by -20;    /* yet another sequence */
   output;
end;
run;
 
proc print; run;

The output (not shown) is a list of values: 5, 10, 30, 50, 54.3, 69.1, 80, 85, 90, 60, 40. Notice that the values do not need to be in sorted order, although they often are.

The expressions to the right of the equal sign are what I mean by the "DOLIST syntax." You can use the same syntax to specify a list of options in many SAS procedures. When the SAS documentation says that an option takes a "list of values," you can often use a comma-separated list, a space-separated list, and the syntax start TO stop BY increment. (Or a combination of these expressions!) The following sections provide a few examples, but there are literally hundreds of options in SAS that support the DOLIST syntax!

Some procedures (for example, PROC SGPLOT) require the DOLIST values to be in parentheses. Consequently, I have adopted the convention of always using parentheses around DOLIST values, even if the parentheses are not strictly required. As far as I know, it is never wrong to put the DOLIST inside parentheses, and it keeps me from having to remember whether parentheses are required. The examples in this article all use parentheses to enclose DOLIST values.

Histogram bins and percentiles

You can use the DOLIST syntax to specify the endpoints of bins in a histogram. For example, in PROC UNIVARIATE, the ENDPOINTS= option in the HISTOGRAM statement supports a DOLIST. Because histograms use evenly spaced bins, usually you will specify only one sequence, as follows:

proc univariate data=sashelp.cars;
   var weight;
   histogram weight / endpoints=(1800 to 7200 by 600);   /* DOLIST sequence expression */
run;

You can also use the DOLIST syntax to specify percentiles. For example, the PCTLPTS= option on the OUTPUT statement enables you to specify which percentiles of the data should be written to a data set:

proc univariate data=sashelp.cars;
   var MPG_City;
   output out=UniOut pctlpre=P_  pctlpts=(50 75, 95 to 100 by 2.5);  /* DOLIST */
run;

Notice that this example specifies both individual percentiles (50 and 75) and a sequence of percentiles (95, 97.5, 100).

Tick marks and reference lines

The SGPLOT procedure enables you to specify the locations of tick marks on the axis of a graph. Most of the time you will specify an evenly spaced set of values, but (just for fun) the following example shows how you can use the DOLIST syntax to combine evenly spaced values and a few custom values:

title "Specify Ticks on the Y Axis";
proc sgplot data=sashelp.cars;
   scatter x=Weight y=Mpg_City;
   yaxis grid values=(10 to 40 by 5, 50 60); /* DOLIST; commas optional */
run;

As shown in the previous example, the GRID option on the XAXIS and YAXIS statements enables you to display reference lines at each tick location. However, sometimes you want to display reference lines independently from the tick marks. In that case, you can use the REFLINE statement, as follows:

title "Many Reference Lines";
proc sgplot data=sashelp.cars;
   scatter x=Weight y=MPG_City;
   refline (1800 to 6000 by 600, 7000) / axis=x;  /* many reference lines */
run;

Statistical procedures

Many statistical procedures have options that support lists. In most cases, you can use the DOLIST syntax to provide values for the list.

I have already written about how to use the DOLIST syntax to specify initial guesses for the PARM statement in PROC NLMIXED and PROC NLIN. The documentation for the POWER procedure discusses how to specify lists of values and uses the term "DOLIST" in its discussion.

Some statistical procedures enable you to specify multiple parameter values, and the analysis is repeated for each parameter in the list. One example is the SMOOTH= option in the MODEL statement of the LOESS procedure. The SMOOTH= option specifies values of the loess smoothing parameter. The following call to PROC LOESS fits four loess smoothers to the data. The call to PROC SGPLOT overlays the smoothers on a scatter plot of the data:

proc loess data=sashelp.cars plots=none;
   model MPG_City = Weight / smooth=(0.1 to 0.5 by 0.2, 0.75); /* value-list */
   output out=LoessOut P=Pred;
run;
 
proc sort data=LoessOut; by SmoothingParameter Weight; run;
proc sgplot data=LoessOut;
   scatter x=Weight y=MPG_City / transparency=0.9;
   series x=Weight y=Pred / group=SmoothingParameter 
          curvelabel curvelabelpos=min;
run;

Summary

In summary, this article describes the DOLIST syntax in SAS, which enables you to simultaneously specify individual values and evenly spaced sequences of values. A sequence is specified by using the start TO step BY increment syntax. The DOLIST syntax is valid in many SAS procedures and in the DATA step. In some procedures (such as PROC SGPLOT), the syntax needs to be inside parentheses. For readability, you can use commas to separate individual values and sequences.

Many SAS procedures accept the special syntax even if it is not explicitly mentioned in the documentation. In the documentation for an option, look for terms such as value-list or numlist or value-1 <...value-n>, which indicate that the option supports the DOLIST syntax.

The post The DOLIST syntax: Specify a list of numerical values in SAS appeared first on The DO Loop.

1月 162021
 

The management of the COVID-19 vaccination program is one of the most complex tasks attempted in modern history.  Even without the added complications of administering the vaccine during a pandemic, the race to vaccinate the populations who need it most all while maintaining the necessary cold-storage protocols, meeting double dose [...]

3 ways analytics can improve vaccine distribution and administration was published on SAS Voices by Steve Kearney

1月 132021
 

Running SAS programs in parallel reduces run time

As earth completes its routine annual circle around the sun and a new (and hopefully better) year kicks in, it is a perfect occasion to reflect on the idiosyncrasy of time.

While it is customary to think that 3+2=5, it is only true in sequential world. In parallel world, however, 3+2=3. Think about it: if you have two SAS programs one of which runs 3 hours, and the second one runs 2 hours, their total duration will be 5 hours if you run them one after another sequentially, but it will take only 3 hours if you run them simultaneously, in parallel.

I am sure you remember those “filling up a swimming pool” math problems from elementary school. They clearly and convincingly demonstrate that two pipes will fill up a swimming pool faster than one. That’s the power of running water in parallel.

The same principle of parallel processing (or parallel computing) is applicable to SAS programs (or non-SAS programs) by running their different independent pieces in separate SAS sessions at the same time (in parallel).  Divide and conquer.

You might be surprised at how easily this can be done, and at the same time how powerful it is. Let’s take a look.

SAS/CONNECT

MP CONNECT) parallel processing functionality was added to SAS/CONNECT enabling you to execute multiple SAS sessions asynchronously. When a remote SAS session kicks off asynchronously, a portion of your SAS program is sent to the server session for execution and control is immediately returned to the client session. The client session can continue with its own processing or spawn one or more additional asynchronous remote server sessions.

Running programs in parallel on a single machine

Sometimes, what comes across as new is just well forgotten old. They used to be Central Processing Units (CPU), but now they are called just processors. Nowadays, practically every single computer is a “multi-machine” (or to be precise “multi-processor”) device. Even your laptop. Just open Task Manager (Ctrl-Alt-Delete), click on the Performance tab and you will see how many physical processors (or cores) and logical processors your laptop has:

Parallel processing on a single machine

That means that this laptop can run eight independent SAS processes (sessions) at the same time. All you need to do is to say nicely “Dear Mr. & Mrs. SAS/CONNECT, my SAS program consists of several independent pieces. Would you please run each piece in its own SAS session, and run them all at the same time?” And believe me, SAS/CONNECT does not care how many logical processors you have, whether your logical processors are far away from each other “remote machines” or they are situated in a single laptop or even in a single chip.

Here is how you communicate your request to SAS/CONNECT in SAS language.

Spawning multiple SAS sessions using MP Connect

Suppose you have a SAS code that consists of several pieces – DATA or PROC steps that are independent of each other, i.e. they do not require to be run in a specific sequence. For example, each of the two pieces generates its own data set.

Then we can create these two data sets in two separate “remote” SAS sessions that run in parallel. Here is how you do this.  (For illustration purposes, I just create two dummy data sets.)

options sascmd="sas";
 
/* Current datetime */
%let _start_dt = %sysfunc(datetime());
 
/* Prosess 1 */
signon task1;
rsubmit task1 wait=no;
 
   libname SASDL 'C:\temp';
 
   data SASDL.DATA_A (keep=str);
      length str $1000;
      do i=1 to 1150000;
         str = '';
         do j=1 to 1000;
            str = cats(str,'A');
         end;
         output;
      end;
   run;
 
endrsubmit;
 
/* Process 2 */
signon task2;
rsubmit task2 wait=no;
 
   libname SASDL 'C:\temp';
 
   data SASDL.DATA_B (keep=str);
      length str $1000;
      do i=1 to 750000;
         str = '';
         do j=1 to 1000;
            str = cats(str,'B');
         end;
         output;
      end;
   run;
 
endrsubmit;
 
waitfor _all_;
signoff _all_;
 
/* Print total duration */
data _null_;
   dur = datetime() - &_start_dt;
   put 30*'-' / ' TOTAL DURATION:' dur time13.2 / 30*'-';
run;

In this code, the key elements are:

SIGNON Statement - initiates a connection between a client session and a server session.

ENDRSUBMIT statement - marks the end of a block of statements that a client session submits to a server session for execution.

SIGNOFF Statement - ends the connection between a client session and a server session.

Parallel processing vs. threaded processing

There is a distinction between parallel processing described above and threaded processing (aka multithreading). Parallel processing is achieved by running several independent SAS sessions, each processing its own unit of SAS code.

Threaded processing, on the other hand, is achieved by developing special algorithms and implementing executable codes that run on multiple processors (threads) within the same SAS session. Amdahl's law, which provides theoretical background and estimation of potential time saving achievable by parallel computing on multiple processors.

Passing information to and from “remote” SAS sessions

Besides passing pieces of SAS code from client sessions to server sessions, MP CONNECT allows you to pass some other SAS objects.

Passing data library definitions

For example, if you have a data library defined in your client session, you may pass that library definition on to multiple server sessions without re-defining them in each server session.

Let’s say you have two data libraries defined in your client session:

libname SRCLIB oracle user=myusr1 password=mypwd1 path=mysrv1;
libname TGTLIB '/sas/data/datastore1';

In order to make these data libraries available in the remote session all you need is to add

rsubmit task1 wait=no inheritlib=(SRCLIB TGTLIB);

This will allow libraries that are defined in the client session to be inherited by and available in the server session. As an option, each client libref can be associated with a libref that is named differently in the server session:

rsubmit task1 wait=no inheritlib=(SRCLIB=NEWSRC TGTLIB=NEWTGT);

Passing macro variables from client to server session

options sascmd="sas";
%let run_dt = %sysfunc(datetime());
signon task1;
%syslput rem_run_dt=&run_dt / remote=task1;
rsubmit task1 wait=no;
   %put &=rem_run_dt;
endrsubmit;
 
waitfor task1;
signoff task1;

Passing macro variables from server to client session

Similarly,

  • %SYSRPUT_USER_ </LIKE=‘character-string’>;
  • (/LIKE=<‘character-string’ >specifies a subset of macro variables whose names match a user-specified character sequence, or pattern.)

    Here is a code example that passes two macro variables, rem_start and rem_year from the remote session and outputs them to the SAS log in the client session:

    options sascmd="sas";
    signon task1;
    rsubmit task1 wait=no;
       %let start_dt = %sysfunc(datetime());
       %sysrput rem_start=&start_dt;
       %sysrput rem_year=2021;
    endrsubmit;
     
    waitfor task1;
    signoff task1;
    %put &=rem_start &=rem_year;

    Summary

    SAS’ Multi-Process Connect is a simple and efficient tool enabling parallel execution of independent programming units. Compared to sequential processing of time-intensive programs, it allows to substantially reduce overall duration of your program execution.

    Additional resources

    Running SAS programs in parallel using SAS/CONNECT® was published on SAS Users.

    1月 132021
     

    As I discussed in a previous article, the simple block bootstrap is a way to perform a bootstrap analysis on a time series. The first step is to decompose the series into additive components: Y = Predicted + Residuals. You then choose a block length (L) that divides the total length of the series (n). Each bootstrap resample is generated by randomly choosing from among the non-overlapping n/L blocks of residuals, which are added to the predicted model.

    The simple block bootstrap is not often used in practice. One reason is that the total number of blocks (k=n/L) is often small. If so, the bootstrap resamples do not capture enough variation for the bootstrap method to make correct inferences. This article describes a better alternative: the moving block bootstrap. In the moving block bootstrap, every block has the same block length but the blocks overlap. The following figure illustrates the overlapping blocks when L=3. The indices 1:L define the first block of residuals, the indices 2:L+1 define the second block, and so forth until the last block, which contains the residuals n-L+1:n.

    To form a bootstrap resample, you randomly choose k=n/L blocks (with replacement) and concatenate them. You then add these residuals to the predicted values to create a "new" time series. Repeat the process many times and you have constructed a batch of bootstrap resamples. The process of forming one bootstrap sample is illustrated in the following figure. In the figure, the time series has been reshaped into a k x L matrix, where each row is a block.

    The moving block bootstrap in SAS

    To demonstrate the moving block bootstrap in SAS, let's use the same data that I analyzed in the previous article about the simple block bootstrap. The previous article extracted 132 observations from the Sashelp.Air data set and used PROC AUTOREG to form an additive model Predicted + Residuals. The OutReg data set contains three variables of interest: Time, Pred, and Resid.

    As before, I will choose the block size to be L=12. The following SAS/IML program reads the data and defines a matrix (R) such that the i_th row contains the residuals with indices i:i+L-1. In total, the matrix R has n-L+1 rows.

    /* MOVING BLOCK BOOTSTRAP */
    %let L = 12;
    proc iml;
    call randseed(12345);
     
    use OutReg;
       read all var {'Time' 'Pred' 'Resid'};
    close;
     
    /* Restriction for Simple Block Bootstrap: 
       The length of the series (n) must be divisible by the number of blocks (k)
       so that all blocks have the same length (L) */
    n = nrow(Pred);          /* length of series */
    L = &L;                  /* length of each block */
    k = n / L;               /* number of random blocks to use */
    if k ^= int(k) then 
       ABORT "The series length is not divisible by the block length";
     
    /* Trick: Reshape data into k x L matrix. Each row is block of length L */
    P = shape(Pred, k, L);   /* there are k rows for Pred */
    J = n - L + 1;           /* total number of overlapping blocks to choose from */
    R = j(J, L, .);          /* there are n-L+1 blocks of residuals */
    Resid = rowvec(Resid);   /* make Resid into row vector so we don't need to transpose each row */
    do i = 1 to J;
       R[i,] = Resid[ , i:i+L-1]; /* fill each row with a block of residuals */
    end;

    With this setup, the formation of bootstrap resamples is almost identical to the program in the previous article. The only difference is that the matrix R for the moving block bootstrap has more rows. Nevertheless, each resample is formed by randomly choosing k rows from R and adding them to a block of predicted values. The following statements generate B=1000 bootstrap resamples, which are written to a SAS data set (BootOut). The program writes the Time variable, the resampled series (YBoot), and an ID variable that identifies each bootstrap sample.

    /* The moving block bootstrap repeats this process B times
       and usually writes the resamples to a SAS data set. */
    B = 1000;
    SampleID = j(n,1,.);
    create BootOut var {'SampleID' 'Time' 'YBoot'};  /* create outside of loop */
    do i = 1 to B;
       SampleId[,] = i;
       idx = sample(1:J, k);  /* sample of size k from the set 1:J */
       YBoot = P + R[idx,];
       append;
    end;
    close BootOut;
    QUIT;

    The BootOut data set contains B=1000 bootstrap samples. The rest of the bootstrap analysis is exactly the same as in the previous article.

    Summary

    This article shows how to perform a moving block bootstrap on a time series in SAS. First, you need to decompose the series into additive components: Y = Predicted + Residuals. You then choose a block length (L), which must divide the total length of the series (n), and form the n-L+1 overlapping blocks of residuals. Each bootstrap resample is generated by randomly choosing blocks of residuals and adding them to the predicted model. This article uses the SAS/IML language to perform the simple block bootstrap in SAS.

    You can download the SAS program that shows the complete analysis for both the simple block and moving block bootstrap.

    The post The moving block bootstrap for time series appeared first on The DO Loop.