1月 162021
 

The management of the COVID-19 vaccination program is one of the most complex tasks attempted in modern history.  Even without the added complications of administering the vaccine during a pandemic, the race to vaccinate the populations who need it most all while maintaining the necessary cold-storage protocols, meeting double dose [...]

3 ways analytics can improve vaccine distribution and administration was published on SAS Voices by Steve Kearney

1月 132021
 

Running SAS programs in parallel reduces run time

As earth completes its routine annual circle around the sun and a new (and hopefully better) year kicks in, it is a perfect occasion to reflect on the idiosyncrasy of time.

While it is customary to think that 3+2=5, it is only true in sequential world. In parallel world, however, 3+2=3. Think about it: if you have two SAS programs one of which runs 3 hours, and the second one runs 2 hours, their total duration will be 5 hours if you run them one after another sequentially, but it will take only 3 hours if you run them simultaneously, in parallel.

I am sure you remember those “filling up a swimming pool” math problems from elementary school. They clearly and convincingly demonstrate that two pipes will fill up a swimming pool faster than one. That’s the power of running water in parallel.

The same principle of parallel processing (or parallel computing) is applicable to SAS programs (or non-SAS programs) by running their different independent pieces in separate SAS sessions at the same time (in parallel).  Divide and conquer.

You might be surprised at how easily this can be done, and at the same time how powerful it is. Let’s take a look.

SAS/CONNECT

MP CONNECT) parallel processing functionality was added to SAS/CONNECT enabling you to execute multiple SAS sessions asynchronously. When a remote SAS session kicks off asynchronously, a portion of your SAS program is sent to the server session for execution and control is immediately returned to the client session. The client session can continue with its own processing or spawn one or more additional asynchronous remote server sessions.

Running programs in parallel on a single machine

Sometimes, what comes across as new is just well forgotten old. They used to be Central Processing Units (CPU), but now they are called just processors. Nowadays, practically every single computer is a “multi-machine” (or to be precise “multi-processor”) device. Even your laptop. Just open Task Manager (Ctrl-Alt-Delete), click on the Performance tab and you will see how many physical processors (or cores) and logical processors your laptop has:

Parallel processing on a single machine

That means that this laptop can run eight independent SAS processes (sessions) at the same time. All you need to do is to say nicely “Dear Mr. & Mrs. SAS/CONNECT, my SAS program consists of several independent pieces. Would you please run each piece in its own SAS session, and run them all at the same time?” And believe me, SAS/CONNECT does not care how many logical processors you have, whether your logical processors are far away from each other “remote machines” or they are situated in a single laptop or even in a single chip.

Here is how you communicate your request to SAS/CONNECT in SAS language.

Spawning multiple SAS sessions using MP Connect

Suppose you have a SAS code that consists of several pieces – DATA or PROC steps that are independent of each other, i.e. they do not require to be run in a specific sequence. For example, each of the two pieces generates its own data set.

Then we can create these two data sets in two separate “remote” SAS sessions that run in parallel. Here is how you do this.  (For illustration purposes, I just create two dummy data sets.)

options sascmd="sas";
 
/* Current datetime */
%let _start_dt = %sysfunc(datetime());
 
/* Prosess 1 */
signon task1;
rsubmit task1 wait=no;
 
   libname SASDL 'C:\temp';
 
   data SASDL.DATA_A (keep=str);
      length str $1000;
      do i=1 to 1150000;
         str = '';
         do j=1 to 1000;
            str = cats(str,'A');
         end;
         output;
      end;
   run;
 
endrsubmit;
 
/* Process 2 */
signon task2;
rsubmit task2 wait=no;
 
   libname SASDL 'C:\temp';
 
   data SASDL.DATA_B (keep=str);
      length str $1000;
      do i=1 to 750000;
         str = '';
         do j=1 to 1000;
            str = cats(str,'B');
         end;
         output;
      end;
   run;
 
endrsubmit;
 
waitfor _all_;
signoff _all_;
 
/* Print total duration */
data _null_;
   dur = datetime() - &_start_dt;
   put 30*'-' / ' TOTAL DURATION:' dur time13.2 / 30*'-';
run;

In this code, the key elements are:

SIGNON Statement - initiates a connection between a client session and a server session.

ENDRSUBMIT statement - marks the end of a block of statements that a client session submits to a server session for execution.

SIGNOFF Statement - ends the connection between a client session and a server session.

Parallel processing vs. threaded processing

There is a distinction between parallel processing described above and threaded processing (aka multithreading). Parallel processing is achieved by running several independent SAS sessions, each processing its own unit of SAS code.

Threaded processing, on the other hand, is achieved by developing special algorithms and implementing executable codes that run on multiple processors (threads) within the same SAS session. Amdahl's law, which provides theoretical background and estimation of potential time saving achievable by parallel computing on multiple processors.

Passing information to and from “remote” SAS sessions

Besides passing pieces of SAS code from client sessions to server sessions, MP CONNECT allows you to pass some other SAS objects.

Passing data library definitions

For example, if you have a data library defined in your client session, you may pass that library definition on to multiple server sessions without re-defining them in each server session.

Let’s say you have two data libraries defined in your client session:

libname SRCLIB oracle user=myusr1 password=mypwd1 path=mysrv1;
libname TGTLIB '/sas/data/datastore1';

In order to make these data libraries available in the remote session all you need is to add

rsubmit task1 wait=no inheritlib=(SRCLIB TGTLIB);

This will allow libraries that are defined in the client session to be inherited by and available in the server session. As an option, each client libref can be associated with a libref that is named differently in the server session:

rsubmit task1 wait=no inheritlib=(SRCLIB=NEWSRC TGTLIB=NEWTGT);

Passing macro variables from client to server session

options sascmd="sas";
%let run_dt = %sysfunc(datetime());
signon task1;
%syslput rem_run_dt=&run_dt / remote=task1;
rsubmit task1 wait=no;
   %put &=rem_run_dt;
endrsubmit;
 
waitfor task1;
signoff task1;

Passing macro variables from server to client session

Similarly,

  • %SYSRPUT_USER_ </LIKE=‘character-string’>;
  • (/LIKE=<‘character-string’ >specifies a subset of macro variables whose names match a user-specified character sequence, or pattern.)

    Here is a code example that passes two macro variables, rem_start and rem_year from the remote session and outputs them to the SAS log in the client session:

    options sascmd="sas";
    signon task1;
    rsubmit task1 wait=no;
       %let start_dt = %sysfunc(datetime());
       %sysrput rem_start=&start_dt;
       %sysrput rem_year=2021;
    endrsubmit;
     
    waitfor task1;
    signoff task1;
    %put &=rem_start &=rem_year;

    Summary

    SAS’ Multi-Process Connect is a simple and efficient tool enabling parallel execution of independent programming units. Compared to sequential processing of time-intensive programs, it allows to substantially reduce overall duration of your program execution.

    Additional resources

    Running SAS programs in parallel using SAS/CONNECT® was published on SAS Users.

    1月 132021
     

    As I discussed in a previous article, the simple block bootstrap is a way to perform a bootstrap analysis on a time series. The first step is to decompose the series into additive components: Y = Predicted + Residuals. You then choose a block length (L) that divides the total length of the series (n). Each bootstrap resample is generated by randomly choosing from among the non-overlapping n/L blocks of residuals, which are added to the predicted model.

    The simple block bootstrap is not often used in practice. One reason is that the total number of blocks (k=n/L) is often small. If so, the bootstrap resamples do not capture enough variation for the bootstrap method to make correct inferences. This article describes a better alternative: the moving block bootstrap. In the moving block bootstrap, every block has the same block length but the blocks overlap. The following figure illustrates the overlapping blocks when L=3. The indices 1:L define the first block of residuals, the indices 2:L+1 define the second block, and so forth until the last block, which contains the residuals n-L+1:n.

    To form a bootstrap resample, you randomly choose k=n/L blocks (with replacement) and concatenate them. You then add these residuals to the predicted values to create a "new" time series. Repeat the process many times and you have constructed a batch of bootstrap resamples. The process of forming one bootstrap sample is illustrated in the following figure. In the figure, the time series has been reshaped into a k x L matrix, where each row is a block.

    The moving block bootstrap in SAS

    To demonstrate the moving block bootstrap in SAS, let's use the same data that I analyzed in the previous article about the simple block bootstrap. The previous article extracted 132 observations from the Sashelp.Air data set and used PROC AUTOREG to form an additive model Predicted + Residuals. The OutReg data set contains three variables of interest: Time, Pred, and Resid.

    As before, I will choose the block size to be L=12. The following SAS/IML program reads the data and defines a matrix (R) such that the i_th row contains the residuals with indices i:i+L-1. In total, the matrix R has n-L+1 rows.

    /* MOVING BLOCK BOOTSTRAP */
    %let L = 12;
    proc iml;
    call randseed(12345);
     
    use OutReg;
       read all var {'Time' 'Pred' 'Resid'};
    close;
     
    /* Restriction for Simple Block Bootstrap: 
       The length of the series (n) must be divisible by the number of blocks (k)
       so that all blocks have the same length (L) */
    n = nrow(Pred);          /* length of series */
    L = &L;                  /* length of each block */
    k = n / L;               /* number of random blocks to use */
    if k ^= int(k) then 
       ABORT "The series length is not divisible by the block length";
     
    /* Trick: Reshape data into k x L matrix. Each row is block of length L */
    P = shape(Pred, k, L);   /* there are k rows for Pred */
    J = n - L + 1;           /* total number of overlapping blocks to choose from */
    R = j(J, L, .);          /* there are n-L+1 blocks of residuals */
    Resid = rowvec(Resid);   /* make Resid into row vector so we don't need to transpose each row */
    do i = 1 to J;
       R[i,] = Resid[ , i:i+L-1]; /* fill each row with a block of residuals */
    end;

    With this setup, the formation of bootstrap resamples is almost identical to the program in the previous article. The only difference is that the matrix R for the moving block bootstrap has more rows. Nevertheless, each resample is formed by randomly choosing k rows from R and adding them to a block of predicted values. The following statements generate B=1000 bootstrap resamples, which are written to a SAS data set (BootOut). The program writes the Time variable, the resampled series (YBoot), and an ID variable that identifies each bootstrap sample.

    /* The moving block bootstrap repeats this process B times
       and usually writes the resamples to a SAS data set. */
    B = 1000;
    SampleID = j(n,1,.);
    create BootOut var {'SampleID' 'Time' 'YBoot'};  /* create outside of loop */
    do i = 1 to B;
       SampleId[,] = i;
       idx = sample(1:J, k);  /* sample of size k from the set 1:J */
       YBoot = P + R[idx,];
       append;
    end;
    close BootOut;
    QUIT;

    The BootOut data set contains B=1000 bootstrap samples. The rest of the bootstrap analysis is exactly the same as in the previous article.

    Summary

    This article shows how to perform a moving block bootstrap on a time series in SAS. First, you need to decompose the series into additive components: Y = Predicted + Residuals. You then choose a block length (L), which must divide the total length of the series (n), and form the n-L+1 overlapping blocks of residuals. Each bootstrap resample is generated by randomly choosing blocks of residuals and adding them to the predicted model. This article uses the SAS/IML language to perform the simple block bootstrap in SAS.

    You can download the SAS program that shows the complete analysis for both the simple block and moving block bootstrap.

    The post The moving block bootstrap for time series appeared first on The DO Loop.

    1月 112021
     

    From mental health to biodiversity, SAS is committed to using data for social innovation, committing our resources, analytics expertise and software to tackle global issues. It's a vital undertaking, and we can't do it alone -- strategic partnerships are key to these efforts. Here's a quick overview of our top [...]

    Top 8 ways SAS used data for good in 2020 was published on SAS Voices by Lee Ellen Harmer

    1月 112021
     

    On The DO Loop blog, I write about a diverse set of topics, including statistical data analysis, machine learning, statistical programming, data visualization, simulation, numerical analysis, and matrix computations. In a previous article, I presented some of my most popular blog posts from 2020. The most popular articles often deal with elementary or familiar topics that are useful to almost every data analyst.

    However, among last year's 100+ articles are many that discuss advanced topics. Did you make a New Year's resolution to learn something new this year? Here is your chance! The following articles were fun to write and deserve a second look.

    Machine learning concepts

    Relationship between a threshold value and true/false negatives and positives

    Statistical smoothers

    Bilinear interpolation of 12 data values

    I write a lot about scatter plot smoothers, which are typically parametric or nonparametric regression models. But a SAS customer wanted to know how to get SAS to perform various classical interpolation schemes such as linear and cubic interpolations:

    SAS Viya and parallel computing

    SAS is devoting tremendous resources to SAS Viya, which offers a modern analytic platform that runs in the cloud. One of the advantages of SAS Viya is the opportunity to take advantage of distributed computational resources. In 2020, I wrote a series of articles that demonstrate how to use the iml action in Viya 3.5 to implement custom parallel algorithms that use multiple nodes and threads on a cluster of machines. Whereas many actions in SAS Viya perform one and only one task, the iml action supports a general framework for custom, user-written, parallel computations:

    The map-reduce functionality in the iml action

    • The map-reduce paradigm is a two-step process for distributing a computation. Every thread runs a function and produces a result for the data that it sees. The results are aggregated and returned. The iml action supports the MAPREDUCE function, which implements the map-reduce paradigm.
    • The parallel-tasks paradigm is a way to run independent computations concurrently. The iml action supports the PARTASKS function, which implements the map-reduce paradigm.

    Simulation and visualization

    Decomposition of a convex polygon into triangles

    Generate random points in a polygon

    Your turn

    Did I omit one of your favorite blog posts from The DO Loop in 2020? If so, leave a comment and tell me what topic you found interesting or useful. And if you missed some of these articles when they were first published, consider subscribing to The DO Loop in 2021.

    The post Blog posts from 2020 that deserve a second look appeared first on The DO Loop.

    1月 072021
     

    This is the last of three posts on our hot-fix process, aimed at helping you better manage your SAS®9 environment through tips and best practices. The first two installments are linked here:

    Having a good understanding of the hot-fix process can help you keep your SAS environment running smoothly. This last installment aims to help you get on a schedule with your hot-fix installations and provides an example spreadsheet (available for download on GitHub) to manage hot fixes.

    Schedule hot fixes

    As an administrator, sometimes applying outstanding hot fixes can be a daunting task. However, the longer you wait, the worse your situation becomes—with a potentially unstable environment and a growing backlog of hot fixes to apply. With a little careful planning, the task can become routine and everyone involved will be much happier. The next sections outline a strategy for getting on a quarterly schedule.

    Run the SASHFADD tool

    The first step of getting on a quarterly schedule to apply hot fixes is to run the SAS Hot Fix Analysis, Download and Deployment (SASHFADD) Tool. For information about running this tool and analyzing the report it generates, see the first two installments in this blog series, The SAS Hot Fix Analysis, Download and Deployment (SASHFADD) Tool and Understanding the SAS Hot Fix Analysis, Download and Deployment Tool Report.

    Once you review the SASHFADD report, you will have a better understanding of what resources will be needed to apply the outstanding hot fixes. You also need to decide which philosophy of installing hot fixes you want to follow. For more information, see Which hot fixes should I apply?

    Coordinate with IT

    The second step is to coordinate the process with your IT department. Before you take the system offline to apply hot fixes, IT typically wants to do the following:

    • Perform a full system backup.
    • Check for scheduled jobs and make necessary adjustments.
    • Decide the best time to stop SAS services.
    • Evaluate how long the system will need to be offline.

    After the first session of applying your hot-fix backlog, all these tasks can run on a regular (preferably at least quarterly) schedule that won't require as much analysis time from IT.

    Communicate with end users

    Before you implement the plan that you and the IT department devised, you need to communicate with your end users. Let them know ahead of time (maybe by a week) when the outage will occur, what they need to do to prepare for it, and how long it will take. It's a best practice to perform the update outside of regular business hours.

    Reap the benefits

    When you follow a quarterly schedule of applying hot fixes, there are many benefits:

    • The administrator is more experienced from installing hot fixes regularly, so the process goes more smoothly and takes less time.
    • The IT department has an established process in place for backing up the system and taking it offline for the maintenance.
    • End users know what to expect and are not surprised by the outage.
    • The system runs better and is protected from vulnerabilities due to the regular schedule of updates.
    • There is less downtime because there are fewer hot fixes to install with each update.

    Manage hot fixes

    Applying hot fixes can often be a complicated process with multiple steps before and after you install them. So, a key aspect of successfully applying hot fixes is ensuring that you follow all the steps that are included in the SASHFADD report. A great tool for managing this complexity is a spreadsheet!

    Download one I created and customize it:

    SANDY'S SPREADSHEET | DOWNLOAD IT NOW

    This tool allows you to see and then check off (through highlighting, color coding, or notes) each of the steps to get the best results.

    Administrators will have different approaches to their spreadsheets. Mine, linked above, is the result of much trial and error. Here are the items that I keep track of in my spreadsheet:

    • Hot-fix number
    • Hot-fix dependencies
    • Pre-installation steps
    • Post-installation steps
    • Special notes

    Another benefit of the spreadsheet is that you can group steps together so that you can do them all at once. Here are some examples of when you can group steps to save time:

    • When performing the Rebuild Web Applications step, you can select the SAS® Marketing Automation, SAS® Marketing Optimization, and/or SAS® Deployment Manager web applications to be rebuilt in one iteration.
    • When performing the Deploy Web Applications step, you can select the SAS® Marketing Automation, SAS® Marketing Optimization, and/or SAS® Deployment Manager web applications to be rebuilt in one iteration.

    Helpful resources

    See the following links for the detailed and thorough documentation:

    I hope that this blog series has been helpful to you! Have a terrific day!

    READ PART ONE | The SAS Hot Fix Analysis, Download and Deployment (SASHFADD) Tool READ PART TWO | Understanding the SAS Hot Fix Analysis, Download and Deployment Tool Report

    How to schedule and manage your SAS hot fixes was published on SAS Users.

    1月 062021
     

    For ordinary least squares (OLS) regression, you can use a basic bootstrap of the residuals (called residual resampling) to perform a bootstrap analysis of the parameter estimates. This is possible because an assumption of OLS regression is that the residuals are independent. Therefore, you can reshuffle the residuals to get each bootstrap sample.

    For a time series, the residuals are not independent. Rather, if you fit a model to the data, the residuals at time t+i are often close to the residual at time t for small values of i. This is known as autocorrelation in the error component. Accordingly, if you want to bootstrap the residuals of a time series, it is not correct to randomly shuffle the residuals, which would destroy the autocorrelation. Instead, you need to randomly choose a block of residuals (for example, at times t, t+1, ..., and t+L) and use those blocks of residuals to create bootstrap resamples. You repeatedly choose random blocks until you have enough residuals to create a bootstrap resample.

    There are several ways to choose blocks:

    • The simplest way is to choose from non-overlapping blocks of a fixed length, L. This is called the simple block bootstrap and is described in this article.
    • Only slightly more complicated is to allow the blocks of length L to overlap. This is called the moving block bootstrap and is described in a separate article.
    • A more complicated scheme (but which has superior statistical properties) is to allow the length of the blocks to vary. This is called the stationary block bootstrap and is described in a third article.

    Create the residuals

    There are many ways to fit a model to a time series and to obtain the model residuals. Trovero and Leonard (2018) discuss several modern methods to fit trends, cycles, and seasonality by using SAS 9.4 or SAS Viya. To get the residuals, you will want to fit an additive model. In this article, I will use the Sashelp.Air data and will fit a simple additive model (trend plus noise) by using the AUTOREG procedure in SAS/ETS software.

    The Sashelp.Air data set has 144 months of data. The following SAS DATA step drops the first year of data, which leaves 11 years of 12 months. I am doing this because I am going to use blocks of size L=12, and I think the example will be clearer if there are 11 blocks of size 12 (rather than 12 blocks).

    data Air;
    set Sashelp.Air;
    if Date >= '01JAN1950'd;   /* exclude first year of data */
    Time = _N_;                /* the observation number */
    run;
     
    title "Original Series: Air Travel";
    proc sgplot data=Air;
       series x=Time y=Air;
       xaxis grid; yaxis grid;
    run;

    The graph suggests that the time series has a linear trend. The following call to PROC AUTOREG fits a linear model to the data. The predicted mean and residuals are output to the OutReg data set as the PRED and RESID variables, respectively. The call to PROC SGPLOT overlays a graph of the trend and a graph of the residuals.

    /* Similar to Getting Started example in PROC AUTOREG */
    proc autoreg data=Air plots=none outest=RegEst;
       AR12: model Air = Time / nlag=12;
       output out=OutReg pm=Pred rm=Resid;  /* mean prediction and residuals */
       ods select FinalModel.ParameterEstimates ARParameterEstimates;
    run;
     
    title "Mean Prediction and Residuals from AR Model";
    proc sgplot data=OutReg;
       series x=Time y=Pred;
       series x=Time y=Resid;
       refline 0 / axis=y;
       xaxis values=(24 to 144 by 12) grid valueshint;
    run;

    The parameter estimates are shown for the linear model. On average, airlines carried an additional 2.8 thousand passengers per month during this time period. The graph shows the decomposition of the series into a linear trend and residuals. I added vertical lines to indicate the blocks of residuals that are used in the next section. The first block contains the residuals for times 13-24. The second block contains the residuals for times 25-36, and so forth until the 11th block, which contains the residuals for times 133-144.

    Set up the simple block bootstrap

    For the simple bootstrap, the length of the blocks (L) must evenly divide the length of the series (n), which means that k = n / L is an integer. Because I dropped the first year of observations from Sashelp.Air, there are n=132 observations. I will choose the block size to be L=12, which means that there are k=11 non-overlapping blocks.

    Each bootstrap resample is formed by randomly choosing k blocks (with replacement) and add those residuals to the predicted values. Think about putting the n predicted values and residuals into a matrix in row-wise order. The first L observations are in the first row, the next L are in the second row, and so forth. Thus, the matrix has k rows and L columns. The original series is of the form Predicted + Residuals, where the plus sign represents matrix addition. For the simple block bootstrap, each bootstrap resample is obtained by resampling the rows of the residual array and adding the rows together to obtain a new series of the form Predicted + (Random Residuals). This process is shown schematically in the following figure.

    You can use the SAS/IML language to implement the simple block bootstrap. The following call to PROC IML reads in the original predicted and residual values and reshapes then vectors into k x L matrices (P and R, respectively). The SAMPLE function generates a sample (with replacement) of the vector 1:k, which is used to randomly select rows of the R matrix. To make sure that the process is working as expected, you can create one bootstrap resample and graph it. It should resemble the original series:

    /* SIMPLE BLOCK BOOTSTRAP */
    %let L = 12;
    proc iml;
    call randseed(12345);
    /* the original series is Y = Pred + Resid */
    use OutReg;
       read all var {'Time' 'Pred' 'Resid'};
    close;
     
    /* For the Simple Block Bootstrap, the length of the series (n) 
       must be divisible by the block length (L). */
    n = nrow(Pred);          /* length of series */
    L = &L;                  /* length of each block */
    k = n / L;               /* number of non-overlapping blocks */
    if k ^= int(k) then 
       ABORT "The series length is not divisible by the block length";
     
    /* Trick: reshape data into k x L matrix. Each row is block of length L */
    P = shape(Pred, k, L);
    R = shape(Resid, k, L); /* non-overlapping residuals (also k x L) */
     
    /* Example: Generate one bootstrap resample by randomly selecting from the residual blocks */
    idx = sample(1:nrow(R), k);     /* sample (w/ replacement) of size k from the set 1:k */
    YBoot = P + R[idx,];
    title "One Bootstrap Resample";
    title2 "Simple Block Bootstrap";
    refs = "refline " + char(do(12,nrow(Pred),12)) + " / axis=x;";
    call series(Time, YBoot) other=refs;

    The graph shows one bootstrap resample. The residuals from arbitrary blocks are concatenated until there are n residuals. These are added to the predicted value to create a "new" series, which is a bootstrap resample. You can generate a large number of bootstrap resamples and use them to perform inferences for time series statistics.

    Implement the simple block bootstrap in SAS

    You can repeat the process in a loop to generate more resamples. The following statements generate B=1000 bootstrap resamples. These are written to a SAS data set (BootOut). The program uses a technique in which the results of each computation are immediately written to a SAS data set, which is very efficient. The program writes the Time variable, the resampled series (YBoot), and an ID variable that identifies each bootstrap sample.

    /* The simple block bootstrap repeats this process B times
       and usually writes the resamples to a SAS data set. */
    B = 1000;
    J = nrow(R);              /* J=k for non-overlapping blocks, but prepare for moving blocks */
    SampleID = j(n,1,.);
    create BootOut var {'SampleID' 'Time' 'YBoot'};  /* open data set outside of loop */
    do i = 1 to B;
       SampleId[,] = i;       /* fill array: https://blogs.sas.com/content/iml/2013/02/18/empty-subscript.html */
       idx = sample(1:J, k);  /* sample of size k from the set 1:k */
       YBoot = P + R[idx,];
       append;                /* append each bootstrap sample */
    end;
    close BootOut;
    QUIT;

    The BootOut data set contains B=1000 bootstrap samples. You can efficiently analyze the samples by using a BY statement. For example, suppose that you want to investigate how the parameter estimates for the trend line vary among the bootstrap samples. You can run PROC AUTOREG on each bootstrap sample by using BY-group processing. Be sure to suppress ODS output during the BY-group analysis, and write the desired statistics to an output data set (BootEst), as follows:

    /* Analyze the bootstrap samples by using a BY statement. See
       https://blogs.sas.com/content/iml/2012/07/18/simulation-in-sas-the-slow-way-or-the-by-way.html
    */
    proc autoreg data=BootOut plots=none outest=BootEst noprint;
       by SampleID;
       AR12: model YBoot = Time / nlag=12;
    run;
     
    /* OPTIONAL: Use PROC MEANS or PROC UNIVARIATE to estimate standard errors and CIs */
    proc means data=BootEst mean stddev P5 P95;
       var Intercept Time _A:;
    run;
     
    title "Distribution of Parameter Estimates";
    proc sgplot data=BootEst;
       scatter x=Intercept y=Time;
       xaxis grid; yaxis grid; 
       refline 77.5402 / axis=x;
       refline 2.7956  / axis=y;
    run;

    The scatter plot shows the bootstrap distribution of the parameter estimates of the linear trend. The reference lines indicate the parameter estimates for the original data. You can use the bootstrap distribution for inferential statistics such as estimation of standard errors, confidence intervals, the covariance of estimates, and more.

    You can perform a similar bootstrap analysis for any other statistic that is generated by any time series analysis. The important thing is that the block bootstrap is performed on some sort of residual or "noise" component, so be sure to remove the trend, seasonality, cycles, and so forth and then bootstrap the remainder.

    Summary

    This article shows how to perform a simple block bootstrap on a time series in SAS. First, you need to decompose the series into additive components: Y = Predicted + Residuals. You then choose a block length (L), which (for the simple block bootstrap) must divide the total length of the series (n). Each bootstrap resample is generated by randomly choosing blocks of residuals and adding them to the predicted model. This article uses the SAS/IML language to perform the simple block bootstrap in SAS.

    In practice, the simple block bootstrap is rarely used. However, it illustrates the basic ideas for bootstrapping a time series, and it provides a foundation for more sophisticated bootstrap methods.

    The post The simple block bootstrap for time series in SAS appeared first on The DO Loop.

    1月 062021
     

    For ordinary least squares (OLS) regression, you can use a basic bootstrap of the residuals (called residual resampling) to perform a bootstrap analysis of the parameter estimates. This is possible because an assumption of OLS regression is that the residuals are independent. Therefore, you can reshuffle the residuals to get each bootstrap sample.

    For a time series, the residuals are not independent. Rather, if you fit a model to the data, the residuals at time t+i are often close to the residual at time t for small values of i. This is known as autocorrelation in the error component. Accordingly, if you want to bootstrap the residuals of a time series, it is not correct to randomly shuffle the residuals, which would destroy the autocorrelation. Instead, you need to randomly choose a block of residuals (for example, at times t, t+1, ..., and t+L) and use those blocks of residuals to create bootstrap resamples. You repeatedly choose random blocks until you have enough residuals to create a bootstrap resample.

    There are several ways to choose blocks:

    • The simplest way is to choose from non-overlapping blocks of a fixed length, L. This is called the simple block bootstrap and is described in this article.
    • Only slightly more complicated is to allow the blocks of length L to overlap. This is called the moving block bootstrap and is described in a separate article.
    • A more complicated scheme (but which has superior statistical properties) is to allow the length of the blocks to vary. This is called the stationary block bootstrap and is described in a third article.

    Create the residuals

    There are many ways to fit a model to a time series and to obtain the model residuals. Trovero and Leonard (2018) discuss several modern methods to fit trends, cycles, and seasonality by using SAS 9.4 or SAS Viya. To get the residuals, you will want to fit an additive model. In this article, I will use the Sashelp.Air data and will fit a simple additive model (trend plus noise) by using the AUTOREG procedure in SAS/ETS software.

    The Sashelp.Air data set has 144 months of data. The following SAS DATA step drops the first year of data, which leaves 11 years of 12 months. I am doing this because I am going to use blocks of size L=12, and I think the example will be clearer if there are 11 blocks of size 12 (rather than 12 blocks).

    data Air;
    set Sashelp.Air;
    if Date >= '01JAN1950'd;   /* exclude first year of data */
    Time = _N_;                /* the observation number */
    run;
     
    title "Original Series: Air Travel";
    proc sgplot data=Air;
       series x=Time y=Air;
       xaxis grid; yaxis grid;
    run;

    The graph suggests that the time series has a linear trend. The following call to PROC AUTOREG fits a linear model to the data. The predicted mean and residuals are output to the OutReg data set as the PRED and RESID variables, respectively. The call to PROC SGPLOT overlays a graph of the trend and a graph of the residuals.

    /* Similar to Getting Started example in PROC AUTOREG */
    proc autoreg data=Air plots=none outest=RegEst;
       AR12: model Air = Time / nlag=12;
       output out=OutReg pm=Pred rm=Resid;  /* mean prediction and residuals */
       ods select FinalModel.ParameterEstimates ARParameterEstimates;
    run;
     
    title "Mean Prediction and Residuals from AR Model";
    proc sgplot data=OutReg;
       series x=Time y=Pred;
       series x=Time y=Resid;
       refline 0 / axis=y;
       xaxis values=(24 to 144 by 12) grid valueshint;
    run;

    The parameter estimates are shown for the linear model. On average, airlines carried an additional 2.8 thousand passengers per month during this time period. The graph shows the decomposition of the series into a linear trend and residuals. I added vertical lines to indicate the blocks of residuals that are used in the next section. The first block contains the residuals for times 13-24. The second block contains the residuals for times 25-36, and so forth until the 11th block, which contains the residuals for times 133-144.

    Set up the simple block bootstrap

    For the simple bootstrap, the length of the blocks (L) must evenly divide the length of the series (n), which means that k = n / L is an integer. Because I dropped the first year of observations from Sashelp.Air, there are n=132 observations. I will choose the block size to be L=12, which means that there are k=11 non-overlapping blocks.

    Each bootstrap resample is formed by randomly choosing k blocks (with replacement) and add those residuals to the predicted values. Think about putting the n predicted values and residuals into a matrix in row-wise order. The first L observations are in the first row, the next L are in the second row, and so forth. Thus, the matrix has k rows and L columns. The original series is of the form Predicted + Residuals, where the plus sign represents matrix addition. For the simple block bootstrap, each bootstrap resample is obtained by resampling the rows of the residual array and adding the rows together to obtain a new series of the form Predicted + (Random Residuals). This process is shown schematically in the following figure.

    You can use the SAS/IML language to implement the simple block bootstrap. The following call to PROC IML reads in the original predicted and residual values and reshapes then vectors into k x L matrices (P and R, respectively). The SAMPLE function generates a sample (with replacement) of the vector 1:k, which is used to randomly select rows of the R matrix. To make sure that the process is working as expected, you can create one bootstrap resample and graph it. It should resemble the original series:

    /* SIMPLE BLOCK BOOTSTRAP */
    %let L = 12;
    proc iml;
    call randseed(12345);
    /* the original series is Y = Pred + Resid */
    use OutReg;
       read all var {'Time' 'Pred' 'Resid'};
    close;
     
    /* For the Simple Block Bootstrap, the length of the series (n) 
       must be divisible by the block length (L). */
    n = nrow(Pred);          /* length of series */
    L = &L;                  /* length of each block */
    k = n / L;               /* number of non-overlapping blocks */
    if k ^= int(k) then 
       ABORT "The series length is not divisible by the block length";
     
    /* Trick: reshape data into k x L matrix. Each row is block of length L */
    P = shape(Pred, k, L);
    R = shape(Resid, k, L); /* non-overlapping residuals (also k x L) */
     
    /* Example: Generate one bootstrap resample by randomly selecting from the residual blocks */
    idx = sample(1:nrow(R), k);     /* sample (w/ replacement) of size k from the set 1:k */
    YBoot = P + R[idx,];
    title "One Bootstrap Resample";
    title2 "Simple Block Bootstrap";
    refs = "refline " + char(do(12,nrow(Pred),12)) + " / axis=x;";
    call series(Time, YBoot) other=refs;

    The graph shows one bootstrap resample. The residuals from arbitrary blocks are concatenated until there are n residuals. These are added to the predicted value to create a "new" series, which is a bootstrap resample. You can generate a large number of bootstrap resamples and use them to perform inferences for time series statistics.

    Implement the simple block bootstrap in SAS

    You can repeat the process in a loop to generate more resamples. The following statements generate B=1000 bootstrap resamples. These are written to a SAS data set (BootOut). The program uses a technique in which the results of each computation are immediately written to a SAS data set, which is very efficient. The program writes the Time variable, the resampled series (YBoot), and an ID variable that identifies each bootstrap sample.

    /* The simple block bootstrap repeats this process B times
       and usually writes the resamples to a SAS data set. */
    B = 1000;
    J = nrow(R);              /* J=k for non-overlapping blocks, but prepare for moving blocks */
    SampleID = j(n,1,.);
    create BootOut var {'SampleID' 'Time' 'YBoot'};  /* open data set outside of loop */
    do i = 1 to B;
       SampleId[,] = i;       /* fill array: https://blogs.sas.com/content/iml/2013/02/18/empty-subscript.html */
       idx = sample(1:J, k);  /* sample of size k from the set 1:k */
       YBoot = P + R[idx,];
       append;                /* append each bootstrap sample */
    end;
    close BootOut;
    QUIT;

    The BootOut data set contains B=1000 bootstrap samples. You can efficiently analyze the samples by using a BY statement. For example, suppose that you want to investigate how the parameter estimates for the trend line vary among the bootstrap samples. You can run PROC AUTOREG on each bootstrap sample by using BY-group processing. Be sure to suppress ODS output during the BY-group analysis, and write the desired statistics to an output data set (BootEst), as follows:

    /* Analyze the bootstrap samples by using a BY statement. See
       https://blogs.sas.com/content/iml/2012/07/18/simulation-in-sas-the-slow-way-or-the-by-way.html
    */
    proc autoreg data=BootOut plots=none outest=BootEst noprint;
       by SampleID;
       AR12: model YBoot = Time / nlag=12;
    run;
     
    /* OPTIONAL: Use PROC MEANS or PROC UNIVARIATE to estimate standard errors and CIs */
    proc means data=BootEst mean stddev P5 P95;
       var Intercept Time _A:;
    run;
     
    title "Distribution of Parameter Estimates";
    proc sgplot data=BootEst;
       scatter x=Intercept y=Time;
       xaxis grid; yaxis grid; 
       refline 77.5402 / axis=x;
       refline 2.7956  / axis=y;
    run;

    The scatter plot shows the bootstrap distribution of the parameter estimates of the linear trend. The reference lines indicate the parameter estimates for the original data. You can use the bootstrap distribution for inferential statistics such as estimation of standard errors, confidence intervals, the covariance of estimates, and more.

    You can perform a similar bootstrap analysis for any other statistic that is generated by any time series analysis. The important thing is that the block bootstrap is performed on some sort of residual or "noise" component, so be sure to remove the trend, seasonality, cycles, and so forth and then bootstrap the remainder.

    Summary

    This article shows how to perform a simple block bootstrap on a time series in SAS. First, you need to decompose the series into additive components: Y = Predicted + Residuals. You then choose a block length (L), which (for the simple block bootstrap) must divide the total length of the series (n). Each bootstrap resample is generated by randomly choosing blocks of residuals and adding them to the predicted model. This article uses the SAS/IML language to perform the simple block bootstrap in SAS.

    In practice, the simple block bootstrap is rarely used. However, it illustrates the basic ideas for bootstrapping a time series, and it provides a foundation for more sophisticated bootstrap methods.

    The post The simple block bootstrap for time series in SAS appeared first on The DO Loop.

    1月 052021
     

    Editor's note: This blog post is the first in a series of posts, originally published here by our partner News Literacy Project, exploring the role of data in understanding our world. Infographics are one of the most visual ways to tell stories with data. They are designed to catch the reader’s eye, and they use [...]

    Don't be misled: How to interpret data in infographics was published on SAS Voices by Jen Sabourin