14.3

6月 182018
 

Bootstrap resampling is a powerful way to estimate the standard error for a statistic without making any parametric assumptions about its sampling distribution. The bootstrap method is often implemented by using a sequence of calls to resample from the data, compute a statistic on each sample, and analyze the bootstrap distribution. An example is provided in the article "Compute a bootstrap confidence interval in SAS." This process can be lengthy and in Base SAS it requires reading and writing a large amount of data. In SAS/STAT 14.3 (SAS 9.4m5), the TTEST procedure supports the BOOTSTRAP statement, which automatically performs a bootstrap analysis of one-sample and two-sample t tests. The BOOTSTRAP statement also applies to two-sample paired tests.

The difference of means between two groups

The BOOTSTRAP statement makes it easy to obtain bootstrap estimates of bias and standard error for a statistic and confidence intervals (CIs) for the underlying parameter. The BOOTSTRAP statement supports several estimates for the confidence intervals, including normal-based intervals, t-based intervals, percentile intervals, and bias-adjusted intervals. This section shows how to obtain bootstrap estimates for a two-sample t test. The statistic of interest is the difference between the means of two groups.

The following SAS DATA step subsets the Sashelp.Cars data to create a data set that contains only two types of vehicles: sedans and SUVs. A call to PROC UNIVARIATE displays a comparative histogram that shows the distributions of the MPG_City variable for each group. The MPG_City variable measures the fuel efficiency (in miles per gallon) for each vehicle during typical city driving.

/* create data set that has two categories: 'Sedan' and 'SUV' */
data Sample;
set Sashelp.Cars(keep=Type MPG_City);
if Type in ('Sedan' 'SUV');
run;
 
proc univariate data=Sample;
   class Type;
   histogram MPG_City;
   inset N Mean Std Skew Kurtosis / position=NE;
   ods select histogram;
run;

Bootstrap estimates for a two-sample t test

Suppose that you want to test whether the mean MPG of the "SUV" group is significantly different from the mean of the "Sedan" group. The groups appear to have different variances, so you would probably choose the Satterthwaite version of the t test, which accommodates different variances. You can use PROC TTEST to run a two-sample t test for these data, but in looking at the distributions of the groups, you might be concerned that the normality assumptions for the t test are not satisfied by these data. Notice that the distribution of the MPG_City variable for the "Sedan" group has high skewness (1.3) and moderately high kurtosis (1.9). Although the t test is somewhat robust to the normality assumption, you might want to use the bootstrap method to estimate the standard error and confidence interval for the difference of means between the two groups.

If you are using SAS/STAT 14.3, you can compute bootstrap estimates for a t test by using the BOOTSTRAP statement, as follows:

title "Bootstrap Estimates with Percentile CI";
proc ttest data=Sample;
   class Type;
   var MPG_City;
   bootstrap / seed=123 nsamples=10000 bootci=percentile;  /* or BOOTCI=BC */
run;

The BOOTSTRAP statement supports three options:

  • The SEED= option initializes the internal random number generator for the TTEST procedure.
  • The NSAMPLES= option specifies the number of bootstrap resamples to be drawn from the data.
  • The BOOTCI= option specifies the estimate for the confidence interval for the parameter. This example uses the PERCENTILE method, which uses the α/2 and 1 – α/2 quantiles of the bootstrap distribution as the endpoints of the confidence interval. A more sophisticated second-order method is the bias-corrected interval, which you can specify by using the BOOTCI=BC option. For educational purposes, you might want to compare these nonparametric estimates with more traditional estimates such as t-based confidence intervals (BOOTCI=TBOOTSE).

The TTEST procedure produces several tables and graphs, but I have highlighted a few statistics in two tables. The top table is the "ConfLimits" table, which is based on the data and shows the traditional statistics for the t test. The estimate for the difference in means between the "SUV" and "Sedan" groups is -4.98 and is highlighted in blue. The traditional (parametric) estimate for a 95% confidence interval is highlighted in red. The interval is [-5.87, -4.10], which does not contain 0, therefore you can conclude that the group means are significantly different at the 0.05 significance level.

The lower table is the "Bootstrap" table, which is based on the bootstrap resamples. The TTEST documentation explains the resampling process and the computation of the bootstrap statistics. The top row of the table shows estimates for the difference of means. The bootstrap estimate for the standard error is 0.45. The estimate of bias (which subtracts the average bootstrap statistic from the sample statistic) is -0.01, which is small. The percentile estimate for the confidence interval is [-5.87, -4.14], which is similar to the parametric interval estimate in the top table. (For comparison, the bias-adjusted CI is also similar: [-5.85, -4.12].) Every cell in this table will change if you change the SEED= or NSAMPLES= options because the values in this table are based on the bootstrap samples.

Although the difference of means is the most frequent statistic to bootstrap, you can see from the lower table that the BOOTSTRAP statement also estimates the standard error, bias, and confidence interval for the standard deviation of the difference. Although this article focuses on the two-sample t test, the BOOTSTRAP statement also applies to one sample t tests.

In summary, the BOOTSTRAP statement in PROC TTEST in SAS/STAT 14.3 makes it easy to obtain bootstrap estimates for statistics in one-sample or two-sample t tests (and paired t tests). By using the BOOTSTRAP statement, the manual three-step bootstrap process (resample, compute statistics, and summarize) is reduced to a zero-step process. The TTEST procedure handles the details for you.

The post The BOOTSTRAP statement for t tests in SAS appeared first on The DO Loop.

6月 082018
 

I recently recorded a short video about the new syntax for specifying and manipulating lists in SAS/IML 14.3. This is a video of my Super Demo at SAS Global Forum 2018. The new syntax supports dynamic arrays, associative arrays ("named lists"), and hierarchical data structures such as lists of lists.


If your browser does not support embedded video, you can go directly to the video on YouTube.

If you prefer to read about lists, the following references provide more information and additional examples of ways to use lists:

The post Video: A new syntax for lists in SAS/IML appeared first on The DO Loop.

5月 092018
 

In a previous blog post, I discussed ways to produce statistically independent samples from a random number generator (RNG). The best way is to generate all samples from one stream. However, if your program uses two or more SAS DATA steps to simulate the data, you cannot use the same seed value multiple times because that could introduce biases into your study. An easy way to deal with this situation is to use the CALL STREAM subroutine to set different key values for each DATA step.

An example of using two DATA steps to generate random numbers

Suppose that you have a simulation that needs to call the DATA step twice, and each call generates random numbers. If you want to encapsulate that functionality into a macro call, you might want to pass in the seed for the random number streams. An INCORRECT approach is to do the following:

%macro DoItWRONG(seed=);
data A;
   call streaminit('PCG', &seed);
   /* ... do something with random numbers ... */
run;
data B;
   call streaminit('PCG', &seed);       /* <== uh-oh! Problem here! */
   /* ... do something else with random numbers ... */
run;
%mend;

This code is unacceptable because the same seed value and RNG are used for both DATA steps. That means that the stream of random numbers is the same for the second DATA step as for the first, which can lead to correlations that can bias your study.

Prior to SAS 9.4M5, the only way to resolve this issue was to use different seed values for each DATA step. For example, you could use &seed + 1 in the second DATA step. Because the Mersenne twister RNG (the only RNG in SAS prior to 9.4M5) has an astronomically large period, streams that use different seed values have an extremely small probability of overlapping.

Use the CALL STREAM subroutine

SAS 9.4M5 introduced new random number generators. The Mersenne twister algorithm supports one parameter (the seed) which initializes each stream. The Threefry and PCG algorithms support two initialization parameters. To prevent confusion, the first parameter is called the seed value and the second is called the key value. By default, the key value is zero, but you can use the CALL STREAM subroutine to set a positive key value.

You should use the key value (not the seed value) to generate independent streams in two DATA steps. The following statements are true:

  • The PCG generator is not guaranteed to generate independent streams for different seed values. However, you can use the key value to ensure independent streams. Two streams with the same seed and different key values never overlap.
  • The Threefry generators (TF2 and TF4) generate independent streams for different seed values. However, the Threefry generators also accept key values, and two streams with different key values never overlap.
  • The Mersenne twister algorithm does not inherently support a key value. However, the STREAM routine creates a new seed from the current seed and the specified key value. Because of the long-period property of the MT algorithm, two streams that use the same seed and different key values are "independent with high probability."

The following example modifies the previous macro to use the CALL stream routine. For the PCG RNG, two streams that have the same seed and different key values do not overlap:

%macro DoIt(seed=);
data A;
   call streaminit('PCG', &seed);
   call stream(1); 
   /* ... do something with random numbers ... */
run;
data B;
   call streaminit('PCG', &seed);
   call stream(2);                  /* <== different stream, guaranteed */
   /* ... do something else with random numbers ... */
run;
%mend;

Of course, this situation generalizes. If the macro uses three (or 10 or 20) DATA steps, each of which generates random numbers, you can use a different key value in each DATA step. If you initialize each DATA step with the same seed and a different key, the DATA steps generate independent streams (for the Threefry and PCG RNGs) or generate independent streams with high probability (for the Mersenne twister RNG).

A real example of using different streams

This section provides a real-life example of using the CALL STREAM subroutine. Suppose a researcher wants to create a macro that simulates many samples for a regression model. The macro first creates a data set that contains p independent continuous variables. The macro then simulates B response variables for a regression model that uses the independent variables from the first data set.

To prevent the second DATA step from using the same random number stream that was used in the first DATA step, the second DATA step calls the STREAM subroutine to set a key value:

%macro Simulate(seed=, N=, NCont=,  /* params for indep vars */
                key=, betas=,       /* params for dep vars */
                NumSamples=);       /* params for MC simul */
/* Step 1: Generate nCont explanatory variables */
data MakeData;
   call streaminit('PCG', &seed);
   /* if you do not call STREAM, you get the stream for key=0 */
   array x[&NCont];         /* explanatory vars are named x1-x&nCont  */
 
   do ObsNum = 1 to &N;     /* for each observation in the sample  */
      do j = 1 to &NCont;
         x[j] = rand("Normal"); /* simulate NCont explanatory variables   */
      end;
      output;
    end;
    keep ObsNum x:;
run;
 
/* Step 2: Generate NumSamples responses, each from the same regression model 
   Y = X*beta + epsilon, where epsilon ~ N(0, 0.1)  */
data SimResponse;
   call streaminit('PCG', &seed);
   call stream(&key); 
   keep SampleId ObsNum x: Y;
   /* define regression coefficients beta[0], beta[1], beta[2], ... */
   array beta[0:&NCont] _temporary_ (&betas);
 
   /* construct linear model eta = beta[0] + beta[1]*x1 + betat[2]*x2 + ... */
   set MakeData;                    /* for each obs in data ...*/
   array x[&nCont] x1-x&NCont;      /* put the explanatory variables into an array */
   eta = beta[0];
   do i = 1 to &NCont;
      eta = eta + beta[i] * x[i];
   end;
 
   /* add random noise for each obs in each sample */
   do SampleID = 1 to &NumSamples;
      Y = eta + rand("Normal", 0, 0.1); 
      output;
   end;
run;
 
/* sort to prepare data for BY-group analysis */
proc sort data=SimResponse;
   by SampleId ObsNum;
run;
%mend;

For a discussion of how the simulation works, see the article "Simulate many samples from a linear regression model." In this example, the macro uses two DATA steps to simulate the regression data. The second DATA step uses the STREAM subroutine to ensure that the random numbers it generates are different from those used in the first DATA step.

For completeness, here's how you can call the macro to generate the data and then a call PROC REG with a BY statement to analyze the 1,000 regression models.

%Simulate(seed=123, N=50, NCont=3,         /* params for indep vars */
          key=1, Betas = 1 -0.5 0.5 -0.25, /* params for dep vars */
          NumSamples=1000)                 /* params for MC sim */
 
proc reg data=SimResponse noprint outest=PE;
   by SampleID;
   model Y = x1-x3;
run;
 
title "Parameter Estimates for key=1";
proc print data=PE(obs=5) noobs;
   var SampleID Intercept x1-x3;
run;

You can verify that the macro generates different simulated data if you use the same seed value but change the key value in the macro call.

In summary, this article shows that you can use the STREAM call in the SAS DATA step to generate different random number streams from the same seed value. For many simulation studies, the STREAM subroutine is not necessary. However, it useful when the simulation is run in stages or is implemented in modules that need to run independently.

The post Independent streams of random numbers in SAS appeared first on The DO Loop.

5月 072018
 

Simulation studies require both randomness and reproducibility, two qualities that are sometimes at odds with each other. A Monte Carlo simulation might need to generate millions of random samples, where each sample contains dozens of continuous variables and many thousands of observations. In simulation studies, the researcher wants each sample to be statistically independent. For example, if you are drawing samples of size N=100 from a continuous distribution, it is not acceptable for the millionth sample to equal the first sample. Such a situation would bias the study. Similarly, it is not acceptable for half (or a fourth, or an eighth,...) of the numbers in the millionth sample to be the same as the numbers in the first sample.

The easiest way to ensure that each sample is statistically independent is to generate all samples from one high-quality random number stream. However, if the structure of your simulation requires that you use two or more streams, then SAS 9.4M5 provides several ways to produce random samples that are statistically independent. One way is to use one of the new Threefry random number generators. The other way is to use the new CALL STREAM subroutine.

Independence and overlap in random number streams

I've previously written about the many random number generators (RNGs) in SAS. Recall that all RNGs produce uniform variates. When you call a RNG repeatedly, you are generating a stream of random numbers. A stream is the sequence of pseudorandom uniform values that is generated from a given seed value. (If you request a random value from a nonuniform distribution, such as normal or exponential, the software generates one or more uniform variates and then applies a mathematical transformation to produce a nonuniform variate.)

In SAS, the modern RNGs are high quality and have a very long period. For all practical purposes, the samples that are generated from a single stream are independent. However, for the Mersenne twister and PCG RNGs, samples from different streams might not be independent. It might happen that the first sample for one stream is equal to (or has a large intersection with) the millionth sample from another seed, as shown in the image to the right. This phenomenon is known as overlap of streams.

The possibility that streams might overlap is important because some programmers use a macro loop to generate thousands of samples, where each sample is from a different stream. Not only is this technique inefficient, but the birthday problem in probability warns us that the probability that two streams overlap is larger than our intuition suggests.

The probability of overlapping streams is usually small, even for very large simulation studies. The "number of birthdays" equals the period of the RNG, which for Mersenne twister RNG is astronomically huge (≈ 106000). However, for the PCG RNG (which has a smaller period of "only" 264 ≈ 1019), the probability of an overlap could be nonnegligible.

Vigna (2009) estimates the probability of overlap when you use the first L variates in each of n streams from a RNG that has period P. Assuming that nL/P ≪ 1, the probability is approximately n2L/P. When P = 1019 and you generate n = 10,000 different streams, each containing L = 1,000,000 variates, the probability of an overlap is approximately 10-5.

Fortunately, SAS makes it it is easy to ensure the independence of random samples:

  • As shown in Simulating Data with SAS and in many blog posts, you can almost always use a single stream to generate all the random samples.
  • You can use one of the Threefry generators, TF2 or TF4. For a Threefry generator, two streams with different seed values never overlap! Similarly, streams for the hardware-based RDRAND generator never overlap.
  • You can use the CALL STREAM subroutine to set a key value for the PCG algorithm. For the PCG RNG, streams that have the same seed value but different key values never overlap.

In my next blog post, I will demonstrate how to use the STREAM subroutine in the SAS DATA step to generate independent streams. The technique can be useful when you are running a simulation that requires several DATA steps, each of which generates random numbers.

The post Independence and overlap in streams of random numbers appeared first on The DO Loop.

1月 242018
 

A popular way to use lists in the SAS/IML language is to pack together several related matrices into a single data structure that can be passed to a function. Imagine that you have written an algorithm that requires a dozen different parameters. Historically, you would have to pass those parameters to the function by explicitly listing each argument. If you use lists (introduced in SAS/IML 14.2, which is SAS 9.4M4), you can pack all parameters into one list and pass that list into the function module where the individual items can be extracted as needed.

Example: An algorithm that requires multiple parameters

To illustrate this list-passing technique, consider the algorithm for Newton's method, which is an iterative method for finding the roots of a nonlinear systems of equations. An implementation of Newton's method is included the SAS/IML documentation. Newton's method requires several arguments:

  • The name of a module that evaluates the function whose roots are sought.
  • The name of a module that evaluates the Jacobian matrix (the derivative of the function).
  • An initial guess for the solution.
  • The maximum number of iterations before the algorithm decides that the method is not converging.
  • A convergence criterion that determines the accuracy of the solution.

The example in the documentation (which was written in the 1980s) uses global variables and hard-codes the names of functions and the convergence criteria. The example is meant to demonstrate the basic ideas, but is not reusable. You can make the example more flexible and robust by passing parameters to the module as shown in the next section.

Create a list of parameters

Suppose that you define the modules F and DF to evaluate a multivariate function and the matrix of partial derivatives (the Jacobian matrix), respectively. The following modules evaluate the same mathematical function and derivatives that are used in the SAS/IML documentation:

proc iml;
/* Newton's method to solve for roots of the system of equations
  F(x1,x2) = [ x1+x2-x1*x2+2 ] = [ 0 ]
             [ x1*exp(-x2)-1 ]   [ 0 ]
*/
/* 1. define a function F that evaluates a multivariate function */
start F(x);
   x1 = x[1];   x2 = x[2];                 /* extract the values */
   f = (x1+x2-x1*x2+2) // (x1*exp(-x2)-1); /* evaluate the function */
   return ( f );
finish F;
 
/* 2. define a function DF that evaluates the Jacobian of F */
start DF(x);
   x1 = x[1];   x2 = x[2];                 /* extract the values */
   J = {. ., . .};
   J[1,1] = 1-x2;       J[1,2] = 1-x1;
   J[2,1] = exp(-x2);   J[2,2] =  -x1*exp(-x2);
   return ( J );
finish DF;

As shown in a previous article, you can create a named list that contains the parameters for Newton's algorithm. In SAS/IML 14.3, you can create the list by specifying a sequence of name-value pairs:

/* 3. list of NAME  = VALUE pairs  (syntax uses SAS/IML 14.3) */
args = [#"Func"     = "F",         /* name of module that evaluates the function */
        #"Deriv"    = "DF",        /* name of module that evaluates the deivative */ 
        #"x0"       = {0.1, -2.0}, /* initial guess for solution */
        #"MaxIters" = 100,         /* maximum iterations */
        #"ConvCrit" = 1e-6 ];      /* convergence criterion */

In SAS/IML 14.3, you can use the list item operator ($) to specify an item in a list. You can also get the name of the function and use CALL EXECUTE to evaluate the function at the initial guess, as follows:

x = args$"x0";                           /* initial guess */ 
EvalFunc = "y=" + args$"Func" + "(x);";  /* string for "y=F(x);" */
call execute(EvalFunc);                  /* evaluates function at x */
print EvalFunc, y;

This is a useful trick for evaluating a function by name. The next section uses this trick inside a module that implements Newton's method.

Pass a list of parameters to and from a SAS/IML module

You can now implement Newton's method by defining a SAS/IML module that takes one argument, which is a list. Inside the module, the parameters are extracted from the list and used to evaluate the functions F and DF and to control the iteration and convergence of Newton's method, as follows:

/* 4. Newton's method for solving the roots for F(x)=0. The argument to the module is a list. */
start Newton( L );
   x = L$"x0";                              /* initial guess */
   EvalFunc = "y=" + L$"Func" + "(x);";     /* string for "y=F(x);" */
   EvalDeriv = "D=" + L$"Deriv" + "(x);";   /* string for "D=DF(x);" */
 
   call execute(EvalFunc);          /* evaluate F at starting values */
   converged = (max(abs(y)) < L$"ConvCrit");              /* 0 or 1 */
   do iter = 1 to L$"Maxiters"      /* iterate until max iterations */
             while( ^converged );   /*      or until convergence */
      call execute(EvalDeriv);      /* evaluate derivatives D=DF(x) */
      delta = -solve(D,y);          /* solve for correction vector */
      x = x+delta;                  /* update x with new approximation */
      call execute(EvalFunc);       /* evaluate the function y=F(x)*/
      converged = (max(abs(y)) < L$"ConvCrit");
   end;
   /* you can return x or  a list that summarizes the results */
   result = [#"x"          = x,           /* approximate root (if converged) */
             #"iterations" = iter,        /* total number of iterations */
             #"status"      = converged]; /* 1 if root found; 0 otherwise */
   return ( result );
finish Newton;
 
/* 5. call Newton's method and pass in a list of parameters */
soln = Newton( args );              
xSoln = soln$"x";                   /* get root */
if soln$"status" then               /* check convergence statust */
   print "Newton's method converged in " (soln$"iterations") " iterations", xSoln;
else 
   print "Newton's method did not converge in " (soln$"iterations") " iterations";

The Newton module returns a named list that has three items. The item named "status" is a binary value that indicates whether the method converged. If the method converged, the item named "x" contains the approximate root. The item named "iterations" contains the number of iterations of the algorithm.

In summary, you can use lists to create a structure that contains many parameters. You can pass the list to a SAS/IML module, which can extract and use the parameters. This enables you to simplify the calling syntax for user-defined modules that require many parameters.

For additional examples of using lists to pass heterogeneous data to SAS/IML modules, see "More Than Matrices: SAS/IML Software Supports New Data Structures" (Wicklin, 2017, pp. 11–12) and the SAS/IML documentation for lists.

The post Use lists to pass parameters to SAS/IML functions appeared first on The DO Loop.

1月 222018
 

SAS/IML 14.3 (SAS 9.4M5) introduced a new syntax for creating lists and for assigning and extracting item in a list. Lists (introduced in SAS/IML 14.2) are data structures that are convenient for holding heterogeneous data. A single list can hold character matrices, numeric matrices, scalar values, and other lists, as discussed in a previous article about how to use lists in SAS/IML.

The list creation operator

You can use square brackets to create a list. The elements of the list are separated by commas. For example, the following syntax creates a list that contains three elements. The list represents a hypothetical patient in a cholesterol-lowering study.

proc iml;
/* 1. New list creation syntax (L = [val1, val2,...]) */
/*                 Cholesterol at
       Name  Age   0mo  3mo  6mo  12mo */
P1 = ["Bob",  36, {182, 170, 170, 162}]; /* Patient 1 */

The name of the list is P1. The first element of the list is a string, the second is a scalar number, and the third is a numeric vector. In this case, the elements are specified by using literal values, but you can also use previously defined variables. The following statement is equivalent but defines the list by using existing variables:

Name = "Bob"; Age = 36; Chol = {182, 170, 170, 162};
P1 = [Name, Age, Chol];   /* define list from copies of existing variables */

As mentioned earlier, lists can contain other lists. For example, if you have multiple patients in a study, you can create a list of each patient's data, then create a list that contains all the patients' data, as follows:

P2 = ["Fred", 52, {175, 165, 155}     ]; /* Patient 2 */
P3 = ["Tom",  45, {160, 145,   ., 139}]; /* Patient 3 */
Patients = [P1, P2, P3];                 /* a list of patients */

Assign and extract list items

You can use the list item operator ($) to specify an item in a list. For each patient, the age is stored as the second item in a list. If you want the age of Bob (the first patient), you can use either of the following statements:

/* 2. New list item syntax ($) */
BobAge = P1$2;         /* get 2nd item from P1 list */
BobAge = Patients$1$2; /* get 1st item of Patients, then 2nd item from that list */

The first statement is straightforward: P1$2 means "get the second item from the P1 list." The second statement is parsed left to right. The syntax Patient$1 means "get the first item from the Patients list," which is equivalent to P1. Thus Patient$1$2 gets Bob's age.

The preceding example uses literal values to specify the position of an item in a list, but you can also use a variable. This makes it possible to extract items in a loop. For example, the following statements loop over all patients, extract the name and cholesterol values of the patient, and compute each patient's average cholesterol value during the study:

N = ListLen(Patients);      /* number of patients */
Name = j(N,1,"     ");      /* allocate vector for names */
AvgChol = j(N,1,.);         /* allocate vector for results */
do i = 1 to N;
   Name[i] = Patients$i$1;  /* get name of i_th patient */
   Chol = Patients$i$3;     /* get sequence of cholesterol values */
   AvgChol[i] = mean(Chol); /* average value */
end;
print AvgChol[rowname=Name];

You can use the list item operator on the left side of an assignment operator to change an item in a list. For example, suppose you discover that Bob's age was typed wrong: Bob is really 63, not 36. You can update Bob's data as follows:

P1$2 = 63;         /* update 2nd item in P1 list */
Patients$1 = P1;   /* update entire patient list */

Alternatively, you could use the syntax Patients$1$2 = 63 to update the data in place.

Extract sublists

To subset a list, use square brackets ([]) and specify the indices of the items. For example, the following statement extracts the second and third patients into a new list:

/* 3. Extract sublist SL = L[ {i1, i2,...} ] */
SubList = Patients[ {2 3} ]; /* Fred and Tom */

The sublist operator ([]) ALWAYS returns a list, even if you specify only one item! Thus Patients[1] is a LIST that contains one item. If you want the item itself, use Patients$1.

Named lists

In the previous example, the items in the list P1 are a patient's name, age, and cholesterol readings. If you want to extract Bob's age, you can write P1$2, but someone unfamiliar with the order of the list items would have no idea what that item represents. Thus it is helpful to define a named list, which is also called an associative array. When you specify a named list, you specify the items by using name-value pairs, as follows:

P = [#"Name" = "Bob",                        /* P$1 or P$"Name" */
     #"Age"  = 63,                           /* P$2 or P$"Age" */
     #"Cholesterol" = {182, 170, 170, 162}]; /* P$3 or P$"Cholesterol" */

You can use the names to refer to the items in a list. For example, the following statements extract the patient's name and cholesterol readings by using the list item operator:

Name = P$"Name";
Chol = P$"Cholesterol";       /* get Bob's measurements */
print Chol[Label=Name];

You can also use names in the sublist operator to extract a sublist:

L = P[ {"Age" "Cholesterol"} ];  /* sublist that contains two items */

Summary

In summary, SAS/IML 14.3 contains new syntax for creating a list and for extracting items and sublists. This syntax makes it easier to use lists and to read and write SAS/IML programs that use lists.

The post Create lists by using a natural syntax in SAS/IML appeared first on The DO Loop.

11月 292017
 
Heat map of missing values among among 8 variables

Missing values present challenges for the statistical analyst and data scientist. Many modeling techniques (such as regression) exclude observations that contain missing values, which can reduce the sample size and reduce the power of a statistical analysis. Before you try to deal with missing values in an analysis (for example, by using multiple imputation), you need to understand which variables contain the missing values and to examine the patterns of missing values. I have previously written about how to use SAS to do the following:

An example of a visualization of a missing value pattern is shown to the right. The heat map shows the missing data for eight variables and 5209 observations in the Sashelp.Heart data set. It is essentially the heat map of a binary matrix where 1 indicates nonmissing values (white) and 0 indicates missing values (black).

In this article, I present a bar chart that helps to visualize the "Missing Data Patterns" table from PROC MI in SAS/STAT software. I also show two SAS tricks:

The pattern of missing data

The MI procedure can perform multiple imputation of missing data, but it also can be used as a diagnostic tool to group observations according to the pattern of missing data. You can use the NIMPUTE=0 option to display the pattern of missing value. By default, the "Missing Data Patterns" table is very wide because it includes the group means for each variable. However, you can use the DISPLAYPATTERN=NOMEANS option (SAS9.4M5) to suppress the group means, as follows:

%let Vars = AgeAtStart Height Weight Diastolic Systolic MRW Smoking Cholesterol;
%let numVars = %sysfunc(countw(&Vars));    /* &numVars = 8 for this example */
 
proc mi data=Sashelp.Heart nimpute=0 displaypattern=nomeans;   /* SAS 9.4M5 option */
var &Vars;
ods output MissPattern=Pattern;        /* output missing value pattern to data set */
run;

In the table, an X indicates that a variable does not contain any missing values, whereas a dot (.) indicates that a variable contains missing values. The table shows that Group 1 consists of about 5000 observations that are complete. Groups 2, 3, and 6 contains observations for which exactly one variable contains missing values. Groups 4 and 5 contain observations for which two variables are jointly missing. Group 7 contains observations for which three variables are missing. You can see that the size of the groups vary widely. Some patterns of missing value appear only a few times, whereas others appear much more often.

By inspecting the table, you can determine which combinations of variables are missing for each group. However, imagine a dataset that has 20 variables. The "Missing Data Patterns" table for 20 variables would be very wide, and it would be cumbersome to determine which combination of variables are jointly missing. The next section condenses the table into a smaller format and then creates a graph to summarize the pattern of missing values.

Shorten the labels for the missing data patterns

The information in the "Missing Data Patterns" table can be condensed. I saw the following idea in Chapter 10 of Gerhard Svolba's Data Quality for Analytics Using SAS, who credits the idea to a display that appears in JMP software. (Svolba does not use PROC MI, but defines a macro that you can download from the book's website.)

The table is wide because there is a column for every variable in the analysis. But the information in those columns is binary: for the i_th group does the j_th variable have a missing value? If the analysis involves k variables, you can replace the k columns by a single binary string that has k digits. The j_th digit in the string indicates whether the j_th variable has a missing value.

In the preceding analysis, the ODS OUTPUT statement wrote the "Missing Data Patterns" table to a data set. The following SAS DATA step reads the data and constructs a binary string of length k from the k character variables in the data. The binary string is formed by using the CATT function to concatenate the 'X' and '.' values in the table. The TRANSLATE function then converts those characters to '0' and '1' characters. The DATA step also computes the NumMiss variable, which counts the number of variables that have missing values for each row:

data Miss;
set Pattern;
array vars[*] _CHARACTER_;    /* or use &Vars */
length Pattern $&numVars.;    /* length = number of variables in array */
Pattern = translate(catt(of vars[*]), '01', 'X.');     /* Ex: 00100100 */
NumMiss = countc(pattern, '1');
run;
 
proc print data=Miss noobs;
   var Group Pattern NumMiss Freq;
run;

The table shows that the eight columns for the analysis variables have been replaced by a single column that displays an eight-character binary string. For Group 1, the string is all zeros, which indicates that no variable has missing values. For Group 2, the binary string contains all zeros except for a 1 in the last position. This means that Group 1 is the set of observations for which the last variable contains a missing value. Similarly, Group 7 is the set of observations for which the second, third, and sixth variables contain a missing value. Notice that this table does not provide the names of the variables. If you cannot remember the name of the second, third, and sixth variables, you need to look them up in the VARS macro variable.

Create a bar chart that has a logarithmic scale

The condensed version of the "Missing Data Patterns" table is suitable to graph as a bar chart. However, for these data the size of the groups vary widely. Consequently, you might want to plot the frequencies of the groups on a logarithmic scale. (Or you might not! There is considerable debate about whether you should display a bar chart that has a logarithmic axis. For a discussion and alternatives, see Sanjay's article "Graphs with log axis." My opinion is that a log axis is fine to use for a technical audience such as statisticians.)

By default, you cannot use a logarithmic scale on a bar chart because the baseline for the vertical axis (the frequency or count axis) starts at 0 and the LOG function is not defined at 0. If you have count data and no category has zero counts, then you can set the baseline of the graph to 1. The bars then indicate the log-frequency of the counts. The following call to PROC SGPLOT creates the bar chart and a marginal table:

title "Pattern of Missing Values";
proc sgplot data=Miss;
   hbar Pattern /response=Freq datalabel
                       baseline=1;  /* set BASELINE=1 for log scale */
   yaxistable NumMiss / valuejustify=left label="Num Miss"
                       valueattrs=GraphValueText labelattrs=GraphLabelText; /* use same attributes as axis */
   yaxis labelposition=top; 
   xaxis grid type=log logbase=10 label="Frequency (log10 scale)";
run;

This graph displays the counts of the number of observations in each pattern group. The labels on the Y axis indicate which variables have missing data. The values in the right margin indicate how many variables have missing data. If you are opposed to drawing bar charts on a logarithmic axis, you can use one of Sanjay's alternative visualizations.

In summary, you can create a bar chart that visualizes the number of observations that have a similar pattern of missing values. The summarization comes from PROC MI in SAS, which has a new DISPLAYPATTERN=NOMEANS option. A short DATA step can create a binary string for each group. That string can be used to indicate the missing data pattern in each group. If the groups vary widely in size, you can use a logarithmic axis to display the data. To create a bar chart on a logarithmic scale in SAS, set BASELINE=1 in the HBAR statement and use TYPE=LOG on the XAXIS statement in PROC SGPLOT.

The post Visualize patterns of missing values appeared first on The DO Loop.

11月 292017
 
Heat map of missing values among among 8 variables

Missing values present challenges for the statistical analyst and data scientist. Many modeling techniques (such as regression) exclude observations that contain missing values, which can reduce the sample size and reduce the power of a statistical analysis. Before you try to deal with missing values in an analysis (for example, by using multiple imputation), you need to understand which variables contain the missing values and to examine the patterns of missing values. I have previously written about how to use SAS to do the following:

An example of a visualization of a missing value pattern is shown to the right. The heat map shows the missing data for eight variables and 5209 observations in the Sashelp.Heart data set. It is essentially the heat map of a binary matrix where 1 indicates nonmissing values (white) and 0 indicates missing values (black).

In this article, I present a bar chart that helps to visualize the "Missing Data Patterns" table from PROC MI in SAS/STAT software. I also show two SAS tricks:

The pattern of missing data

The MI procedure can perform multiple imputation of missing data, but it also can be used as a diagnostic tool to group observations according to the pattern of missing data. You can use the NIMPUTE=0 option to display the pattern of missing value. By default, the "Missing Data Patterns" table is very wide because it includes the group means for each variable. However, you can use the DISPLAYPATTERN=NOMEANS option (SAS9.4M5) to suppress the group means, as follows:

%let Vars = AgeAtStart Height Weight Diastolic Systolic MRW Smoking Cholesterol;
%let numVars = %sysfunc(countw(&Vars));    /* &numVars = 8 for this example */
 
proc mi data=Sashelp.Heart nimpute=0 displaypattern=nomeans;   /* SAS 9.4M5 option */
var &Vars;
ods output MissPattern=Pattern;        /* output missing value pattern to data set */
run;

In the table, an X indicates that a variable does not contain any missing values, whereas a dot (.) indicates that a variable contains missing values. The table shows that Group 1 consists of about 5000 observations that are complete. Groups 2, 3, and 6 contains observations for which exactly one variable contains missing values. Groups 4 and 5 contain observations for which two variables are jointly missing. Group 7 contains observations for which three variables are missing. You can see that the size of the groups vary widely. Some patterns of missing value appear only a few times, whereas others appear much more often.

By inspecting the table, you can determine which combinations of variables are missing for each group. However, imagine a dataset that has 20 variables. The "Missing Data Patterns" table for 20 variables would be very wide, and it would be cumbersome to determine which combination of variables are jointly missing. The next section condenses the table into a smaller format and then creates a graph to summarize the pattern of missing values.

Shorten the labels for the missing data patterns

The information in the "Missing Data Patterns" table can be condensed. I saw the following idea in Chapter 10 of Gerhard Svolba's Data Quality for Analytics Using SAS, who credits the idea to a display that appears in JMP software. (Svolba does not use PROC MI, but defines a macro that you can download from the book's website.)

The table is wide because there is a column for every variable in the analysis. But the information in those columns is binary: for the i_th group does the j_th variable have a missing value? If the analysis involves k variables, you can replace the k columns by a single binary string that has k digits. The j_th digit in the string indicates whether the j_th variable has a missing value.

In the preceding analysis, the ODS OUTPUT statement wrote the "Missing Data Patterns" table to a data set. The following SAS DATA step reads the data and constructs a binary string of length k from the k character variables in the data. The binary string is formed by using the CATT function to concatenate the 'X' and '.' values in the table. The TRANSLATE function then converts those characters to '0' and '1' characters. The DATA step also computes the NumMiss variable, which counts the number of variables that have missing values for each row:

data Miss;
set Pattern;
array vars[*] _CHARACTER_;    /* or use &Vars */
length Pattern $&numVars.;    /* length = number of variables in array */
Pattern = translate(catt(of vars[*]), '01', 'X.');     /* Ex: 00100100 */
NumMiss = countc(pattern, '1');
run;
 
proc print data=Miss noobs;
   var Group Pattern NumMiss Freq;
run;

The table shows that the eight columns for the analysis variables have been replaced by a single column that displays an eight-character binary string. For Group 1, the string is all zeros, which indicates that no variable has missing values. For Group 2, the binary string contains all zeros except for a 1 in the last position. This means that Group 1 is the set of observations for which the last variable contains a missing value. Similarly, Group 7 is the set of observations for which the second, third, and sixth variables contain a missing value. Notice that this table does not provide the names of the variables. If you cannot remember the name of the second, third, and sixth variables, you need to look them up in the VARS macro variable.

Create a bar chart that has a logarithmic scale

The condensed version of the "Missing Data Patterns" table is suitable to graph as a bar chart. However, for these data the size of the groups vary widely. Consequently, you might want to plot the frequencies of the groups on a logarithmic scale. (Or you might not! There is considerable debate about whether you should display a bar chart that has a logarithmic axis. For a discussion and alternatives, see Sanjay's article "Graphs with log axis." My opinion is that a log axis is fine to use for a technical audience such as statisticians.)

By default, you cannot use a logarithmic scale on a bar chart because the baseline for the vertical axis (the frequency or count axis) starts at 0 and the LOG function is not defined at 0. If you have count data and no category has zero counts, then you can set the baseline of the graph to 1. The bars then indicate the log-frequency of the counts. The following call to PROC SGPLOT creates the bar chart and a marginal table:

title "Pattern of Missing Values";
proc sgplot data=Miss;
   hbar Pattern /response=Freq datalabel
                       baseline=1;  /* set BASELINE=1 for log scale */
   yaxistable NumMiss / valuejustify=left label="Num Miss"
                       valueattrs=GraphValueText labelattrs=GraphLabelText; /* use same attributes as axis */
   yaxis labelposition=top; 
   xaxis grid type=log logbase=10 label="Frequency (log10 scale)";
run;

This graph displays the counts of the number of observations in each pattern group. The labels on the Y axis indicate which variables have missing data. The values in the right margin indicate how many variables have missing data. If you are opposed to drawing bar charts on a logarithmic axis, you can use one of Sanjay's alternative visualizations.

In summary, you can create a bar chart that visualizes the number of observations that have a similar pattern of missing values. The summarization comes from PROC MI in SAS, which has a new DISPLAYPATTERN=NOMEANS option. A short DATA step can create a binary string for each group. That string can be used to indicate the missing data pattern in each group. If the groups vary widely in size, you can use a logarithmic axis to display the data. To create a bar chart on a logarithmic scale in SAS, set BASELINE=1 in the HBAR statement and use TYPE=LOG on the XAXIS statement in PROC SGPLOT.

The post Visualize patterns of missing values appeared first on The DO Loop.