Aug 17, 2020
 

I recently showed how to use simulation to estimate the power of a statistical hypothesis test. The example (a two-sample t test for the difference of means) is a simple SAS/IML module that is very fast. Fast is good because often you want to perform a sequence of simulations over a range of parameter values to construct a power curve. If you need to examine 100 sets of parameter values, you would expect the full computation to take about 100 times longer than one simulation.

But that calculation applies only when you run the simulations sequentially. If you run the simulations in parallel, you can complete the simulation study much faster. For example, if you have access to a machine or cluster of machines that can run 32 threads, then each thread needs to run only a few simulations. You can perform custom parallel computations by using the iml action in SAS Viya. The iml action is supported in SAS Viya 3.5.

This article shows how to distribute computations by using the PARTASKS function in the iml action. (PARTASKS stands for PARallel TASKS.) If you are not familiar with the iml action or with parallel processing, see my previous example that uses the PARTASKS function.

Use simulation to estimate a power curve

I previously showed how to use simulation to estimate the power of a statistical test. In that article, I simulated data from N(15,5) and N(16,5) distributions. Because the t test is invariant under centering and scaling operations, this study is mathematically equivalent to simulating from the N(0,1) and N(0.2,1) distributions. (Subtract 15 and divide by 5.)

Recall that the power of a statistical hypothesis test is the probability of rejecting the null hypothesis when the alternative hypothesis is true. (In contrast, a Type 2 error is failing to reject the null hypothesis when the alternative hypothesis is true, so the power equals 1 minus the probability of a Type 2 error.) The power depends on the sample sizes and the magnitude of the effect you are trying to detect. In this case, the effect is the difference between two means. You can study how the power depends on the mean difference by estimating the power for a t test between data from an N(0,1) distribution and an N(δ, 1) distribution, where δ is a parameter in the interval [0, 2]. All hypothesis tests in this article use the α=0.05 significance level.

The power curve we want to estimate is shown to the right. The horizontal axis is the δ parameter, which represents the true difference between the population means. Each point on the curve is an estimate of the power of a two-sample t test for random samples of size n1 = n2 = 10. The curve indicates that when there is no difference between the means, you will conclude otherwise in 5% of random samples (because α=0.05). For larger differences, the probability of detecting the difference increases. If the difference is very large (more than 1.5 standard deviations apart), more than 90% of random samples will correctly reject the null hypothesis in the t test.

The curve is calculated at 81 values of δ, so this curve is the result of running 81 independent simulations. Each simulation uses 100,000 random samples and carries out 100,000 t tests. Although it is hard to see, the graph also shows 95% confidence intervals for power. The confidence intervals are very small because each power estimate is based on 100,000 samples.
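Each point on the curve is a binomial proportion estimated from B samples, so the width of its 95% Wald interval shrinks like 1/√B. This stdlib-only Python sketch (an illustration, not part of the SAS program) shows why B = 100,000 makes the intervals nearly invisible:

```python
import math

def wald_ci(p, B, z=1.96):
    """95% Wald confidence interval for a binomial proportion
    estimated from B Monte Carlo samples."""
    se = math.sqrt(p * (1 - p) / B)
    return p - z * se, p + z * se

# The standard error is largest when p = 0.5; even in that worst
# case, the interval half-width with B = 100,000 is about 0.003.
lcl, ucl = wald_ci(0.5, 100_000)
print(lcl, ucl)
```

With a smaller study (say B = 1,000), the same call gives intervals roughly ten times wider, which is why the plotted bands are visible only at low sample counts.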

So how long did it take to compute this power curve, which is the result of 8.1 million t tests? About 0.4 seconds by using parallel computations in the iml action.

Using the PARTASKS function to distribute tasks

Here's how you can write a program in the iml action to run 81 simulations across 32 threads. The following program uses four nodes, each with 8 threads, but you can achieve similar results by using other configurations. The main parts of the program are as follows:

  • The program assumes that you have used the CAS statement in SAS to establish a CAS session that uses four worker nodes, each with at least eight CPUs. For example, the CAS statement might look something like this:
        cas sess4 host='your_host_name' casuser='your_username' port=&MPPPort nworkers=4;
    The IML program is contained between the SOURCE/ENDSOURCE statements in PROC CAS.
  • The TTestH0 function implements the t test. It runs a t test on the columns of the input matrices and returns the proportion of samples that reject the null hypothesis. The TTestH0 function is explained in a previous article.
  • The "task" we will distribute is the SimTTest function. (This encapsulates the "main program" in my previous article.) The SimTTest function simulates the samples, calls the TTestH0 function, and returns the number of t tests that reject the null hypothesis. The first sample is always from the N(0, 1) distribution and the second sample is from the N(δ, 1) distribution. Each call creates B samples. The input to the SimTTest function is a list that contains all parameters: a random-number seed, the sample sizes (n1 and n2), the number of samples (B), and the value of the parameter (delta).
  • The main program first sets up a list of arguments for each task. The i_th item in the list is the argument to the i_th task. For this computation, all arguments are the same (a list of parameters) except that the i_th argument includes the parameter value delta[i], where delta is a vector of parameter values in the interval [0, 2].
  • The call to the PARTASKS function is
         RL = ParTasks('SimTTest', ArgList, {2, 0});
    where the first parameter is the task (or list of tasks), the second parameter is the list of arguments to the tasks, and the third parameter specifies how to distribute the computations. The documentation of the PARTASKS function provides the full details.
  • After the PARTASKS function returns, the program processes the results. For each value of the parameter, δ, the main statistic is the proportion of t tests (p) that reject the null hypothesis. The program also computes a 95% (Wald) confidence interval for the binomial proportion. These statistics are written to a CAS table called 'PowerCurve' by using the MatrixWriteToCAS subroutine.
/* Power curve computation: delta = 0 to 2 by 0.025 */
proc cas;
session sess4;                         /* use session with four workers     */
loadactionset 'iml';                   /* load the iml action set           */
source TTestPower;
 
/* Helper function: Compute t test for each column of X and Y.
   X is (n1 x m) matrix and Y is (n2 x m) matrix.
   Return the number of columns for which t test rejects H0 */
start TTestH0(x, y);
   n1 = nrow(X);     n2 = nrow(Y);     /* sample sizes                      */
   meanX = mean(x);  varX = var(x);    /* mean & var of each sample         */
   meanY = mean(y);  varY = var(y);
   poolStd = sqrt( ((n1-1)*varX + (n2-1)*varY) / (n1+n2-2) );
 
   /* Compute t statistic and indicator var for tests that reject H0 */
   t = (meanX - meanY) / (poolStd*sqrt(1/n1 + 1/n2));
   t_crit =  quantile('t', 1-0.05/2, n1+n2-2);       /* alpha = 0.05        */
   RejectH0 = (abs(t) > t_crit);                     /* 0 or 1              */
   return  RejectH0;                                 /* binary vector       */
finish;
 
/* Simulate two groups; Count how many reject H0: delta=0 */
start SimTTest(L);                     /* define the mapper                 */
   call randseed(L$'seed');            /* each thread uses different stream */
   x = j(L$'n1', L$'B');               /* allocate space for Group 1        */
   y = j(L$'n2', L$'B');               /* allocate space for Group 2        */
   call randgen(x, 'Normal', 0,         1);   /* X ~ N(0,1)                 */
   call randgen(y, 'Normal', L$'delta', 1);   /* Y ~ N(delta,1)             */
   return sum(TTestH0(x, y));          /* count how many reject H0          */
finish;
 
/* ----- Main Program ----- */
numSamples = 1e5;
L = [#'delta' = .,   #'n1' = 10,  #'n2' = 10,
     #'seed'  = 321, #'B'  = numSamples];
 
/* Create list of arguments. Each arg gets different value of delta */
delta = T( do(0, 2, 0.025) );
ArgList = ListCreate(nrow(delta));
do i = 1 to nrow(delta);
   L$'delta' = delta[i];
   call ListSetItem(ArgList, i, L);
end;
 
RL = ParTasks('SimTTest', ArgList, {2, 0});  /* assign nodes before threads */
 
/* Summarize results and write to CAS table for graphing */
varNames = {'Delta' 'ProbEst' 'LCL' 'UCL'};  /* names of result vars        */
Result = j(nrow(delta), 4, .);
zCrit = quantile('Normal', 1-0.05/2);  /* zCrit = 1.96                      */
do i = 1 to nrow(delta);               /* for each task                     */
   p = RL$i / numSamples;              /* get proportion that reject H0     */
   SE = sqrt(p*(1-p) / L$'B');         /* std err for binomial proportion   */
   LCL = p - zCrit*SE;                 /* 95% CI                            */
   UCL = p + zCrit*SE;
   Result[i,] = delta[i] || p || LCL || UCL;
end;
 
call MatrixWriteToCas(Result, '', 'PowerCurve', varNames);
endsource;
iml / code=TTestPower nthreads=8;
quit;

You can pull the 81 statistics back to a SAS data set and use PROC SGPLOT to plot the results. You can download the complete program that generates the statistics and graphs the power curve.

Notice that there are 81 tasks but only 32 threads. That is not a problem. The PARTASKS function tries to distribute the workload as evenly as possible. In this example, 17 threads are assigned three tasks and 15 threads are assigned two tasks. If T is the time required to perform one power estimate, the total time to compute the power curve will be approximately 3T, plus "overhead costs" such as the time required to set up the problem, distribute the parameters to each task, and aggregate the results. You can minimize the overhead costs by passing only small amounts of data across nodes.
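The even split is simple arithmetic: 81 = 2×32 + 17. A small Python sketch (an illustration of the counting, not the action's actual scheduler) verifies the 17-and-15 breakdown:

```python
def split_tasks(num_tasks, num_threads):
    """Distribute tasks as evenly as possible: the first
    (num_tasks % num_threads) threads each get one extra task."""
    base, extra = divmod(num_tasks, num_threads)
    return [base + 1 if i < extra else base for i in range(num_threads)]

loads = split_tasks(81, 32)
print(loads.count(3), loads.count(2))   # 17 threads get 3 tasks, 15 get 2
```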

Summary

In summary, if you have a series of independent tasks in IML, you can use the PARTASKS function to distribute tasks to available threads. The speedup can be dramatic; it depends on the time required for the thread that performs the most work.

This article shows an example of using the PARTASKS function in the iml action, which is available in SAS Viya 3.5. The example shows how to distribute a sequence of computations among k threads that run concurrently and independently. In this example, k=32. Because the tasks are essentially identical, each thread performs 2 or 3 of the 81 tasks. The results from each task are returned in a list, where they can be further processed by the main program.

For more information and examples, see Wicklin and Banadaki (2020), "Write Custom Parallel Programs by Using the iml Action," which is the basis for these blog posts. Another source is the SAS IML Programming Guide, which includes documentation and examples for the iml action.

The post Estimate a power curve in parallel in SAS Viya appeared first on The DO Loop.

Aug 12, 2020
 

As he watched me unpack yet another Amazon delivery, my husband decided to do an intervention. Scattered around me on the floor were solar panels, water purification tablets, tarps, hunting knives and enough batteries to power New York City for many weeks. It was 2019 and hurricane season was upon [...]

Manufacturing production during the pandemic: Panicked or pragmatic? was published on SAS Voices by Marcia Walker

Aug 12, 2020
 

In my earlier blog post, Changing variable type and variable length in SAS datasets, I showed how you can effectively change variable lengths in a SAS data set. That approach works fine when you need to change the length attribute for a few variables, on a case-by-case basis. But what if you need to change lengths for all character variables in a data set? Or for all data sets in a data library? For example, suppose you need to expand (increase) all your character variable lengths by 50%. Then the case-by-case approach becomes too laborious and inefficient.

What is a character variable’s length attribute?

Before reading any further, let’s take a quick quiz:

Q: A character variable length attribute represents a number of:

  A. Bits
  B. Bytes
  C. Centimeters
  D. Characters

If your answer is anything but B, it’s incorrect. According to the SAS documentation, length refers to the number of bytes used to store each of the variable's values in a SAS data set. You can use a LENGTH statement to set the length of both numeric and character variables.

It is true that for some older encoding systems (ASCII, ISO/IEC 8859, EBCDIC, etc.) there was no difference between the number of bytes and the number of characters, because those systems use exactly one byte per character. They are even called Single Byte Character Sets (SBCS) for that reason. The problem is that they can accommodate a maximum of only 2^8 = 256 symbols, which is not nearly enough to cover all the variety of natural languages, special characters, emojis, etc.

Why would we want to expand character variable lengths?

Use case 1. Expanding character values range

For this scenario, let's consider Internet traffic analysis where your data contains multiple character columns for Internet Protocol addresses (IP addresses) in the 32-bit version 4 format (IPv4, e.g. ‘125.255.201.180’). You transition to the newer 128-bit IPv6 standard (e.g. ‘2001:0000:3238:DFE1:0063:0000:0000:FEFB’) and need to modify your data structure to accommodate the new standard with longer character values.

Use case 2. Migrating SAS data to multi-byte encoding environment

In this scenario, you migrate or move SAS data sets from older SBCS environments to newer Multi-Byte Character Set (MBCS) encoding environments. For such a case, the ability to increase character variable lengths in bulk with a simple action becomes especially significant and critical.

Currently, the most commonly used MBCS is Unicode, which is supported by all modern operating systems, databases, and web browsers. Of the different flavors of Unicode (UTF-8, UTF-16, UTF-32), the most popular is UTF-8. UTF-8 (8-bit Unicode Transformation Format) is a variable-width character set that uses from 1 to 4 one-byte (8-bit) code units per character; it is capable of encoding 1,112,064 characters, which cover most modern languages, including Arabic and Hebrew characters, hieroglyphs, and emojis, as well as many other special characters.

Since each UTF-8 encoded character may require somewhere between one and four bytes, and not all SBCS characters are represented by one byte in UTF-8, data migration from SBCS to UTF-8 may cause data truncation and subsequently data loss.
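A short Python sketch illustrates the character-count versus byte-count distinction that causes this truncation risk:

```python
# Characters vs. UTF-8 bytes: ASCII needs 1 byte per character,
# but accented letters, Cyrillic, and emojis need 2-4 bytes each.
for s in ["abc", "Straße", "Здравствуйте", "😀"]:
    print(s, len(s), len(s.encode("utf-8")))
```

For example, "Straße" is 6 characters but 7 UTF-8 bytes, and a single emoji occupies 4 bytes, so a character variable whose byte length exactly fits the SBCS data can overflow after transcoding.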

When SAS reads an SBCS-encoded data set and writes its records into UTF-8-encoded data set it throws an ERROR message in the log and stops execution:

ERROR: Some character data was lost during transcoding in the dataset LIBREF.DSNAME. Either the data contains characters that are not representable in the new encoding or truncation occurred during transcoding.

When SAS reads an SBCS-encoded data set and produces a UTF-8-encoded printed report only (without generating a UTF-8-encoded output data set) it generates a WARNING message (with identical description as the above ERROR message) while continuing execution:

WARNING: Some character data was lost during transcoding in the dataset LIBREF.DSNAME. Either the data contains characters that are not representable in the new encoding or truncation occurred during transcoding.

Either ERROR or WARNING is unacceptable and must be properly addressed.

How to expand all character variable lengths?

Regardless of character transcoding, SAS' CVP engine is a short and effective answer to this question. CVP stands for Character Variable Padding, which is exactly what this special-purpose engine does: it pads, or expands, character variable lengths by a number of bytes. The CVP engine is part of Base SAS and does not require any additional licensing.

The CVP engine is a read-only engine for SAS data sets only. You can think of it as a magnifying glass: it creates an expanded view of the character data descriptors (lengths) without changing them. Still, we can use the CVP engine to actually change a data set, or a whole data library, to its expanded-character-variables version. All we need to do is define our source library as a CVP library, for example:

libname inlib cvp 'c:\source_folder';

Then use PROC COPY to create expanded versions of our original data sets in a target library:

libname outlib 'c:\target_folder';
proc copy in=inlib out=outlib noclone;
   select dataset1 dataset2;
run;

Or, if we need to expand character variable lengths for the whole library, then we use the same PROC COPY without the SELECT statement:

proc copy in=inlib out=outlib noclone;
run;

It’s that easy. And the icing on the cake is that the CVP engine automatically adjusts the variables' format widths to match the expanded byte lengths for all converted character variables.

Avoiding character data truncation by using the CVP Engine

The CVP engine is a near-perfect SAS solution to the problem of potential data truncation when data is transcoded during migration from SBCS-based to MBCS-based systems.

To avoid data loss from possible truncation during transcoding, we can use the above code with a slight but important modification: define the target library with the outencoding='UTF-8' option. This results in target data that is not only expanded lengthwise but properly encoded as well. Then we run this modified code in the old SBCS environment before moving/migrating our data sets to the new MBCS environment:

libname inlib cvp 'c:\source_folder';
libname outlib 'c:\utf8_target_folder' outencoding='UTF-8';
proc copy in=inlib out=outlib noclone;
   select dataset1 dataset2;
run;

Again, if you need to expand character variable lengths for the whole library, then you can use the same PROC COPY without the SELECT statement:

proc copy in=inlib out=outlib noclone;
run;

After that we can safely move our expanded, UTF-8-encoded data to the new UTF-8 environment.

Code notes

  • The code above creates a different version of your original data sets with the desired encoding and character variable lengths expanded by 50% (the default). As shown below, this default behavior can be changed by using the CVPBYTES= or CVPMULTIPLIER= options, which explicitly define the expansion rate.
  • It is important to note that the CVP option is specified on the input library because the CVP engine is a read-only engine and thus is available for input (read) processing only.
  • For the output library you specify your desired encoding option, in this case outencoding='UTF-8'.
  • The noclone option specifies not to copy data set attributes. This is needed to make sure the attributes are recreated rather than duplicated.
  • If you want to migrate your data sets by using PROC MIGRATE, first expand the column lengths by using PROC COPY as shown above, because the CVP engine is not currently supported with PROC MIGRATE.
  • The CVP engine supports only SAS data files (no SAS views, catalogs, item stores, and so on).

CVP Engine options

There are several options available with the CVP Engine. Here are the most widely used:

CVPBYTES=bytes - specifies the number of bytes by which to expand character variable lengths. The lengths of character variables are increased by adding the specified bytes value to the current length.

Example: libname inlib 'SAS data-library' cvpbytes=5;

The CVPBYTES= option implicitly specifies the CVP engine; that is, if you specify the CVPBYTES= option, you don't have to specify the CVP engine explicitly because SAS uses it automatically.

CVPMULTIPLIER=multiplier - specifies a multiplier value that expands character variable lengths. The lengths of character variables are increased by multiplying the current length by the specified multiplier value. You can specify a multiplier value from 1 to 5, or you can specify 0 and the CVP engine determines the multiplier automatically.

Example: libname inlib 'SAS data-library' cvpmultiplier=2.5;

The CVPMULTIPLIER= option also implicitly specifies the CVP engine; that is, if you specify the CVPMULTIPLIER= option, you don't have to specify the CVP engine explicitly because SAS uses it automatically.

Note:

  • You cannot specify both the CVPMULTIPLIER= option and the CVPBYTES= option. Specify only one of these options.
  • If you explicitly assign the CVP engine but do not specify either CVPBYTES= or CVPMULTIPLIER=, then SAS defaults to using CVPMULTIPLIER=1.5 to increase the lengths of the character variables.
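The arithmetic of the two options can be sketched in Python (the exact rounding rule the engine applies to fractional lengths is an assumption here, not documented behavior):

```python
import math

def cvp_length(length, cvpbytes=None, cvpmultiplier=None):
    """Sketch of how CVPBYTES= / CVPMULTIPLIER= expand a character
    variable's length; fractional results are assumed to round up."""
    if cvpbytes is not None and cvpmultiplier is not None:
        raise ValueError("specify only one of the two options")
    if cvpbytes is not None:
        return length + cvpbytes
    mult = 1.5 if cvpmultiplier is None else cvpmultiplier  # default multiplier
    return math.ceil(length * mult)

print(cvp_length(10, cvpbytes=5))         # 10 + 5 = 15
print(cvp_length(10))                     # default 1.5 multiplier -> 15
print(cvp_length(10, cvpmultiplier=2.5))  # 10 * 2.5 = 25
```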

Additional Resources

Your thoughts?

Have you found this blog post useful? Please share your use cases, thoughts and feedback in the comments section below.

Expanding lengths of all character variables in SAS data sets was published on SAS Users.

Aug 12, 2020
 

A previous article about standardizing data in groups shows how to simulate data from two groups. One sample (with n1=20 observations) is simulated from an N(15, 5) distribution whereas a second (with n2=30 observations) is simulated from an N(16, 5) distribution. Although I didn't mention it, due to random sampling variation the sample means of the two groups are close to each other: 14.9 and 15.15. In fact, if you run a t test on these data, the t test concludes that the samples probably came from populations that have the same mean parameter. More formally, the t test fails to reject the null hypothesis that the two group means are the same.

The power of a hypothesis test is the probability of rejecting the null hypothesis when the alternative is true; the probability of an incorrect inference like the one above (a Type 2 error) is 1 minus the power. This article uses PROC POWER to compute the exact power of a t test and then uses a simulation to estimate the power. Along the way, I show a surprising feature of SAS/IML functions.

Power and Type 2 errors

Let's run the t test on the simulated data. The following SAS DATA step is the same as in the previous article. It simulates the data and calls PROC TTEST to analyze the data. (In this article, all tests are performed at the α=0.05 significance level.) The null hypothesis is that the group means are equal; the alternative hypothesis is that they are not equal. The data are simulated from distributions that have the same variance, so you can use the pooled t test to analyze the data:

/* Samples from N(15,5) and N(16, 5). */
%let mu1 = 15;
%let mu2 = 16;
%let sigma = 5;
data TwoSample;
call streaminit(54321);
n1 = 20; n2 = 30;
Group = 1;
do i = 1 to n1;
   x = round(rand('Normal', &mu1, &sigma), 0.01);  output;
end;
Group = 2;
do i = 1 to n2;
   x = round(rand('Normal', &mu2, &sigma), 0.01);  output;
end;
keep Group x;
run;
 
proc ttest data=TwoSample plots=none;
   class Group;
   var x;
run;

I blurred the rows of the tables that correspond to the Satterthwaite test so that you can focus on the pooled t test. The pooled test uses the difference between the sample means (-0.25) and the pooled standard deviation (4.86) to construct a t test for the difference of means. The p-value for the two-sided test is 0.8613, which means that the data do not provide enough evidence to reject the null hypothesis.

Because we simulated the data, we know that, in fact, the difference between the population means is not zero. In other words, the t test failed to reject the null hypothesis even though the alternative hypothesis (unequal means) is true. This is called a "Type 2" error. The power of a statistical hypothesis test is the probability of rejecting the null hypothesis when the alternative hypothesis is true, so the probability of a Type 2 error is 1 minus the power. Typically, the power depends on the sample size and the magnitude of the effect you are trying to detect.

An exact computation of power

A two-sample t test is one of the tests that is supported by PROC POWER in SAS. You can use PROC POWER to compute the power of the test for random samples of size n1=20 and n2=30 when the true difference between the means is 1.

/* compute exact power */
proc power;
  twosamplemeans  power = .           /* missing ==> "compute this" */
    meandiff= 1                       /* difference of means        */
    stddev=5                          /* N(mu1-mu2, 5)              */
    groupns=(20 30);                  /* num obs in the samples     */
run;

The tables tell you that (with these parameter values) the two-sided t test rejects the null hypothesis for only about 10% of random samples (power=0.104) at the α=0.05 significance level. Equivalently, the test commits a Type 2 error about 90% of the time. So the simulated sample that failed to reject the null hypothesis was not unlucky at all: that outcome occurs for about 9 out of 10 random samples drawn from these distributions.
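The SAS/IML simulation appears in the next section, but the same Monte Carlo estimate can be sketched in stdlib-only Python (the critical value quantile('t', 0.975, 48) ≈ 2.0106 is hardcoded here to avoid external libraries); the estimate should land near the 0.104 that PROC POWER reports:

```python
import random

def mean_var(v):
    """Sample mean and (n-1)-denominator sample variance."""
    n = len(v)
    m = sum(v) / n
    return m, sum((x - m) ** 2 for x in v) / (n - 1)

def pooled_t_rejects(x, y, t_crit):
    """True if the pooled two-sample t test rejects H0: mu_x = mu_y."""
    n1, n2 = len(x), len(y)
    mx, vx = mean_var(x)
    my, vy = mean_var(y)
    pool_var = ((n1 - 1) * vx + (n2 - 1) * vy) / (n1 + n2 - 2)
    t = (mx - my) / (pool_var * (1 / n1 + 1 / n2)) ** 0.5
    return abs(t) > t_crit

random.seed(321)
T_CRIT = 2.0106   # quantile('t', 0.975, 48), hardcoded to stay stdlib-only
B = 20_000        # number of simulated samples
rejects = sum(
    pooled_t_rejects([random.gauss(15, 5) for _ in range(20)],
                     [random.gauss(16, 5) for _ in range(30)],
                     T_CRIT)
    for _ in range(B)
)
print(rejects / B)   # power estimate; should be near 0.10
```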

A simulation approach to estimate power in SAS/IML

I previously wrote about how to use simulation to estimate the power of a t test. That article shows how to use the DATA step to simulate many independent samples and then uses BY-group processing in PROC TTEST to analyze each sample. You use the ODS OUTPUT statement to capture the statistics for each analysis. For some samples, the test rejects the null hypothesis; for others, it does not. The proportion of samples that reject the null hypothesis is an estimate of the power of the test.

If you can write a SAS/IML function that implements a t test, you can perform the same power computation in PROC IML. The following SAS/IML program (based on Chapter 5 of Wicklin (2013) Simulating Data with SAS) defines a function that computes the t statistic for a difference of means. The function returns a binary value that indicates whether the test rejects the null hypothesis. To test the function, I call it on the same simulated data:

proc iml;
/* SAS/IML function for a pooled t test.
   X : column vector with n1 elements from a population with mean mu_x. 
   Y : column vector with n2 elements from a population with mean mu_y.
   Compute the pooled t statistic for the difference of means and return whether 
   the statistic rejects the null hypothesis 
      H0: mu_x = mu_y
   at the alpha signif level. Formulas from the SAS documentation: https://bit.ly/30rr2GV
*/
start TTestH0(x, y, alpha=0.05);
   n1 = nrow(X);    n2 = nrow(Y);             /* sample sizes */
   meanX = mean(x); varX = var(x);            /* mean & var of each sample */
   meanY = mean(y); varY = var(y);
   DF = n1 + n2 - 2;
   /* pooled variance: https://blogs.sas.com/content/iml/2020/06/29/pooled-variance.html */
   poolVar = ((n1-1)*varX + (n2-1)*varY) / DF;
   /* Compute test statistic, critical value, and whether we reject H0 */
   t = (meanX - meanY) / sqrt(poolVar*(1/n1 + 1/n2));
   *print (meanX - meanY)[L="Diff (1-2)"] (sqrt(poolVar))[L="PoolSD"] t;
   t_crit = quantile('t', 1-alpha/2, DF); 
   RejectH0 = (abs(t) > t_crit);              /* binary 0|1 for 2-sided test */
   return RejectH0;                           
finish;
 
use TwoSample;
read all var 'x' into x where(group=1);
read all var 'x' into y where(group=2);
close;
 
reject = TTestH0(x, y);
print reject;

To confirm that the computations are the same as in PROC TTEST, I uncommented the PRINT statement in the module; the statistics match the PROC TTEST output. The test fails to reject the null hypothesis for these data.

Notice that the SAS/IML function is both concise (less than 10 lines) and mimics the formulas that you see in textbooks or documentation. This "natural syntax" is one of the advantages of the SAS/IML language.

Simulation to estimate power in PROC IML

The SAS/IML function looks like a typical function for a univariate analysis, but it contains a surprise. Without making any modifications, you can call the function to run thousands of t tests. The module is written in a vectorized manner. If you send in matrices that have B columns, the module will run a t test for each column and return a binary row vector that contains B values.

Take a close look at the SAS/IML function. The MEAN and VAR functions operate on columns of a matrix and return row vectors. Consequently, the meanX, meanY, varX, and varY variables are row vectors. Therefore, the poolVar and t variables are also row vectors, and the RejectH0 variable is a binary row vector.

To demonstrate that the function can analyze many samples in a single call, the following statements simulate 100,000 random samples from the N(15,5) and N(16,5) distributions. Each column of the X and Y matrices is a sample. One call to the TTestH0 function computes 100,000 t tests. The "colon" subscript reduction operator computes the mean of the vector, which is the proportion of tests that reject the null hypothesis. As mentioned earlier, this is an estimate of the power of the test.

/* simulation of power */
B = 1e5;                  /* number of independent samples (number of cols) */
n1 = 20; n2 = 30;         /* sample sizes (number of rows) */
call randseed(321);
x = j(n1, B);             /* allocate array for samples from Group 1 */
y = j(n2, B);             /* allocate array for samples from Group 2 */
call randgen(x, 'Normal', &mu1, &sigma);   /* X ~ N(mu1,sigma) */
call randgen(y, 'Normal', &mu2, &sigma);   /* Y ~ N(mu2,sigma) */
reject = TTestH0(x, y);   /* result is row vector with B elements */
 
power = reject[:];
print power[L="Power (B=1E5)"];

The result is very close to the exact power as computed by PROC POWER.

Notice also that the SAS/IML function, when properly written, can test thousands of samples as easily as it tests one sample. The program does not require an explicit loop over the samples.

Summary

This article starts with a simulated sample that contains two groups. Although the groups are simulated from distributions that have different means, due to random sampling variation a t test fails to reject the null hypothesis that the means are equal. This is a Type 2 error. For random samples from the specified distributions, PROC POWER shows that the power of the t test is only about 0.1, so the test commits a Type 2 error about 90% of the time. You can estimate the power by simulating many samples and computing the proportion for which the test rejects the null hypothesis. The simulation agrees with the exact power.

I previously wrote a simulation of power by using the DATA step and BY-group processing in PROC TTEST. But in this article, I implement a t test by writing a SAS/IML function. You might be surprised to discover that you can pass many samples to the function, which runs a test on each column of the input matrices.

It is worth mentioning that you can repeat this computation for many parameter values to obtain a power curve. For example, you can vary the difference between the means to determine how the power depends on the magnitude of the difference between the means.

The post Use simulation to estimate the power of a statistical test appeared first on The DO Loop.

Aug 10, 2020
 

The most fundamental concept that students learning introductory SAS programming must master is how SAS handles data. This might seem like an obvious statement, but it is often overlooked by students in their rush to produce code that works. I often tell my class to step back for a moment and "try to think like SAS" before they even touch the keyboard. There are many key topics that students must understand in order to be successful SAS programmers. How does SAS compile and execute a program? What is the built-in loop that SAS uses to process data observation by observation? What are the coding differences when working with numeric and character data? How does SAS handle missing observations?

One concept that is a common source of confusion for students is how to tell SAS to treat rows versus columns. An example that we use in class is how to write a program to calculate a basic descriptive statistic, such as the mean. The approach that we discuss is to identify our goal, rows or columns, and then decide what SAS programming statements are appropriate by thinking like SAS. First, we decide if we want to calculate the mean of an observation (a row) or the mean of a variable (a column). We also pause to consider other issues such as the type of variable, in this case numeric, and how SAS evaluates missing data. Once these concepts are understood we can proceed with an appropriate method: using DATA step programming, a procedure such as MEANS, TABULATE, REPORT or SQL, and so on. For more detailed information about this example there is an excellent user group paper on this topic called "Many Means to a Mean" written by Shannon Pileggi for the Western Users of SAS Software conference in 2017. In addition, The Little SAS® Book and its companion book, Exercises and Projects for the Little SAS® Book, Sixth Edition address these types of topics in easy-to-understand examples followed up with thought-provoking exercises.

Here is an example of the type of question that our book of exercises and projects uses to address this type of concept.

Short answer question

  1. Is there a difference between calculating the mean of three variables X1, X2, and X3 using the three methods as shown in the following examples of code? Explain your answer.
    Avg1 = MEAN(X1,X2,X3);
    Avg2 = (X1 + X2 + X3) / 3;
    PROC MEANS; VAR X1 X2 X3; RUN;

Solution

In the book, we provide solutions for odd-numbered multiple choice and short answer questions, and hints for the programming exercises. Here is the solution for this question:

  1. The variable Avg1, which uses the MEAN function, returns the mean of the nonmissing arguments; it provides a mean of X1, X2, and X3 for each observation (row) in the data set. The variable Avg2, which uses an arithmetic expression, also calculates the mean for each observation (row), but it returns a missing value if any of the variables for that observation have a missing value. PROC MEANS calculates the mean of the nonmissing data for each variable (column) X1, X2, and X3 vertically.
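To make the distinction concrete outside of SAS, here is a short Python sketch of the same missing-value logic (an illustration only, not part of the book's solution; the helper names `mean_fn` and `mean_arith` are hypothetical):

```python
import math

def mean_fn(*args):
    """Like the SAS MEAN function: the average of the nonmissing arguments."""
    vals = [a for a in args if a is not None and not math.isnan(a)]
    return sum(vals) / len(vals) if vals else float("nan")

def mean_arith(x1, x2, x3):
    """Like (X1 + X2 + X3) / 3: any missing value propagates to the result."""
    if any(v is None or math.isnan(v) for v in (x1, x2, x3)):
        return float("nan")
    return (x1 + x2 + x3) / 3

# One observation (row) in which X3 is missing:
print(mean_fn(2.0, 4.0, float("nan")))     # 3.0 -- mean of the nonmissing values
print(mean_arith(2.0, 4.0, float("nan")))  # nan -- the missing value propagates

# PROC MEANS works down the columns instead: one mean per variable,
# computed from the nonmissing values of that variable.
rows = [(1.0, 2.0, 3.0), (5.0, float("nan"), 7.0)]
col_means = [mean_fn(*col) for col in zip(*rows)]
print(col_means)  # [3.0, 2.0, 5.0]
```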

For more information about The Little SAS Book and its companion book of exercises and projects, check out these blogs:

Learning to think like SAS was published on SAS Users.

August 10, 2020
 

A common operation in statistical data analysis is to center and scale a numerical variable. This operation is conceptually easy: you subtract the mean of the variable and divide by the variable's standard deviation. Recently, I wanted to perform a slight variation of the usual standardization:

  • Perform a different standardization for each level of a grouping variable.
  • Instead of using the sample mean and sample standard deviation, I wanted to center and scale the data by using values that I specify.

Although you can write SAS DATA step code to center and scale the data, PROC STDIZE in SAS contains options that handle both these cases.

Simulate data from two groups

Let's start by creating some example data. The following DATA step simulates samples drawn from two groups. The distribution of the first group is N(mu1, sigma) and the distribution of the second group is N(mu2, sigma). The first sample size is n1=20; the second is n2=30. Notice that the data are ordered by the Group variable.

%let mu1 = 15;
%let mu2 = 16;
%let sigma = 5;
/* Create two samples N(mu1, sigma) and N(mu2, sigma), with samples sizes n1 and n2, respectively */
data TwoSample;
call streaminit(54321);
n1 = 20; n2 = 30;
Group = 1;
do i = 1 to n1;
   x = round(rand('Normal', &mu1, &sigma), 0.01);  output;
end;
Group = 2;
do i = 1 to n2;
   x = round(rand('Normal', &mu2, &sigma), 0.01);  output;
end;
keep Group x;
run;
 
ods graphics / width=500px height=360px;
title "Normal Data";
title2 "N(&mu1, &sigma) and N(&mu2, &sigma)";
proc sgpanel data=TwoSample noautolegend;
   panelby Group / columns=1;
   histogram x;
   density x / type=normal;
   colaxis grid;
run;

The graph shows a comparative histogram of the distribution of each group and overlays the density of the normal curve that best fits the data. You can run PROC MEANS to calculate that the estimates for the first group are mean=14.9 and StdDev=5.64. For the second group, the estimates are mean=15.15 and StdDev=4.27.

The "standard" standardization

Typically, you standardize data by using the sample mean and the sample standard deviation. You can do this by using PROC STDIZE and specifying the METHOD=STD option (which is the default method). You can use the BY statement to apply the standardization separately to each group. The following statements standardize the data in each group by using the sample statistics:

/* use PROC STDIZE to standardize by the sample means and sample std dev */
proc stdize data=TwoSample out=ScaleSample method=std;
   by Group;
   var x;
run;
 
proc means data=ScaleSample Mean StdDev Min Max ndec=2; 
   class Group;
   var x;
run;

As expected, each sample now has mean=0 and unit standard deviation. Many multivariate analyses center and scale data in order to adjust for the fact that different variables are measured on different scales. PROC STDIZE has many other options for standardizing data.
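For readers who want to see the arithmetic spelled out, here is a minimal Python sketch of the same computation: center and scale each group by its own sample mean and sample standard deviation. The helper `standardize_by_group` is hypothetical, not a SAS-supplied function:

```python
from statistics import mean, stdev

def standardize_by_group(data):
    """data: list of (group, x) pairs. Center and scale x within each group
    by that group's sample mean and sample standard deviation."""
    groups = {}
    for g, x in data:
        groups.setdefault(g, []).append(x)
    stats = {g: (mean(xs), stdev(xs)) for g, xs in groups.items()}
    return [(g, (x - stats[g][0]) / stats[g][1]) for g, x in data]

data = [(1, 10.0), (1, 20.0), (1, 30.0), (2, 5.0), (2, 15.0)]
std = standardize_by_group(data)
# Within each group, the standardized values have sample mean 0 and
# sample standard deviation 1.
print([round(v, 4) for g, v in std if g == 1])  # [-1.0, 0.0, 1.0]
```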

Center and scale by using the DATA step

A second way to standardize the data is to use the DATA step to center and scale each variable and each group. You can overwrite the contents of each column, or (as I've done below), you can create a new variable that contains the standardized values. You need to specify values for the location and scale parameters. For these simulated data, I know the population values of the location and scale parameters. The following DATA step uses the parameters to center and scale the data:

/* because we know the population parameters, we can use them to center and scale the data */
/* DATA step method */
data ScaleData;
array mu[2] (&mu1 &mu2);    /* put parameter values into an array */
set TwoSample;
y = (x-mu[Group]) / &sigma;
run;
 
proc means data=ScaleData Mean StdDev Min Max ndec=2;
   class Group;
   var y;
run;

Because I use the population parameters instead of the sample estimates, the transformed variable does not have exactly zero mean or unit variance.
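You can see the same effect in a short Python sketch (a hypothetical illustration, not part of the SAS program): scaling by the known population parameters leaves the sample mean and standard deviation close to, but not exactly, 0 and 1, whereas scaling by the sample estimates hits 0 and 1 exactly:

```python
import random
from statistics import mean, stdev

random.seed(1)
mu, sigma = 15.0, 5.0                            # known population parameters
x = [random.gauss(mu, sigma) for _ in range(30)]

y = [(xi - mu) / sigma for xi in x]              # scale by population values
z = [(xi - mean(x)) / stdev(x) for xi in x]      # scale by sample estimates

print(round(mean(y), 3), round(stdev(y), 3))     # close to, but not exactly, 0 and 1
print(round(mean(z), 3), round(stdev(z), 3))     # essentially 0 and 1
```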

Wide form: Use the LOCATION and SCALE statements in PROC STDIZE

You can perform the same calculations by using PROC STDIZE. If you look up the documentation for the METHOD= option on the PROC STDIZE statement, you will see that the procedure supports the METHOD=IN(ds) method, which enables you to read parameters from a data set. I will focus on the LOCATION and SCALE parameters; see the documentation for other options.

You can specify the parameters in "wide form" or in "long form". In wide form, each column specifies a parameter and you can include multiple rows if you want to perform a BY-group analysis. You can use the LOCATION and SCALE statements to specify the name of the variables that contain the parameters. The following DATA step creates a data set that has three variables: Group, Mu, and Sigma. Each row specifies the location and scale parameter for centering and scaling data in the levels of the Group variable.

data ScaleWide;
Group=1; Mu=&mu1; Sigma=&sigma; output;
Group=2; Mu=&mu2; Sigma=&sigma; output;
run;
 
proc stdize data=TwoSample out=StdWide method=in(ScaleWide);
   by Group;
   var x;
   location mu;    /* variable in the METHOD=IN data set that contains the location parameters */
   scale    sigma; /* variable in the METHOD=IN data set that contains the scale parameters */
run;
 
proc means data=StdWide Mean StdDev Min Max ndec=2; 
   class Group;
   var x;
run;

The output is identical to the output from the previous section and is not shown.

Long form: Use the _TYPE_ variable

You can also specify the location and scale parameters in long form. For a long-form specification, you need to create a data set that contains a variable named _TYPE_. The parameters for centering and scaling are specified in rows where _TYPE_="LOCATION" and _TYPE_="SCALE", respectively. You can also include one or more BY-group variables (if you are using the BY statement) and one or more columns whose names are the same as the variables you intend to transform.

For example, the example data in this article has a BY-group variable named Group and an analysis variable named X. The following DATA step specifies the values to use to center (_TYPE_='LOCATION') and scale (_TYPE_='SCALE') the X variable for each group:

data ScaleParam;
length _TYPE_ $14;
input Group _TYPE_ x;
datalines;
1 LOCATION 15
1 SCALE     5
2 LOCATION 16
2 SCALE     5
;
 
proc stdize data=TwoSample out=Scale2 method=in(ScaleParam);
   by Group;
   var x;
run;
 
proc means data=Scale2; 
   class Group;
   var x;
run;

Again, the result (not shown) is the same as when the DATA step is used to center and scale the data.

Summary

This article gives examples of using PROC STDIZE in SAS/STAT to center and scale data. Most analysts want to standardize data in groups by using the sample means and sample standard deviations of each group. This is the default behavior of PROC STDIZE. However, you can also center and scale data by using values that are different from the group statistics. For example, you might want to use a grand mean and a pooled standard deviation for the groups. Or, perhaps you are using simulated data and you know the population mean and variance of each group. Regardless, there are three ways to center and scale the data when you know the location and scale parameters:

  • Manually: You can use the DATA step, PROC IML, or PROC SQL to write a formula that centers and scales the data.
  • Wide Form: PROC STDIZE supports the LOCATION and SCALE statements. If you use the METHOD=IN(ds) option on the PROC STDIZE statement, you can read the parameters from a data set.
  • Long Form: If the input data set contains a variable named _TYPE_, PROC STDIZE will use the parameter values where _TYPE_='LOCATION' to center the data and where _TYPE_='SCALE' to scale the data.

The post 4 ways to standardize data in SAS appeared first on The DO Loop.