Statistical Programming

November 27, 2017
 

When you run an optimization, it is often not clear how to provide the optimization algorithm with an initial guess for the parameters. A good guess converges quickly to the optimal solution whereas a bad guess might diverge or require many iterations to converge. Many people use a default value (such as 1) for all initial guesses. I have previously written about other ways to obtain an initial guess for an optimization.

In the special case of maximum likelihood estimation, you use a trick to provide a good guess to the optimization algorithm. For maximum likelihood estimation, the objective function is the log-likelihood function of a distribution. Consequently, you can use the method of moments to provide the initial guess for the parameters, which often results in fast convergence to the maximum likelihood estimates. The faster convergence can save you considerable time when you are analyzing thousands of data sets, such as during a simulation study.

The method of moments

Suppose you believe that some data can be modeled by a certain distribution such as a lognormal, Weibull, or beta distribution. A common way to estimate the distribution parameters is to compute the maximum likelihood estimates (MLEs). These are the values of the parameters that are "most likely" to have generated the observed data. To compute the MLEs you first construct the log-likelihood function and then use numerical optimization to find the parameter values that maximize the log-likelihood function.

The method of moments is an alternative way to fit a model to data. For a k-parameter distribution, you write the equations that give the first k moments (mean, variance, skewness, ...) of the distribution in terms of the parameters. You then replace the distribution's moments with the sample mean, variance, and so forth. You invert the equations to solve for the parameters in terms of the observed moments.

For example, suppose you believe that some data are distributed according to a two-parameter beta distribution. If you look up the beta distribution on Wikipedia, you will see that the mean and variance of a Beta(a, b) distribution can be expressed in terms of the two parameters as follows:

Mean and variance of the beta distribution: E[X] = a / (a + b) and Var[X] = a b / [ (a + b)² (a + b + 1) ]

The first equation represents the expected (mean) value of a beta-distributed random variable X. The second equation represents the variance. If the sample mean of the data is m and the sample variance is v, you can "plug in" these values for the left-hand side of the equations and solve for the values of a and b that satisfy the equations. Solving the first equation for a yields a = b m / (1 - m). If you substitute that expression into the second equation and solve for b, you get b = m - 1 + (m/v)(1 - m)². Those equations give the parameter estimates from the method of moments.

Maximum likelihood estimation: Using an arbitrary guess

The method of moments can lead to improved convergence when fitting a distribution. For comparison, first consider what happens if you use an arbitrary initial guess for the optimization, such as a = b = 1. The following statements use PROC NLMIXED to compute maximum likelihood estimates for the parameters of the beta distribution. (You can read about how to use PROC NLMIXED to construct MLEs. You can also compute MLEs by using SAS/IML.)

data Beta;
input x @@;
datalines;
0.58 0.23 0.41 0.69 0.64 0.12 0.07 0.19 
0.47 0.05 0.76 0.18 0.28 0.06 0.34 0.52 
0.42 0.48 0.11 0.39 0.24 0.25 0.22 0.05 
0.23 0.49 0.45 0.29 0.18 0.33 0.09 0.16 
0.45 0.27 0.31 0.41 0.30 0.19 0.27 0.57 
;
 
/* Compute parameter estimates for the beta distribution */
proc nlmixed data=Beta;
   parms a=1 b=1;                /* use arbitrary initial guesses */
   bounds a b > 0;
   LL = logpdf("beta", x, a, b); /* log-likelihood function */
   model x ~ general(LL);
run;
Solution to fitting a beta distribution in SAS

The MLEs are a = 1.8487 and b = 3.9552. The "Iteration History" table (not shown) shows that 8 iterations of a quasi-Newton algorithm were required to converge to the MLEs, and the objective function (the LOGPDF function) was called 27 times during the optimization.

The following graph shows a histogram of the data overlaid with a Beta(a=1.85, b=3.96) density curve. The curve fits the data well.

Histogram of data overlaid with a beta density curve, fitted by maximum likelihood estimation

Maximum likelihood estimation: Using method-of-moments for guesses

If you use the method-of-moments to provide an initial guess on the PARMS statement, the quasi-Newton algorithm might converge in fewer iterations. The following statements use PROC MEANS to compute the sample mean and variance, then use the DATA step to compute the method-of-moments estimates:

/* output mean and variance */
proc means data=Beta noprint;
   var x;
   output out=Moments mean=m var=v;  /* save sample moments */
run;
 
/* use method of moments to estimate parameters for the beta distribution */
data ParamEst;
   retain a b;               /* set order of output variables */
   set Moments;
   b = m*(1-m)**2/v + m - 1; /* estimate parameters by using sample moments */
   a = b*m / (1-m);
   keep a b;
run;

The method of moments estimates are a = 1.75 and b = 3.75, which are close to the MLE values. The DATA= option in the PARMS statement in PROC NLMIXED enables you to specify a data set that contains starting values for the parameters. The following call to PROC NLMIXED is exactly the same as the previous call except that the procedure uses the initial parameter values from the ParamEst data set:

proc nlmixed data=Beta;
   parms / data=ParamEst;        /* use initial guesses from method of moments */
   bounds a b > 0;
   LL = logpdf("beta", x, a, b); /* log-likelihood function */
   model x ~ general(LL);
run;
Iteration history for MLE for beta distribution. Initial guesses are the method-of-moments estimates.

The MLEs from this call (not shown) are exactly the same as before. However, the "Iteration History" table indicates that the quasi-Newton algorithm converged in 5 iterations and required only 14 calls to the objective function. This compares with 8 iterations and 27 calls for the previous computation. Thus the method of moments resulted in about half as many function evaluations. In my experience this example is typical: Choosing initial parameters by using the method of moments often reduces the computational time that is required to obtain the MLEs.

It is worth mentioning that the PARMS option in PROC NLMIXED supports two nice features:

  • The DATA= option accepts both wide and long data sets. The example here is wide: there are k variables that supply the values of the k parameters. You can also specify a two-variable data set with the names "Parameter" and "Estimate," as sketched after this list.
  • The BYDATA option enables you to specify initial guesses for each BY group. In simulation studies, you might use BY-group processing to compute the MLEs for thousands of samples. The BYDATA option is essential to improving the efficiency of a large simulation study. It can also help ensure that the optimization algorithm converges when the parameters for the samples range over a large set of values.
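For example, a long-form version of the initial values for this problem might look like the following sketch (my own illustration of the two-variable layout described in the first bullet; the values are the method-of-moments estimates computed earlier):

data ParamEstLong;
   length Parameter $32;
   input Parameter $ Estimate;
   datalines;
a 1.75
b 3.75
;
 
proc nlmixed data=Beta;
   parms / data=ParamEstLong;    /* long form: Parameter and Estimate variables */
   bounds a b > 0;
   LL = logpdf("beta", x, a, b); /* log-likelihood function */
   model x ~ general(LL);
run;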

Summary and cautions

In summary, this article shows how to use the method of moments to provide a guess for optimizing a log-likelihood function, which often reduces the time required to obtain the MLEs. The method requires that you can write the equations that relate the moments of a distribution (mean, variance, ...) to the parameters for the distribution. (For a distribution that has k parameters, use the equations for the first k moments.) You then "plug in" the sample moments and algebraically invert the equations to obtain the parameter values in terms of the sample moments. This is not always possible analytically; you might need to use numerical methods for part of the computation.

A word of caution. For small samples, the method of moments might produce estimates that do not make sense. For example, the method might produce a negative estimate for a parameter that must be positive. In these cases, the method might be useless, although you could attempt to project the invalid parameters into the feasible region.

The post The method of moments: A smart way to choose initial parameters for MLE appeared first on The DO Loop.

November 22, 2017
 

A statistical programmer read my article about the beta-binomial distribution and wanted to know how to compute the cumulative distribution function (CDF) and the quantile function for this distribution. In general, if you know the PDF for a discrete distribution, you can also compute the CDF and quantile functions. This article shows how.

Continuous versus discrete distributions

Recall the four essential functions for any probability distribution: the probability density function (PDF), the cumulative distribution function (CDF), the quantile function, and the random-variate generator. SAS software provides the PDF, CDF, QUANTILE, and RAND functions, which support about 25 common families of distributions. However, there are infinitely many probability distributions. What can you do if you need a distribution that is not natively supported in SAS? The answer is that you can use tools in SAS (especially in SAS/IML software) to implement these functions yourself.

I have previously shown how to use numerical integration and root-finding algorithms to compute the CDF and quantile function for continuous distributions such as the folded normal distribution and the generalized Gaussian distribution. For discrete distributions, you can use a summation to obtain the CDF from the PDF. The CDF at X=x is the sum of the PDF values for all values of X that are less than or equal to x. The discrete CDF is a step function, so it does not have an inverse function. Given a probability, p, the quantile for p is defined as the smallest value of x for which CDF(x) ≥ p.

To simplify the discussion, assume that the random variable X takes values on the integers 0, 1, 2, .... Many discrete probability distributions in statistics satisfy this condition, including the binomial, Poisson, geometric, beta-binomial, and more.

Compute a CDF from a PDF

I will use SAS/IML software to implement the CDF and quantile functions. The following SAS/IML modules define the PDF function and the CDF function. If you pass in a scalar value for x, the CDF function computes the PDF from X=0 to X=x and sums up the individual probabilities. When x is a vector, the implementation is slightly more complicated, but the idea is the same.

proc iml;
/* PMF function for the beta-binomial distribution. See
   https://blogs.sas.com/content/iml/2017/11/20/simulate-beta-binomial-sas.html */
start PDFBetaBinom(x, nTrials, a, b);
   logPMF = lcomb(nTrials, x) + logbeta(x + a, nTrials - x + b) - logbeta(a, b);
   return ( exp(logPMF) );
finish;
 
/* CDF function for the beta-binomial distribution */
start CDFBetaBinom(x, nTrials, a, b);
   y = 0:max(x);                               /* all values of X less than or equal to max(x) */
   PMF = PDFBetaBinom(y, NTrials, a, b);       /* compute PMF(t) for t=0..max(x) */
   cumPDF = cusum(PMF);                        /* cdf(T) = sum( PMF(t), t=0..T ) */
   /* The input x can be a vector of values in any order.
      Use the LOC function to assign CDF(x) for each value of x */
   CDF = j(nrow(x), ncol(x), .);               /* allocate result (same size as x) */
   u = unique(x);                              
   do i = 1 to ncol(u);                        /* for each unique value of x... */
      idx = loc(x=u[i]);                       /*    find the indices */
      CDF[idx] = cumPDF[1+u[i]];               /*    assign the CDF   */
   end;
   return CDF;
finish;
 
/* test the CDF function */
a = 6; b = 4; nTrials = 10;
t = 0:nTrials;                              /* evaluate the CDF on the full support */
c = CDFBetaBinom(t, nTrials, a, b);

Notice the computational trick that is used to compute the CDF for a vector of values. If you want to compute the CDF at two or more values (say X=5 and X=7), you can compute the PDF for X=0, 1, ... M where M is the maximum value that you need (such as 7). Along the way, you end up computing the CDF for all smaller values.

You can write the CDF values to a SAS data set and graph them to visualize the CDF of the beta-binomial distribution, as shown below:

Beta-binomial cumulative distribution
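The following statements sketch one way to create such a graph. They continue the PROC IML session above and use a SUBMIT/ENDSUBMIT block to call PROC SGPLOT; the data set name (BBCdf) and the plotting statements are my own choices, not necessarily the ones that produced the graph.

create BBCdf var {"t" "c"};      /* write the support and the CDF values */
append;
close BBCdf;
 
submit;                          /* call PROC SGPLOT from inside PROC IML */
   title "CDF of the Beta-Binomial Distribution";
   proc sgplot data=BBCdf;
      vbarparm category=t response=c;
      xaxis label="x";
      yaxis label="CDF(x)" grid;
   run;
endsubmit;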

Compute quantiles from the CDF

Given a probability value p, the quantile for p is the smallest value of x for which CDF(x) ≥ p. Sometimes you can find a formula that enables you to find the quantile directly, but, in general, the definition of a quantile requires that you compute the CDF for X=0, 1, 2,... and so forth until the CDF equals or exceeds p.

The support of the beta-binomial distribution is the set {0, 1, ..., nTrials}. The following function first computes the CDF function for all values. It then looks at each value of p and returns the corresponding value of X that satisfies the quantile definition.

start QuantileBetaBinom(_p, nTrials, a, b);
   p = colvec(_p);                         /* ensure input is column vector */
   y = 0:nTrials;                          /* all possible values of X   */
   CDF = CDFBetaBinom(y, NTrials, a, b);   /* all possible values of CDF */
   q = j(nrow(p), 1, .);                   /* allocate result vector */
   do i = 1 to nrow(p);
      idx = loc( p[i] <= CDF );            /* find all x such that p <= CDF(x) */ 
      q[i] = idx[1] - 1;                   /* smallest index. Subtract 1 from index to get x */
   end;
   return ( q );
finish;
 
p = {0.2, 0.5, 0.8};
q = QuantileBetaBinom(p, nTrials, a, b);
print p q;

The output shows the quantile values for p = {0.2, 0.5, 0.8}. You can find the quantiles visually by using the graph of the CDF function, shown in the previous section. For each value of p on the vertical axis, draw a horizontal line until the line hits a bar. The x value of the bar is the quantile for p.

Notice that the QuantileBetaBinom function uses the fact that the beta-binomial distribution has finite support. The function computes the CDF at the values 0, 1, ..., nTrials. From those values you can "look up" the quantile for any p. If you have a distribution with infinite support (such as the geometric distribution) or very large finite support (such as nTrials=1E9), it is more efficient to apply an iterative process such as bisection to find the quantiles.
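As a sketch of that idea (my own illustration, not code from the original post), the following module uses bisection and the built-in CDF function to compute quantiles of the geometric distribution, whose support is infinite but whose CDF is cheap to evaluate:

/* bisection search for the smallest x such that CDF(x) >= p */
start QuantileGeomBisect(p, prob);
   L = 0;  U = 1;
   do while (cdf("Geometric", U, prob) < p);        /* bracket the quantile */
      U = 2*U;
   end;
   do while (U > L);                                /* bisect the bracket [L, U] */
      m = floor( (L+U)/2 );
      if cdf("Geometric", m, prob) >= p then U = m; /* quantile is <= m */
      else L = m + 1;                               /* quantile is >  m */
   end;
   return ( L );
finish;
 
q = QuantileGeomBisect(0.8, 0.25);   /* compare with quantile("Geometric", 0.8, 0.25) */
print q;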

Summary

In summary, you can compute the CDF and quantile functions for a discrete distribution directly from the PDF. The process was illustrated by using the beta-binomial distribution. The CDF at X=x is the sum of the PDF evaluated for all values less than or equal to x. The quantile for p is the smallest value of x for which CDF(x) ≥ p.

The post Compute the CDF and quantiles of discrete distributions appeared first on The DO Loop.

November 15, 2017
 

Did you know that a SAS/IML function can recover from a run-time error? You can specify how to handle run-time errors by using a programming technique that is similar to the modern "try-catch" technique, although the SAS/IML technique is an older implementation.

Preventing errors versus handling errors

In general, SAS/IML programmers should try to detect potential problems and prevent errors from occurring. For example, before you compute the square root of a number, you should test whether the number is greater than or equal to zero. If not, you can handle the bad input. By testing the input value before you call the SQRT function, you prevent a run-time error.
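For example, the following statements sketch that defensive approach for the SQRT function:

proc iml;
x = -4;                       /* bad input */
if x >= 0 then r = sqrt(x);   /* safe to compute the square root */
else r = .;                   /* handle the bad input (here, return a missing value) */
print r;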

However, sometimes you don't know whether a computation will fail until you attempt it. Other times, the test to determine whether an error will occur is very expensive, and it is cheaper to attempt the computation and handle the error if it occurs.

A canonical example is computing the inverse of a square matrix. It is difficult to test whether a given matrix is numerically singular. You can compute the rank of the matrix or the condition number of the matrix, but both computations are expensive because they rely on the SVD or eigenvalue decomposition. Cheaper methods include computing a determinant or using Gaussian elimination, but these methods can be numerically unstable and are not good indicators of whether a matrix is numerically singular.

Run-time error in SAS/IML modules

To understand how to handle run-time errors in SAS/IML modules, you need to know two facts about the execution queue in SAS/IML: when a run-time error occurs in a module, the module pauses and waits for additional commands; however, if there are statements in the execution queue, those statements are executed immediately.

You can use these facts to handle errors. At the top of the module, use the PUSH statement to add error-handling statements into the execution queue. If an error occurs, the module pauses, sees that there are statements in the queue, and executes those statements. If one of the statements contains a RESUME statement, the module resumes execution.

Push, pause, execute, and resume

Let's apply these ideas to computing the inverse of a square matrix. The following SAS/IML function attempts to compute the inverse of the input matrix. If the matrix is singular, the computation fails and the module handles that error. The main steps of the module are as follows:

  1. Define the statements that will be executed in the event of an error. The last executable statement should be the RESUME statement. (Technical point: The RESUME statement is only valid when the module is paused. Surround the statements with IF-THEN logic to prevent the error-handling statements from running when the module exits normally.)
  2. Push these statements into the execution queue.
  3. If an error occurs, the statements in the queue will be executed. The RESUME statement clears the error and resumes execution of the module. (If no error occurs, the queue is flushed after the RETURN statement.)
proc iml;
/* If the matrix is not invertible, return a missing value. 
   Otherwise, return the inverse.
*/
start InvEx(A); 
   errFlag = 1;           /* set flag. Assume we will get an error :-) */   
   on_error = "if errFlag then do;  AInv = .;  resume; end;";
   call push(on_error);   /* PUSH code that will be executed if an error occurs */
   AInv = inv(A);         /* if error, AInv set to missing and function resumes */
   errFlag = 0;           /* remove flag for normal exit from module */
   return ( AInv );
finish;
 
A1 = {1 1,
      0 1};
B1 = InvEx(A1);
A2 = {1 1,
      1 1};
B2 = InvEx(A2);
 
print B1, B2;

The function is called first with a nonsingular matrix, A1. The INV function does not fail, so the module exits normally. When the module exits, it flushes the execution queue. Because ErrFlag=0, the pushed statements have no effect. B1 is equal to the inverse of A1.

The function is called next with a singular matrix, A2. The INV function encounters an error and the module pauses. When the module pauses, it executes the statements in the queue. Because ErrFlag=1, the statements set AINV to a missing value and resume the module execution. The module exits normally and the B2 value is set to missing.

Summary

In summary, you can implement statements that handle run-time errors in SAS/IML modules. The key is to push error-handling code into a queue. The error-handling statements should include a RESUME statement. If the module pauses due to an error, the statements in the queue are executed and the module resumes execution. Be aware that the queue is flushed when the module exits normally, so use a flag variable and IF-THEN logic to control how the statements behave when there is no error.

The post Catch run-time errors in SAS/IML programs appeared first on The DO Loop.

November 13, 2017
 

Debugging is the bane of every programmer. SAS supports a DATA step debugger, but that debugger can't be used for debugging SAS/IML programs. In lieu of a formal debugger, many SAS/IML programmers resort to inserting multiple PRINT statements into a function definition. However, there is an easier way to query the values of variables inside a SAS/IML function: Use the PAUSE statement to halt the execution of program inside the misbehaving function. You can then interactively submit PRINT statements or any other valid statements. When you are finished debugging, you can submit the RESUME statement and the function will resume execution. (Or you can submit the STOP statement to halt execution and return to the main scope of the program.)

The PAUSE statement

The SAS/IML language is interactive. The PAUSE statement pauses the execution of a SAS/IML module. While the program is paused you can print or assign local variables inside the module. When you are ready for the module to resume execution, you submit the RESUME statement. (The SAS/IML Studio application works slightly differently. The PAUSE statement brings up a dialog box that you can use to submit statements. You press the Resume button to resume execution.)

For example, suppose you write a function called 'Func' whose purpose is to compute the sum of squares of the elements in a numeric vector. While testing the function, you discover that the answer is not correct when you pass in a row vector. You decide to insert a PAUSE statement ("set a breakpoint") near the end of the module, just before the RETURN statement, as follows:

proc iml;
/* a function that is supposed to return the sum of squares of the elements of x */
start Func(x);
   y = x`*x;
 
  /* Debugging tip: Use PAUSE to enter INTERACTIVE mode INSIDE module! */
  pause "Inside 'Func'; submit RESUME to resume computations.";
  /* Execution pauses until you submit RESUME, then continues from the next statement. */
 
   return (y);
finish;
 
w = Func( {1 2 3} );

When you run the program, it prints Inside 'Func'; submit RESUME to resume computations. The program then waits for you to enter commands. The program will remain paused until you submit the RESUME statement (include a semicolon!).

Because the program has paused inside the function, you can query or set the values of local variables. For this function, there are only two local variables, x and y. Highlight the following statement, and press F3 to submit it.

print y;     /* inside module; print local variable */

The output shows that the value of y is a matrix, not a scalar, which indicates that the expression x`*x does not compute the sum of squares for this input vector. You might recall that SAS/IML has a built-in SSQ function, which computes the sum of squares. Highlight the following statement and press F3 to submit it:

y = ssq(x);  /* assign local variable inside module */

This assignment has overwritten the value of the local variable, y. When you submit a RESUME statement, the function will resume execution and return the value of y. The program is now at the main scope, which you can demonstrate by printing the value of w:

resume;      /* resume execution. Return to main scope */
print w;     /* print result at main scope */

To make the change permanent, you must edit the function definition and redefine the module. Be sure to remove the PAUSE statement when you are finished debugging: If you run a program in batch mode that contains a PAUSE statement, the program will forever wait for input!

In conclusion, the PAUSE statement enables you to pause the execution of a program inside a module and interactively query and change the local variables in the module. This can help you to debug a function. After you finish investigating the state of the local variables, you can submit the RESUME statement, which tells the function to continue executing from the line after the PAUSE statement. (Or submit STOP to exit the function.) The PAUSE statement can be a useful alternative to inserting many PRINT statements inside a function during the debugging phase of program development.

The post A tip for debugging SAS/IML modules: The PAUSE statement appeared first on The DO Loop.

October 23, 2017
 

A common question on discussion forums is how to compute a principal component regression in SAS. One reason people give for wanting to run a principal component regression is that the explanatory variables in the model are highly correlated with each other, a condition known as multicollinearity. Although principal component regression (PCR) is a popular technique for dealing with almost collinear data, PCR is not a cure-all. This article shows how to compute a principal component regression in SAS; a subsequent article discusses the problems with PCR and presents alternative techniques.

Multicollinearity in regression

Near collinearity among the explanatory variables in a regression model requires special handling because:

  • The crossproduct matrix X`X is ill-conditioned (nearly singular), where X is the data matrix.
  • The standard errors of the parameter estimates are very large. The variance inflation factor (VIF), which is computed by PROC REG, is one way to measure how collinearities inflate the variances of the parameter estimates.
  • The model parameters are highly correlated, which makes interpretation of the parameters difficult.

Principal component regression keeps only the most important principal components and discards the others. This means that you compute the principal components for the explanatory variables and drop the components that correspond to the smallest eigenvalues of X`X. If you keep k principal components, then those components enable you to form a rank-k approximation to the crossproduct matrix. If you regress the response variable onto those k components, you obtain a PCR. Usually the parameter estimates are expressed in terms of the original variables, rather than in terms of the principal components.

In SAS there are two easy ways to compute principal component regression:

  • The PLS procedure supports the METHOD=PCR option to perform principal component regression. You can use the NFAC= option to determine the number of principal components to keep.
  • The MODEL statement in PROC REG supports the PCOMIT= option. (This option is read as "PC omit.") The argument to the PCOMIT= option is the number of principal components to drop (omit) from the regression.

Notice that neither of these methods calls PROC PRINCOMP. You could call PROC PRINCOMP, but it would be more complicated than the previous methods. You would have to extract the first principal components (PCs), then use PROC REG to compute the regression coefficients for the PCs, then use matrix computations to convert the parameter estimates from the PCs to the original variables.
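For readers who are curious, the following sketch carries out those matrix computations entirely in SAS/IML (rather than by combining PROC PRINCOMP and PROC REG). This is my own illustration, not the recommended approach; it bases the PCA on the correlation matrix and assumes the Fitness data that is defined in the next section:

proc iml;
XNames = {"Age" "Weight" "RunTime" "RunPulse" "RestPulse" "MaxPulse"};
use Fitness;
   read all var XNames into X;
   read all var {"Oxygen"} into y;
close;
 
k = 4;                                 /* number of principal components to keep */
R = corr(X);                           /* PCA of the correlation matrix */
call eigen(lambda, V, R);              /* columns of V are the principal components */
Z = (X - mean(X)) / std(X);            /* standardize the explanatory variables */
W = Z * V[, 1:k];                      /* scores on the first k components */
g = solve(W`*W, W`*y);                 /* regress the response onto the scores */
betaStd = V[, 1:k] * g;                /* coefficients for the standardized variables */
beta    = betaStd / T(std(X));         /* convert to coefficients for the raw variables */
beta0   = mean(y) - mean(X)*beta;      /* intercept for the raw variables */
print beta0, beta[r=XNames];
quit;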

Principal component regression is also sometimes used for general dimension reduction. Instead of projecting the response variable onto a p-dimensional space of raw variables, PCR projects the response onto a k-dimensional space where k is less than p. For dimension reduction, you might want to consider another approach such as variable selection by using PROC GLMSELECT or PROC HPGENSELECT. The reason is that the PCR model retains all of the original variables whereas variable selection procedures result in models that have fewer variables.

Use PROC PLS for principal component regression

I recommend using the PLS procedure to compute a principal component regression in SAS. As mentioned previously, you need to use the METHOD=PCR and NFAC= options. The following data for 31 men at a fitness center is from the documentation for PROC REG. The goal of the study is to predict oxygen consumption from age, weight, and various physiological measurements before and during exercise. The following call to PROC PLS computes a PCR that keeps four principal components:

data fitness;
   input Age Weight Oxygen RunTime RestPulse RunPulse MaxPulse @@;
   datalines;
44 89.47 44.609 11.37 62 178 182   40 75.07 45.313 10.07 62 185 185
44 85.84 54.297  8.65 45 156 168   42 68.15 59.571  8.17 40 166 172
38 89.02 49.874  9.22 55 178 180   47 77.45 44.811 11.63 58 176 176
40 75.98 45.681 11.95 70 176 180   43 81.19 49.091 10.85 64 162 170
44 81.42 39.442 13.08 63 174 176   38 81.87 60.055  8.63 48 170 186
44 73.03 50.541 10.13 45 168 168   45 87.66 37.388 14.03 56 186 192
45 66.45 44.754 11.12 51 176 176   47 79.15 47.273 10.60 47 162 164
54 83.12 51.855 10.33 50 166 170   49 81.42 49.156  8.95 44 180 185
51 69.63 40.836 10.95 57 168 172   51 77.91 46.672 10.00 48 162 168
48 91.63 46.774 10.25 48 162 164   49 73.37 50.388 10.08 67 168 168
57 73.37 39.407 12.63 58 174 176   54 79.38 46.080 11.17 62 156 165
52 76.32 45.441  9.63 48 164 166   50 70.87 54.625  8.92 48 146 155
51 67.25 45.118 11.08 48 172 172   54 91.63 39.203 12.88 44 168 172
51 73.71 45.790 10.47 59 186 188   57 59.08 50.545  9.93 49 148 155
49 76.32 48.673  9.40 56 186 188   48 61.24 47.920 11.50 52 170 176
52 82.78 47.467 10.50 53 170 172
;
 
proc pls data=fitness method=PCR nfac=4;          /* PCR onto 4 factors */
   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse / solution;
run;

The output includes the parameter estimates table, which gives the estimates for the four-component regression in terms of the original variables. Another table (not shown) shows that the first four principal components explain 93% of the variation in the explanatory variables and 78% of the variation in the response variable.

For another example of using PROC PLS to combat collinearity, see Yu (2011), "Principal Component Regression as a Countermeasure against Collinearity."

Use PROC REG for principal component regression

I recommend PROC PLS for principal component regression, but you can also compute a PCR by using the PCOMIT= option on the MODEL statement in PROC REG. However, the parameter estimates are not displayed in any table but must be written to an OUTEST= data set, as follows:

proc reg data=fitness plots=none outest=PE; /* write PCR estimates to PE data set */
   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
         / PCOmit=2;       /* omit 2 PCs ==> keep 6-2=4 PCs */
quit;
 
proc print data=PE(where=(_Type_="IPC")) noobs;
   var Intercept--MaxPulse;
run;

Notice that the PCOMIT=2 option specifies that two PCs should be dropped, which is equivalent to keeping four components in this six-variable model. The parameter estimates are written to the PE data set and are displayed by PROC PRINT. The estimates are the same as those found by PROC PLS. In the PE data, the PCR estimates are indicated by the value "IPC" for the _TYPE_ variable, which stands for incomplete principal component regression. The word "incomplete" indicates that not all the principal components are used.

It is worth noting that even though the principal components themselves are based on centered and scaled data, the parameter estimates are reported for the original (raw) variables. It is also worth noting that you can use the OUTSEB option on the PROC REG statement to obtain standard errors for the parameter estimates.
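For example, a sketch of the previous PROC REG call with the OUTSEB option added (the PROC PRINT step is my own addition; the _TYPE_ variable distinguishes the estimates from their standard errors):

proc reg data=fitness plots=none outest=PE outseb;  /* also write standard errors */
   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
         / PCOmit=2;
quit;
 
proc print data=PE noobs;
   var _TYPE_ Intercept--MaxPulse;
run;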

Should you use principal component regression?

This article shows you how to perform principal component regression in SAS by using PROC PLS with METHOD=PCR. However, I must point out that there are statistical drawbacks to using principal component regression. The primary issue is that principal component regression does not use any information about the response variable when choosing the principal components. Before you decide to use PCR, I urge you to read my next post about the drawbacks with the technique. You can then make an informed decision about whether you want to use principal component regression for your data.

The post Principal component regression in SAS appeared first on The DO Loop.

September 18, 2017
 
Toe bone connected to the foot bone,
Foot bone connected to the leg bone,
Leg bone connected to the knee bone,...
             — American Spiritual, "Dem Bones"

Last week I read an interesting article on Robert Kosara's data visualization blog. Kosara connected the geographic centers of the US zip codes in the lower 48 states, in order, from 01001 (Agawam, MA) to 99403 (Clarkston, WA). Since the SasHelp.zipcode data set is one of the sample data sets that is distributed with SAS, it is a simple matter to recreate the graph with SAS. The following SAS statements sort the data and exclude certain American territories before graphing the locations of the US zip codes in the contiguous US. (Click on a graph to enlarge it.):

proc sort data=sashelp.zipcode(where=(StateCode 
     NOT IN ("PR", "FM", "GU", "MH", "MP", "PW", "VI")  /* exclude territories */
     AND ZIP_Class = " "))                              /* exclude "special" zip codes */
     out=zipcode(keep=ZIP x y City StateCode);
by zip;
run;
 
title "Path of US Zip Codes in Numerical Order";
proc sgplot data=zipcode noborder;
   where StateCode NOT IN ("AK", "HI");                 /* contiguous US */
   series x=X y=Y / group=StateCode;
   xaxis  display=none;
   yaxis  display=none;
run;
Map formed by connecting the US ZIP codes in order

Note that the coordinates are "raw" longitudes and latitudes; no projection is used. The graph is interesting. On this scale, the western states clearly show that the zip codes form small clusters within the states. This fact is also true for the Eastern states, but it is harder to see because of the greater density of zip codes in those states. Kosara includes an interactive map that enables you to zoom in on regions of interest. To create a static "zoomed-in" version in SAS, you can use WHERE processing to subset the data. You can also add state boundaries and labels to obtain a close-up map, such as this map of Utah (UT), Colorado (CO), Arizona (AZ), and New Mexico (NM):

Connect zip codes in order for four western US states

This graph shows that there are about a dozen zip-code clusters within each of these states. This reflects the hierarchical nature of zip codes. By design, the first digit in a zip code specifies a region of the country, such as the Southeast or the Midwest. The next two digits determine the Sectional Center Facility (SCF) code, which is a region within a state. From looking at the map, I conjecture that those clusters of jagged line segments are approximations of the SCFs.

You can use the following SAS code to generate the SCF from the zip code. The subsequent call to PROC FREQ counts the number of zip codes in each SCF in Utah. Some SCFs have many postal delivery zones within them (for example, 840xx) whereas others have relatively few (for example, 844xx). Note that the SCFs are not necessarily contiguous: Utah does not (yet) have zip codes of the form 842xx.

data Zip1;
   length SCF $5.;
   set zipcode;
   FloorZip = floor(zip/100);         /* round down to nearest 100 */
   SCF = putn(FloorZip,"Z3.")||"xx";  /* Sectional Center Facility, eg, 841xx */
   keep x y zip StateCode City SCF;
run;
proc freq data=Zip1;
   where StateCode='UT';
   tables SCF / nocum;
run;
Counts of zip codes within each SCF code in Utah

If you choose a point within each Sectional Center Facility and connect those points in order, you can obtain a much less cluttered diagram that shows the basic progression of the hierarchy of zip codes. The SCFs can zig-zag across a state and do not necessarily follow a geographical progression such as north-south or east-west. The following image connects the location of the first zip code in each SCF region in Utah. The individual zip-code centers are shown as markers that are colored by the SCF.

Map that connects the Sectional Center Facility (SCF) codes in Utah
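The following statements sketch one way to construct such a graph (my own reconstruction, not the author's program, which is linked at the end of this post). They select the first zip code in each SCF in Utah and overlay the path through those points on the individual zip-code centers:

proc sort data=Zip1 out=UT;
   where StateCode='UT';
   by SCF zip;
run;
 
data UTPath;                   /* first zip code in each SCF, renamed for the overlay */
   set UT;
   by SCF;
   if first.SCF;
   keep x y SCF;
   rename x=px y=py;
run;
 
data UTAll;                    /* combine the zip-code centers and the path points */
   set UT UTPath;
run;
 
title "Utah: Path Through the First Zip Code in Each SCF";
proc sgplot data=UTAll noautolegend;
   scatter x=x  y=y  / group=SCF markerattrs=(size=4);         /* centers, colored by SCF */
   series  x=px y=py / lineattrs=(color=black) datalabel=SCF;  /* path through the SCFs */
   xaxis display=none;
   yaxis display=none;
run;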

For states that have more than a dozen SCFs, the five-character labels can obscure the path of the SCFs. If you don't care about the actual zip-code prefixes but you just want to visualize the progression, you can label positions along the path by integers. For example, there are 25 SCFs in Florida. The following graph visualizes the regions. The first SCF (320xx) is labeled '1' and the last SCF (349xx) is labeled '25'.

Map that connects the Sectional Center Facility (SCF) codes in Florida

Lastly, the following graph shows the progression of Sectional Center Facilities at the national level. You can see certain large "jumps" across multiple states. These are present in the original map of zip codes but are obscured by the complexity of thousands of crisscrossing line segments. Two large jumps that are apparent are a diagonal line from Montana in the Pacific Northwest (prefix '59') down to Illinois (prefix '60'). Another big jump is from Nebraska in the Midwest (prefix '69') down to Louisiana (prefix '70') in the South-Central region.

Map showing US Sectional Center Facility (SCF) codes connected in order

In summary, this article uses SAS to reproduce the fascinating image on Robert Kosara's blog. Kosara's image is the path obtained by connecting the geographic centers of each zip code. By zooming in on individual states, you can identify clusters of zip codes, which correspond to Sectional Center Facilities (SCFs). By choosing one zip code from each SCF and connecting those locations in order, you obtain a simplified version of the path that connects major postal zones.

If you would like to explore these data yourself, you can download the SAS program that analyzes the zip code data.

The post The path of zip codes appeared first on The DO Loop.

August 16, 2017
 

Visualizing the correlations between variables often provides insight into the relationships between variables. I've previously written about how to use a heat map to visualize a correlation matrix in SAS/IML, and Chris Hemedinger showed how to use Base SAS to visualize correlations between variables.

Recently a SAS programmer asked how to construct a bar chart that displays the pairwise correlations between variables. This visualization enables you to quickly identify pairs of variables that have large negative correlations, large positive correlations, and insignificant correlations.

In SAS, PROC CORR can compute the correlations between variables, which are stored in matrix form in the output data set. The following call to PROC CORR analyzes the correlations between all pairs of numeric variables in the Sashelp.Heart data set, which contains data for 5,209 patients in a medical study of heart disease. Because of missing values, some pairwise correlations use more observations than others.

ods exclude all;
proc corr data=sashelp.Heart;      /* pairwise correlation */
   var _NUMERIC_;
   ods output PearsonCorr = Corr;  /* write correlations, p-values, and sample sizes to data set */
run;
ods exclude none;

The CORR data set contains the correlation matrix, p-values, and sample sizes. The statistics are stored in "wide form," with few rows and many columns. As I previously discussed, you can use the HEATMAPCONT subroutine in SAS/IML to quickly visualize the correlation matrix:

proc iml;
use Corr;
   read all var "Variable" into ColNames;  /* get names of variables */
   read all var (ColNames) into mCorr;     /* matrix of correlations */
   ProbNames = "P"+ColNames;               /* variables for p-values are named PX, PY, PZ, etc */
   read all var (ProbNames) into mProb;    /* matrix of p-values */
close Corr;
 
call HeatmapCont(mCorr) xvalues=ColNames yvalues=ColNames
     colorramp="ThreeColor" range={-1 1} title="Pairwise Correlation Matrix";
Heat map of correlations between variables

The heat map gives an overall impression of the correlations between variables, but it has some shortcomings. First, you can't determine the magnitudes of the correlations with much precision. Second, it is difficult to compare the relative sizes of correlations. For example, which is stronger: the correlation between systolic and diastolic blood pressure or the correlation between weight and MRW? (MRW is a body-weight index.)

These shortcomings are resolved if you present the pairwise correlations as a bar chart. To create a bar chart, it is necessary to convert the output into "long form." Each row in the new data set will represent a pairwise correlation. To identify the row, you should also create a new variable that identifies the two variables whose correlation is represented. Because the correlation matrix is symmetric and has 1 on the diagonal, the long-form data set only needs the statistics for the lower-triangular portion of the correlation matrix.

Let's extract the data in SAS/IML. The following statements construct a new ID variable that identifies each new row and extract the correlations and p-values for the lower-triangular elements. The statistics are written to a SAS data set called CorrPairs. (In Base SAS, you can transform the lower-triangular statistics by using the DATA step and arrays, similar to the approach in this SAS note; feel free to post your Base SAS code in the comments.)

numCols = ncol(mCorr);                /* number of variables */
numPairs = numCols*(numCols-1) / 2;
length = 2*nleng(ColNames) + 5;       /* max length of new ID variable */
PairNames = j(NumPairs, 1, BlankStr(length));
i = 1;
do row= 2 to numCols;                 /* construct the pairwise names */
   do col = 1 to row-1;
      PairNames[i] = strip(ColNames[col]) + " vs. " + strip(ColNames[row]);
      i = i + 1;
   end;
end;
 
lowerIdx = loc(row(mCorr) > col(mCorr));  /* indices of lower-triangular elements */
Corr = mCorr[ lowerIdx ];
Prob = mProb[ lowerIdx ];
Significant = choose(Prob > 0.05, "No ", "Yes");  /* use alpha=0.05 signif level */
 
create CorrPairs var {"PairNames" "Corr" "Prob" "Significant"};
append;
close;
QUIT;

You can use the HBAR statement in PROC SGPLOT to construct the bar chart. This bar chart contains 45 rows, so you need to make the graph tall and use a small font to fit all the labels without overlapping. The call to PROC SORT and the DISCRETEORDER=DATA option on the YAXIS statement ensure that the categories are displayed in order of increasing correlation.

proc sort data=CorrPairs;  by Corr;  run;
 
ods graphics / width=600px height=800px;
title "Pairwise Correlations";
proc sgplot data=CorrPairs;
hbar PairNames / response=Corr group=Significant;
refline 0 / axis=x;
yaxis discreteorder=data display=(nolabel) 
      labelattrs=(size=6pt) fitpolicy=none 
      offsetmin=0.012 offsetmax=0.012  /* half of 1/k, where k=number of categories */
      colorbands=even colorbandsattrs=(color=gray transparency=0.9);
xaxis grid display=(nolabel);
keylegend / position=topright location=inside across=1;
run;
Bar chart of pairwise correlations between variables

The bar chart (click to enlarge) enables you to see which pairs of variables are highly correlated (positively and negatively) and which have correlations that are not significantly different from 0. You can use additional colors or reference lines if you want to visually emphasize other features, such as the correlations that are larger than 0.25 in absolute value.
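For example, a sketch that adds dashed reference lines at ±0.25 to the previous graph (my own variation on the earlier PROC SGPLOT call):

proc sgplot data=CorrPairs;
   hbar PairNames / response=Corr group=Significant;
   refline 0 / axis=x;
   refline -0.25 0.25 / axis=x lineattrs=(pattern=dash color=gray);
   yaxis discreteorder=data display=(nolabel) labelattrs=(size=6pt) fitpolicy=none;
   xaxis grid display=(nolabel);
   keylegend / position=topright location=inside across=1;
run;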

The bar chart is not perfect. This example, which analyzes 10 variables, is very tall with 45 rows. Among k variables there are k(k-1)/2 correlations, so the number of pairwise correlations (rows) increases quadratically with the number of variables. In practice, this chart would be unreasonably tall when there are 14 or 15 variables (about 100 rows).

Nevertheless, for 10 or fewer variables, a bar chart of the pairwise correlations provides an alternative visualization that has some advantages over a heat map of the correlation matrix. What do you think? Would this graph be useful in your work? Leave a comment.

The post Use a bar chart to visualize pairwise correlations appeared first on The DO Loop.

August 9, 2017
 

Recently, I was asked whether SAS can perform a principal component analysis (PCA) that is robust to the presence of outliers in the data. A PCA requires a data matrix, an estimate for the center of the data, and an estimate for the variance/covariance of the variables. Classically, these estimates are the mean and the Pearson covariance matrix, respectively, but if you replace the classical estimates by robust estimates, you can obtain a robust PCA.

This article shows how to implement a classical (non-robust) PCA by using the SAS/IML language and how to modify that classical analysis to create a robust PCA.

A classical principal component analysis in SAS

In SAS, you can carry out a classical PCA by using the PRINCOMP procedure or the FACTOR procedure.

Let's use PROC PRINCOMP to perform a simple PCA. The PROC PRINCOMP results will be the basis of comparison when we implement the PCA in PROC IML. The following example is taken from the Getting Started example in the PROC PRINCOMP documentation, which analyzes crime rates in the US states:

proc princomp data=Crime /* add COV option for covariance analysis */
              outstat=PCAStats_CORR out=Components_CORR 
              plots=score(ncomp=4);
   var Murder Rape Robbery Assault Burglary Larceny Auto_Theft;
   id State;
   ods select Eigenvectors ScreePlot '2 by 1' '4 by 3';
run;
 
proc print data=Components_CORR(obs=5);
   var Prin:;
run;

To save space, the output from PROC PRINCOMP is not shown, but it includes a table of the eigenvalues and principal component vectors (eigenvectors) of the correlation matrix, as well as a plot of the scores of the observations, which are the projection of the observations onto the principal components. The procedure writes two data sets: the eigenvalues and principal components are contained in the PCAStats_CORR data set and the scores are contained in the Components_CORR data set.

A classical principal component analysis in SAS/IML

Assume that the data consists of n observations and p variables and assume all values are nonmissing. If you are comfortable with multivariate analysis, a principal component analysis is straightforward: the principal components are the eigenvectors of a covariance or correlation matrix, and the scores are the projection of the centered data onto the eigenvector basis. However, if it has been a few years since you studied PCA, Abdi and Williams (2010) provides a nice overview of the mathematics. The following matrix computations implement the classical PCA in the SAS/IML language:

proc iml;
reset EIGEN93;       /* use "9.3" algorithm, not vendor BLAS (SAS 9.4m3) */
varNames = {"Murder" "Rape" "Robbery" "Assault" "Burglary" "Larceny" "Auto_Theft"};
use Crime;
  read all var varNames into X;  /* X = data matrix (assume nonmissing) */
close;
 
S = cov(X);                      /* estimate of covariance matrix */
R = cov2corr(S);                 /* = corr(X) */
call eigen(D, V, R);             /* D=eigenvalues; columns of V are PCs */
scale = T( sqrt(vecdiag(S)) );   /* = std(X) */
B = (X - mean(X)) / scale;       /* center and scale data about the mean */
scores = B * V;                  /* project standardized data onto the PCs */
 
print V[r=varNames c=("Prin1":"Prin7") L="Eigenvectors"];
print (scores[1:5,])[c=("Prin1":"Prin7") format=best9. L="Scores"];
Classical principal component analysis in SAS

The principal components (eigenvectors) and scores for these data are identical to the same quantities that were produced by PROC PRINCOMP. In the preceding program I could have directly computed R = corr(X) and scale = std(X), but I generated those quantities from the covariance matrix because that is the approach used in the next section, which computes a robust PCA.

If you want to compute the PCA from the covariance matrix, you would use S in the EIGEN call, and define B = X - mean(X) as the centered (but not scaled) data matrix.
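As a sketch, the covariance-based variant described above looks like the following (my variable names; it continues the same PROC IML session):

call eigen(D2, V2, S);            /* eigenvectors of the covariance matrix */
B2 = X - mean(X);                 /* center, but do not scale, the data */
covScores = B2 * V2;              /* project the centered data onto the PCs */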

A robust principal component analysis

This section is based on a similar robust PCA computation in Wicklin (2010). The main idea behind a robust PCA is that if there are outliers in the data, the covariance matrix will be unduly influenced by those observations. Therefore you should use a robust estimate of the covariance matrix for the eigenvector computation. Also, the arithmetic mean is unduly influenced by the outliers, so you should center the data by using a robust estimate of center before you form the scores.

SAS/IML software provides the MCD subroutine, which computes the minimum covariance determinant (MCD) estimator, a robust estimate of the location and covariance of multivariate data. The following statements use the MCD estimates to carry out a robust PCA:

/* get robust estimates of location and covariance */
N = nrow(X); p = ncol(X);        /* number of obs and variables */
optn = j(8,1,.);                 /* default options for MCD */
optn[4]= floor(0.75*N);          /* override default: use 75% of data for robust estimates */
call MCD(rc,est,dist,optn,x);    /* compute robust estimates */
RobCov = est[3:2+p, ];           /* robust estimate of covariance */
RobLoc = est[1, ];               /* robust estimate of location */
 
/* use robust estimates to perform PCA */
RobCorr = cov2corr(RobCov);      /* robust correlation */
call eigen(D, V, RobCorr);       /* D=eigenvalues; columns of V are PCs */
RobScale = T( sqrt(vecdiag(RobCov)) ); /* robust estimates of scale */
B = (x-RobLoc) / RobScale;       /* center and scale data */
scores = B * V;                  /* project standardized data onto the PCs */

Notice that the SAS/IML code for the robust PCA is very similar to the classical PCA. The only difference is that the robust estimates of covariance and location (from the MCD call) are used instead of the classical estimates.

If you want to compute the robust PCA from the covariance matrix, you would use RobCov in the EIGEN call, and define B = X - RobLoc as the centered (but not scaled) data matrix.

Comparing the classical and robust PCA

Classical and robust principal component scores for crime data, computed in SAS

You can create a score plot to compare the classical scores to the robust scores. The Getting Started example in the PROC PRINCOMP documentation shows the classical scores for the first three components. The documentation suggests that Nevada, Massachusetts, and New York could be outliers for the crime data.

You can write the robust scores to a SAS data set and use the SGPLOT procedure to plot the scores of the classical and robust PCA on the same scale. The first and third component scores are shown to the right, with abbreviated state names serving as labels for the markers. (Click to enlarge.) You can see that the robust first-component scores for California and Nevada are separated from the other states, which makes them outliers in that dimension. (The first principal component measures overall crime rate.) The robust third-component scores for New York and Massachusetts are also well-separated and are outliers for the third component. (The third principal component appears to contrast murder with rape and auto theft with other theft.)
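A sketch of one way to create such a plot for the first and third components follows (my own approach, not necessarily the author's). It continues the SAS/IML session above and merges the robust scores with the classical scores in the Components_CORR data set; as noted below, the signs of the components are arbitrary, so you might need to flip a column to align the two analyses.

RobColNames = "RobPrin1":"RobPrin7";
create RobPCA from scores[colname=RobColNames];   /* robust scores from the previous section */
append from scores;
close RobPCA;
quit;
 
data ScorePlot;                    /* both data sets are in the original row order */
   merge Components_CORR(keep=State Prin1 Prin3) RobPCA(keep=RobPrin1 RobPrin3);
run;
 
title "Classical and Robust Principal Component Scores";
proc sgplot data=ScorePlot;
   scatter x=Prin1    y=Prin3    / datalabel=State legendlabel="Classical";
   scatter x=RobPrin1 y=RobPrin3 / datalabel=State legendlabel="Robust"
                                   markerattrs=(symbol=TriangleFilled);
   xaxis grid label="Component 1";
   yaxis grid label="Component 3";
   keylegend / location=inside position=topright;
run;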

Because the crime data does not have huge outliers, the robust PCA is a perturbation of the classical PCA, which makes it possible to compare the analyses. For data that have extreme outliers, the robust analysis might not resemble its classical counterpart.

If you run an analysis like this on your own data, remember that eigenvectors are not unique and so there is no guarantee that the eigenvectors for the robust correlation matrix will be geometrically aligned with the eigenvectors from the classical correlation matrix. For these data, I multiplied the second and fourth robust components by -1 because that seems to make the score plots easier to compare.

Summary

In summary, you can implement a robust principal component analysis by using robust estimates for the correlation (or covariance) matrix and for the "center" of the data. The MCD subroutine in SAS/IML language is one way to compute a robust estimate of the covariance and location of multivariate data. The SAS/IML language also provides ways to compute eigenvectors (principal components) and project the (centered and scaled) data onto the principal components.

You can download the SAS program that computes the analysis and creates the graphs.

As discussed in Hubert, Rousseeuw, and Branden (2005), the MCD algorithm is most useful for 100 or fewer variables. They describe an alternative computation that can support a robust PCA in higher dimensions.

References

Hubert, M., Rousseeuw, P. J., and Vanden Branden, K. (2005), "ROBPCA: A New Approach to Robust Principal Component Analysis," Technometrics, 47, 64-79.

The post Robust principal component analysis in SAS appeared first on The DO Loop.

July 5, 2017
 

A SAS customer asked how to use SAS to conduct a Z test for the equality of two proportions. He was directed to the SAS Usage Note "Testing the equality of two or more proportions from independent samples." The note says to "specify the CHISQ option in the TABLES statement of PROC FREQ to compute this test," and then adds "this is equivalent to the well-known Z test for comparing two independent proportions."

You might wonder why a chi-square test for association is equivalent to a Z test for the equality of proportions. You might also wonder whether there is a direct way to test the equality of proportions. This article implements the well-known Z test for proportions in the DATA step and compares the results to those of the chi-square test. It also shows how to get this test directly from PROC FREQ by using the RISKDIFF option.

A chi-square test for association in SAS

The SAS Usage Note poses the following problem: Suppose you want to compare the proportions responding "Yes" to a question in independent samples of 100 men and 100 women. The number of men who responded "Yes" is 30, and the number of women who responded "Yes" is 45.

You can create the data by using the following DATA step, then call PROC FREQ to analyze the association between the response variable and gender.

data Prop;
length Group $12 Response $3;
input Group Response N;
datalines;
Men          Yes  30
Men          No   70
Women        Yes  45
Women        No   55
;
 
proc freq data=Prop order=data;
   weight N;
   tables Group*Response / chisq;
run;
Test for association of two categorical variables in SAS

As explained in the PROC FREQ documentation, the Pearson chi-square statistic indicates an association between the variables in the 2 x 2 table. The results show that the chi-square statistic (for 1 degree of freedom) is 4.8, which corresponds to a p-value of 0.0285. The test indicates that we should reject the null hypothesis of no association at the 0.05 significance level.

As stated in the SAS Usage Note, this association test is equivalent to a Z test for whether the proportion of males who responded "Yes" equals the proportion of females who responded "Yes." The equivalence relies on a fact from probability theory: a chi-square random variable with 1 degree of freedom is the square of a random variable from the standard normal distribution. Thus the square root of the chi-square statistic is the Z statistic (up to a sign) that you get from the test of the equality of two proportions. Therefore the Z statistic should be z = ±sqrt(4.8) = ±2.19. The p-value is unchanged.
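If you want to verify the equivalence numerically, the following DATA step is a short sketch that computes the Pearson chi-square statistic for the 2 x 2 table by hand and compares the resulting p-value to the two-sided p-value for Z = sqrt(4.8):

/* verify the equivalence: compute the chi-square statistic by hand,
   take its square root, and compare the p-values */
data CheckEquiv;
array O[4] _temporary_ (30 70 45 55);         /* observed counts */
array E[4] _temporary_ (37.5 62.5 37.5 62.5); /* expected counts under H0 */
chiSq = 0;
do i = 1 to 4;
   chiSq = chiSq + (O[i]-E[i])**2 / E[i];     /* chi-square = 4.8 */
end;
z    = sqrt(chiSq);                           /* |Z| = 2.19 */
pChi = sdf("chisq", chiSq, 1);                /* p-value for the chi-square(1) test */
pZ   = 2*sdf("normal", z);                    /* two-sided p-value for the Z test */
put z= pChi= pZ=;                             /* both p-values are 0.0285 */
run;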

Z test for the equality of two proportions: A DATA step implementation

For comparison, you can implement the classical Z test by applying the formulas from a textbook or from the Penn State online course material, which includes a section about comparing two proportions. The following DATA step implements the Z test for the equality of proportions:

/* Implement the Z test for pre-summarized statistics. Specify the group proportions and sizes. 
   For formulas, see https://onlinecourses.science.psu.edu/stat414/node/268 */
%let alpha = 0.05;
%let N1    = 100;   /* total trials in Group1 */
%let Event1=  30;   /* Number of events in Group1  */
%let N2    = 100;   /* total trials in Group2 */
%let Event2=  45;   /* Number of events in Group2  */
 
%let Side  =   2;   /* use L, U, or 2 for lower, upper, or two-sided test */
title "Test of H0: p1=p2 vs Ha: p1^=p2"; /* change for Side=L or U */
 
data zTestProp;
p1Hat = &Event1 / &N1;                /* observed proportion in Group1 */
var1  = p1Hat*(1-p1Hat) / &N1;        /* variance in Group1 */
p2Hat = &Event2 / &N2;                /* observed proportion in Group2 */
var2  = p2Hat*(1-p2Hat) / &N2;        /* variance in Group2 */
/* use pooled estimate of p for test */
Diff = p1Hat - p2Hat;                 /* estimate of p1 - p2 */
pHat = (&Event1 + &Event2) / (&N1 + &N2);
pVar = pHat*(1-pHat)*(1/&N1 + 1/&N2); /* pooled variance */
SE   = sqrt(pVar);                    /* estimate of standard error */
Z = Diff / SE;    
 
Side = "&Side";
if Side="L" then                      /* one-sided, lower tail */
   pValue = cdf("normal", z);
else if Side="U" then                 /* one-sided, upper tail */
   pValue = sdf("normal", Z);         /* SDF = 1 - CDF */
else if Side="2" then
   pValue = 2*(1-cdf("normal", abs(Z))); /* two-sided */
format pValue PVALUE6.4 Z 7.4;
label pValue="Pr < Z";
drop var1 var2 pHat pVar;
run;
 
proc print data=zTestProp label noobs; run;
Test for equality of two independent proportions in SAS

The DATA step obtains a test statistic of Z = –2.19, which is one of the square roots of the chi-square statistic in the PROC FREQ output. Notice also that the p-value from the DATA step matches the p-value from the PROC FREQ output.

Test equality of proportions by using PROC FREQ

There is actually a direct way to test for the equality of two independent proportions: use the RISKDIFF option in the TABLES statement in PROC FREQ. In the documentation, binomial proportions are called "risks," so a "risk difference" is a difference in proportions. (Also, a "relative risk" (the RELRISK option) measures the ratio of two proportions.) Equality of proportions is equivalent to testing whether the difference of proportions (risks) is zero.

As shown in the documentation, PROC FREQ supports many options for comparing proportions. You can use the following suboptions to reproduce the classical equality of proportions test:

  1. EQUAL requests an equality test for the difference in proportions. By default, the Wald method (METHOD=WALD) is used, but you can choose other methods.
  2. VAR=NULL specifies that the variance for the Wald test is computed under the null hypothesis (by using the pooled proportion), which matches the classical Z test.
  3. (optional) CL=WALD outputs the Wald confidence interval for the difference.

Combining these options gives the following direct computation of the difference between two proportions:

proc freq data=Prop order=data;
   weight N;
   tables Group*Response / riskdiff(equal var=null cl=wald); /* Wald test for equality */
run;
Test for difference of proportions in SAS

The 95% (Wald) confidence interval is shown in the first table. The confidence interval is centered on the point estimate of the difference (-0.15). The interval does not contain 0, so the difference is significantly different from 0 at the 0.05 significance level.

The second table shows the results of the Wald equality test. The "ASE (H0)" row gives the estimate for the (asymptotic) standard error, assuming the null hypothesis. The Z score and the two-sided p-value match the values from the DATA step computation, and the interpretation is the same.

Summary

In summary, the SAS Usage Note correctly states that the chi-square test of association is equivalent to the Z test for the equality of proportions. To run the Z test explicitly, this article uses the SAS DATA step to implement the test when you have summary statistics. As promised, the Z statistic is one of the square roots of the chi-square statistic and the p-values are the same. The DATA step removes some of the mystery regarding the equivalence between these two tests.

However, writing DATA step code cannot match the convenience of a procedure. For raw or pre-summarized data, you can use the RISKDIFF option in PROC FREQ to run the same test (recast as a difference of proportions or "risks"). To get exactly the same confidence intervals and statistics as the classical test (which is called the Wald test), you need to add a few suboptions. The resulting output matches the DATA step computations.
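For example, if the responses were in a raw data set with one row per subject (the data set name RawProp and its variables are hypothetical), the call is the same except that there is no WEIGHT statement:

/* sketch: the same Wald equality test for raw (one row per subject) data */
proc freq data=RawProp;
   tables Group*Response / riskdiff(equal var=null cl=wald);
run;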

The post Test for the equality of two proportions in SAS appeared first on The DO Loop.