Rick Wicklin

Nov 22, 2017
 

A statistical programmer read my article about the beta-binomial distribution and wanted to know how to compute the cumulative distribution (CDF) and the quantile function for this distribution. In general, if you know the PDF for a discrete distribution, you can also compute the CDF and quantile functions. This article shows how.

Continuous versus discrete distributions

Recall the four essential functions for any probability distribution: the probability density function (PDF), the cumulative distribution function (CDF), the quantile function, and the random-variate generator. SAS software provides the PDF, CDF, QUANTILE, and RAND functions, which support about 25 common families of distributions. However, there are infinitely many probability distributions. What can you do if you need a distribution that is not natively supported in SAS? The answer is that you can use tools in SAS (especially in SAS/IML software) to implement these functions yourself.

I have previously shown how to use numerical integration and root-finding algorithms to compute the CDF and quantile function for continuous distributions such as the folded normal distribution and the generalized Gaussian distribution. For discrete distributions, you can use a summation to obtain the CDF from the PDF. The CDF at X=x is the sum of the PDF values for all values of X that are less than or equal to x. The discrete CDF is a step function, so it does not have an inverse function. Given a probability, p, the quantile for p is defined as the smallest value of x for which CDF(x) ≥ p.

To simplify the discussion, assume that the random variable X takes values on the integers 0, 1, 2, .... Many discrete probability distributions in statistics satisfy this condition, including the binomial, Poisson, geometric, beta-binomial, and more.
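The recipe is easy to state in any language. Here is a minimal Python sketch (the toy `pmf` list and the helper names are illustrative, not part of the SAS program that follows): the CDF is a running sum of the PMF, and the quantile is the smallest x whose CDF value reaches p.

```python
from itertools import accumulate

def discrete_cdf(pmf_values):
    """CDF(x) is the cumulative sum of PMF values for X = 0, 1, 2, ..."""
    return list(accumulate(pmf_values))

def discrete_quantile(p, cdf_values):
    """The quantile for p: the smallest x for which CDF(x) >= p."""
    for x, F in enumerate(cdf_values):
        if F >= p:
            return x
    raise ValueError("p exceeds the total probability")

# toy PMF on {0, 1, 2, 3}
pmf = [0.25, 0.25, 0.25, 0.25]
cdf = discrete_cdf(pmf)            # [0.25, 0.5, 0.75, 1.0]
q = discrete_quantile(0.5, cdf)    # smallest x with CDF(x) >= 0.5 is x = 1
```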

Compute a CDF from a PDF

I will use SAS/IML software to implement the CDF and quantile functions. The following SAS/IML modules define the PDF function and the CDF function. If you pass in a scalar value for x, the CDF function computes the PDF from X=0 to X=x and sums up the individual probabilities. When x is a vector, the implementation is slightly more complicated, but the idea is the same.

proc iml;
/* PMF function for the beta-binomial distribution. See
   https://blogs.sas.com/content/iml/2017/11/20/simulate-beta-binomial-sas.html */
start PDFBetaBinom(x, nTrials, a, b);
   logPMF = lcomb(nTrials, x) + logbeta(x + a, nTrials - x + b) - logbeta(a, b);
   return ( exp(logPMF) );
finish;
 
/* CDF function for the beta-binomial distribution */
start CDFBetaBinom(x, nTrials, a, b);
   y = 0:max(x);                               /* all values of X less than or equal to max(x) */
   PMF = PDFBetaBinom(y, NTrials, a, b);       /* compute PMF(t) for t=0..max(x) */
   cumPDF = cusum(PMF);                        /* cdf(T) = sum( PMF(t), t=0..T ) */
   /* The input x can be a vector of values in any order.
      Use the LOC function to assign CDF(x) for each value of x */
   CDF = j(nrow(x), ncol(x), .);               /* allocate result (same size as x) */
   u = unique(x);                              
   do i = 1 to ncol(u);                        /* for each unique value of x... */
      idx = loc(x=u[i]);                       /*    find the indices */
      CDF[idx] = cumPDF[1+u[i]];               /*    assign the CDF   */
   end;
   return CDF;
finish;
 
/* test the CDF function */
a = 6; b = 4; nTrials = 10;
t = 2:9;                              /* values at which to evaluate the CDF */
c = CDFBetaBinom(t, nTrials, a, b);
print c;

Notice the computational trick that is used to compute the CDF for a vector of values. If you want to compute the CDF at two or more values (say X=5 and X=7), you can compute the PDF for X=0, 1, ... M where M is the maximum value that you need (such as 7). Along the way, you end up computing the CDF for all smaller values.

You can write the CDF values to a SAS data set and graph them to visualize the CDF of the beta-binomial distribution, as shown below:

Beta-binomial cumulative distribution

Compute quantiles from the CDF

Given a probability value p, the quantile for p is the smallest value of x for which CDF(x) ≥ p. Sometimes you can find a formula that enables you to compute the quantile directly but, in general, the definition of a quantile requires that you compute the CDF for X=0, 1, 2, ... and so forth until the CDF equals or exceeds p.

The support of the beta-binomial distribution is the set {0, 1, ..., nTrials}. The following function first computes the CDF function for all values. It then looks at each value of p and returns the corresponding value of X that satisfies the quantile definition.

start QuantileBetaBinom(_p, nTrials, a, b);
   p = colvec(_p);                         /* ensure input is column vector */
   y = 0:nTrials;                          /* all possible values of X   */
   CDF = CDFBetaBinom(y, NTrials, a, b);   /* all possible values of CDF */
   q = j(nrow(p), 1, .);                   /* allocate result vector */
   do i = 1 to nrow(p);
      idx = loc( p[i] <= CDF );            /* find all x such that p <= CDF(x) */ 
      q[i] = idx[1] - 1;                   /* smallest index. Subtract 1 from index to get x */
   end;
   return ( q );
finish;
 
p = {0.2, 0.5, 0.8};
q = QuantileBetaBinom(p, nTrials, a, b);
print p q;

The output shows the quantile values for p = {0.2, 0.5, 0.8}. You can find the quantiles visually by using the graph of the CDF function, shown in the previous section. For each value of p on the vertical axis, draw a horizontal line until the line hits a bar. The x value of the bar is the quantile for p.

Notice that the QuantileBetaBinom function uses the fact that the beta-binomial distribution has finite support. The function computes the CDF at the values 0, 1, ..., nTrials. From those values you can "look up" the quantile for any p. If you have a distribution with infinite support (such as the geometric distribution) or very large finite support (such as nTrials=1E9), it is more efficient to apply an iterative process such as bisection to find the quantiles.
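As a hedged sketch of the bisection idea (in Python, with a geometric CDF standing in for any nondecreasing discrete CDF), the search repeatedly halves an integer interval until it isolates the smallest x with CDF(x) ≥ p:

```python
def quantile_bisect(p, cdf, lo=0, hi=1_000_000):
    """Smallest integer x in [lo, hi] with cdf(x) >= p, found by bisection.
       Assumes cdf is nondecreasing and that cdf(hi) >= p."""
    while lo < hi:
        mid = (lo + hi) // 2
        if cdf(mid) >= p:
            hi = mid          # mid satisfies the condition; search to the left
        else:
            lo = mid + 1      # mid fails; the quantile lies to the right
    return lo

# example: geometric distribution on {0, 1, 2, ...} with success probability 0.3,
# for which CDF(x) = 1 - 0.7**(x+1)
geom_cdf = lambda x: 1 - 0.7**(x + 1)
q = quantile_bisect(0.95, geom_cdf)    # smallest x with CDF(x) >= 0.95
```

Because each step halves the interval, this needs only about log2(hi) evaluations of the CDF, instead of evaluating it at every point of a huge support.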

Summary

In summary, you can compute the CDF and quantile functions for a discrete distribution directly from the PDF. The process was illustrated by using the beta-binomial distribution. The CDF at X=x is the sum of the PDF evaluated for all values less than or equal to x. The quantile for p is the smallest value of x for which CDF(x) ≥ p.

The post Compute the CDF and quantiles of discrete distributions appeared first on The DO Loop.

Nov 20, 2017
 

This article shows how to simulate beta-binomial data in SAS and how to compute the density function (PDF). The beta-binomial distribution is a discrete compound distribution. The "binomial" part of the name means that the discrete random variable X follows a binomial distribution with parameters N (number of trials) and p, but there is a twist: The parameter p is not a constant value but is a random variable that follows the Beta(a, b) distribution.

The beta-binomial distribution is used to model count data where the counts are "almost binomial" but have more variance than can be explained by a binomial model. Therefore this article also compares the binomial and beta-binomial distributions.

Simulate data from the beta-binomial distribution

To generate a random value from the beta-binomial distribution, use a two-step process. The first step is to draw p randomly from the Beta(a, b) distribution. Then you draw x from the binomial distribution Bin(p, N). The beta-binomial distribution is not natively supported by the RAND function in SAS, but you can call the RAND function twice to simulate beta-binomial data, as follows:

/* simulate a random sample from the beta-binomial distribution */
%let SampleSize = 1000;
data BetaBin;                         
a = 6; b = 4; nTrials = 10;           /* parameters */
call streaminit(4321);
do i = 1 to &SampleSize;
   p = rand("Beta", a, b);            /* p[i] ~ Beta(a,b)  */
   x = rand("Binomial", p, nTrials);  /* x[i] ~ Bin(p[i], nTrials) */
   output;
end;
keep x;
run;
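The same two-step draw can be sketched outside SAS. Here is a minimal Python version that uses only the standard library (the Bernoulli-sum binomial draw is for self-containment, not efficiency):

```python
import random

def rand_beta_binomial(n_trials, a, b, rng=random):
    """Two-step draw: p ~ Beta(a, b), then X ~ Binomial(n_trials, p)."""
    p = rng.betavariate(a, b)                              # step 1: random p
    return sum(rng.random() < p for _ in range(n_trials))  # step 2: count successes

random.seed(4321)
sample = [rand_beta_binomial(10, 6, 4) for _ in range(1000)]
mean = sum(sample) / len(sample)    # should be near nTrials*a/(a+b) = 6
```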

The result of the simulation is shown in the following bar chart. The expected values are overlaid. The next section shows how to compute the expected values.

Beta-binomial distribution and expected values in SAS

The PDF of the beta-binomial distribution

The Wikipedia article about the beta-binomial distribution contains a formula for the PDF of the distribution. Since the distribution is discrete, some references prefer to use "PMF" (probability mass function) instead of PDF. Regardless, if X is a random variable that follows the beta-binomial distribution then the probability that X=x is given by
P(X = x) = C(N, x) * B(x + a, N - x + b) / B(a, b)
where B is the complete beta function.

The binomial coefficients ("N choose x") and the beta function are defined in terms of factorials and gamma functions, which get big fast. For numerical computations, it is usually more stable to compute the log-transform of the quantities and then exponentiate the result. The following DATA step computes the PDF of the beta-binomial distribution. For easy comparison with the distribution of the simulated data, the DATA step also computes the expected count for each value in a random sample of size N. The PDF and the simulated data are merged and plotted on the same graph by using the VBARBASIC statement in SAS 9.4M3. The graph was shown in the previous section.

data PDFBetaBinom;           /* PMF function */
a = 6; b = 4; nTrials = 10;  /* parameters */
do x = 0 to nTrials;
   logPMF = lcomb(nTrials, x) +
            logbeta(x + a, nTrials - x + b) -
            logbeta(a, b);
   PMF = exp(logPMF);        /* probability that X=x */
   EX = &SampleSize * PMF;   /* expected value in random sample */
   output;
end;
keep x PMF EX;
run;
 
/* Merge simulated data and PMF. Overlay PMF on data distribution. */
data All;
merge BetaBin PDFBetaBinom(rename=(x=t));
run;
 
title "The Beta-Binomial Distribution";
title2 "Sample Size = &SampleSize";
proc sgplot data=All;
   vbarbasic x / barwidth=1 legendlabel='Simulated Sample';      /* requires SAS 9.4M3 */
   scatter x=t y=EX /  legendlabel='Expected Value'
      markerattrs=GraphDataDefault(symbol=CIRCLEFILLED size=10);
   inset "nTrials = 10"  "a = 6"  "b = 4" / position=topleft border;
   yaxis grid;
   xaxis label="x" integer type=linear;             /* force TYPE=LINEAR */
run;
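The log-transform trick is portable. Here is a minimal Python sketch, with math.lgamma standing in for the LCOMB and LOGBETA calls in the DATA step above (the helper names are illustrative):

```python
from math import lgamma, exp

def log_comb(n, k):
    """log of the binomial coefficient 'n choose k' (like LCOMB in SAS)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def log_beta(a, b):
    """log of the complete beta function (like LOGBETA in SAS)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def betabinom_pmf(x, n_trials, a, b):
    """Beta-binomial PMF, computed on the log scale and then exponentiated."""
    log_pmf = (log_comb(n_trials, x)
               + log_beta(x + a, n_trials - x + b)
               - log_beta(a, b))
    return exp(log_pmf)

pmf = [betabinom_pmf(x, 10, 6, 4) for x in range(11)]
total = sum(pmf)                                  # probabilities sum to 1
mean = sum(x * p for x, p in enumerate(pmf))      # equals nTrials*a/(a+b) = 6
```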

Compare the binomial and beta-binomial distributions

One application of the beta-binomial distribution is to model count data that are approximately binomial but have more variance ("thicker tails") than the binomial model predicts. The expected value of a Beta(a, b) distribution is a/(a + b), so let's compare the beta-binomial distribution to the binomial distribution with p = a/(a + b).

The following graph overlays the two PDFs for a = 6, b = 4, and nTrials = 10. The blue distribution is the binomial distribution with p = 6/(6 + 4) = 0.6. The pink distribution is the beta-binomial. You can see that the beta-binomial distribution has a shorter peak and thicker tails than the corresponding binomial distribution. The expected value for both distributions is 6, but the variance of the beta-binomial distribution is greater. Thus you can use the beta-binomial distribution as an alternative to the binomial distribution when the data exhibit greater variance than expected under the binomial model (a phenomenon known as overdispersion).
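You can verify the variance claim numerically. The following Python sketch (illustrative, not part of the original program) computes the mean and variance of both distributions directly from their PMFs:

```python
from math import lgamma, exp, comb

def betabinom_pmf(x, n, a, b):
    """Beta-binomial PMF (log-beta computed via lgamma for stability)."""
    log_beta = lambda s, t: lgamma(s) + lgamma(t) - lgamma(s + t)
    return comb(n, x) * exp(log_beta(x + a, n - x + b) - log_beta(a, b))

def binom_pmf(x, n, p):
    """Binomial PMF."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, a, b = 10, 6, 4
p = a / (a + b)          # binomial with the same mean: p = 0.6

def moments(pmf):
    """Mean and variance of a distribution on {0, ..., n} from its PMF."""
    m = sum(x * pmf(x) for x in range(n + 1))
    v = sum((x - m)**2 * pmf(x) for x in range(n + 1))
    return m, v

bb_mean, bb_var = moments(lambda x: betabinom_pmf(x, n, a, b))
bin_mean, bin_var = moments(lambda x: binom_pmf(x, n, p))
```

Both means equal N*a/(a+b) = 6, but the beta-binomial variance, N*p*(1-p)*(a+b+N)/(a+b+1) ≈ 4.36, exceeds the binomial variance N*p*(1-p) = 2.4, which is the overdispersion described above.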

Compare the binomial and beta-binomial distributions with the same mean. The beta-binomial distribution has greater variance than the binomial.

Summary

The beta-binomial distribution is an example of a compound distribution. You can simulate data from a compound distribution by randomly drawing the parameters from some distribution and then using those random parameters to draw the data. For the beta-binomial distribution, the probability parameter p is drawn from a beta distribution and then used to draw x from a binomial distribution where the probability of success is the value of p. You can use the beta-binomial distribution to model data that have greater variance than expected under the binomial model.

The post Simulate data from the beta-binomial distribution in SAS appeared first on The DO Loop.

Nov 15, 2017
 

Did you know that a SAS/IML function can recover from a run-time error? You can specify how to handle run-time errors by using a programming technique that is similar to the modern "try-catch" technique, although the SAS/IML technique is an older implementation.

Preventing errors versus handling errors

In general, SAS/IML programmers should try to detect potential problems and prevent errors from occurring. For example, before you compute the square root of a number, you should test whether the number is greater than or equal to zero. If not, you can handle the bad input. By testing the input value before you call the SQRT function, you prevent a run-time error.

However, sometimes you don't know whether a computation will fail until you attempt it. Other times, the test to determine whether an error will occur is very expensive, and it is cheaper to attempt the computation and handle the error if it occurs.

A canonical example is computing the inverse of a square matrix. It is difficult to test whether a given matrix is numerically singular. You can compute the rank of the matrix or the condition number of the matrix, but both computations are expensive because they rely on the SVD or eigenvalue decomposition. Cheaper methods include computing a determinant or using Gaussian elimination, but these methods can be numerically unstable and are not good indicators of whether a matrix is numerically singular.
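In a language with structured exception handling, attempt-then-handle is literally try-catch. Here is a minimal Python sketch of the matrix-inverse example (a hand-rolled 2x2 inverse keeps it self-contained; this is an analogy, not the SAS/IML mechanism described below):

```python
def inv2x2(A):
    """Attempt to invert a 2x2 matrix stored as a list of rows."""
    (a, b), (c, d) = A
    det = a * d - b * c
    if det == 0:
        raise ZeroDivisionError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

def inv_or_missing(A):
    """Attempt the computation; on failure, handle the error and return None."""
    try:
        return inv2x2(A)
    except ZeroDivisionError:
        return None

B1 = inv_or_missing([[1, 1], [0, 1]])    # nonsingular: returns the inverse
B2 = inv_or_missing([[1, 1], [1, 1]])    # singular: error is caught, returns None
```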

Run-time error in SAS/IML modules

To understand how to handle run-time errors in SAS/IML modules, recall how the execution queue works: when a run-time error occurs in a module, the module pauses and waits for additional commands. However, if there are statements in the execution queue, those statements are executed.

You can use these facts to handle errors. At the top of the module, use the PUSH statement to add error-handling statements into the execution queue. If an error occurs, the module pauses, sees that there are statements in the queue, and executes those statements. If one of the statements contains a RESUME statement, the module resumes execution.

Push, pause, execute, and resume

Let's apply these ideas to computing the inverse of a square matrix. The following SAS/IML function attempts to compute the inverse of the input matrix. If the matrix is singular, the computation fails and the module handles that error. The main steps of the modules are as follows:

  1. Define the statements that will be executed in the event of an error. The last executable statement should be the RESUME statement. (Technical point: The RESUME statement is only valid when the module is paused. Surround the statements with IF-THEN logic to prevent the error-handling statements from running when the module exits normally.)
  2. Push these statements into the execution queue.
  3. If an error occurs, the statements in the queue will be executed. The RESUME statement clears the error and resumes execution of the module. (If no error occurs, the queue is flushed after the RETURN statement.)
proc iml;
/* If the matrix is not invertible, return a missing value. 
   Otherwise, return the inverse.
*/
start InvEx(A); 
   errFlag = 1;           /* set flag. Assume we will get an error :-) */   
   on_error = "if errFlag then do;  AInv = .;  resume; end;";
   call push(on_error);   /* PUSH code that will be executed if an error occurs */
   AInv = inv(A);         /* if error, AInv set to missing and function resumes */
   errFlag = 0;           /* remove flag for normal exit from module */
   return ( AInv );
finish;
 
A1 = {1 1,
      0 1};
B1 = InvEx(A1);
A2 = {1 1,
      1 1};
B2 = InvEx(A2);
 
print B1, B2;

The function is called first with a nonsingular matrix, A1. The INV function does not fail, so the module exits normally. When the module exits, it flushes the execution queue. Because ErrFlag=0, the pushed statements have no effect. B1 is equal to the inverse of A1.

The function is called next with a singular matrix, A2. The INV function encounters an error and the module pauses. When the module pauses, it executes the statements in the queue. Because ErrFlag=1, the statements set AINV to a missing value and resume the module execution. The module exits normally and the B2 value is set to missing.

Summary

In summary, you can implement statements that handle run-time errors in SAS/IML modules. The key is to push error-handling code into a queue. The error-handling statements should include a RESUME statement. If the module pauses due to an error, the statements in the queue are executed and the module resumes execution. Be aware that the queue is flushed when the module exits normally, so use a flag variable and IF-THEN logic to control how the statements behave when there is no error.

The post Catch run-time errors in SAS/IML programs appeared first on The DO Loop.

Nov 13, 2017
 

Debugging is the bane of every programmer. SAS supports a DATA step debugger, but that debugger can't be used for debugging SAS/IML programs. In lieu of a formal debugger, many SAS/IML programmers resort to inserting multiple PRINT statements into a function definition. However, there is an easier way to query the values of variables inside a SAS/IML function: Use the PAUSE statement to halt the execution of program inside the misbehaving function. You can then interactively submit PRINT statements or any other valid statements. When you are finished debugging, you can submit the RESUME statement and the function will resume execution. (Or you can submit the STOP statement to halt execution and return to the main scope of the program.)

The PAUSE statement

The SAS/IML language is interactive. The PAUSE statement pauses the execution of a SAS/IML module. While the program is paused you can print or assign local variables inside the module. When you are ready for the module to resume execution, you submit the RESUME statement. (The SAS/IML Studio application works slightly differently. The PAUSE statement brings up a dialog box that you can use to submit statements. You press the Resume button to resume execution.)

For example, suppose you write a function called 'Func' whose purpose is to compute the sum of squares of the elements in a numeric vector. While testing the function, you discover that the answer is not correct when you pass in a row vector. You decide to insert a PAUSE statement ("set a breakpoint") near the end of the module, just before the RETURN statement, as follows:

proc iml;
/* intended to compute the sum of squares of the elements of x */
start Func(x);
   y = x`*x;
 
  /* Debugging tip: Use PAUSE to enter INTERACTIVE mode INSIDE module! */
  pause "Inside 'Func'; submit RESUME to resume computations.";
  /* Execution pauses until you submit RESUME, then continues from the next statement. */
 
   return (y);
finish;
 
w = Func( {1 2 3} );

When you run the program, it prints Inside 'Func'; submit RESUME to resume computations. The program then waits for you to enter commands. The program will remain paused until you submit the RESUME statement (include a semicolon!).

Because the program has paused inside the function, you can query or set the values of local variables. For this function, there are only two local variables, x and y. Highlight the following statement, and press F3 to submit it.

print y;     /* inside module; print local variable */

The output shows that the value of y is a matrix, not a scalar, which indicates that the expression x`*x does not compute the sum of squares for this input vector. You might recall that SAS/IML has a built-in SSQ function, which computes the sum of squares. Highlight the following statement and press F3 to submit it:
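The underlying mistake is independent of SAS: for a 1-by-n row vector, x`*x is an n-by-n outer product, whereas x*x` is the 1-by-1 sum of squares. A plain-Python sketch (list-of-rows matrices, illustrative helper names) reproduces the bug:

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

x = [[1, 2, 3]]                  # a 1x3 row vector
xT = [[v] for v in x[0]]         # its transpose, a 3x1 column vector

outer = matmul(xT, x)            # x` * x : a 3x3 matrix (the bug)
inner = matmul(x, xT)            # x * x` : a 1x1 matrix holding the sum of squares
ssq = inner[0][0]                # 1 + 4 + 9 = 14
```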

y = ssq(x);  /* assign local variable inside module */

This assignment has overwritten the value of the local variable, y. When you submit a RESUME statement, the function will resume execution and return the value of y. The program is now at the main scope, which you can demonstrate by printing the value of w:

resume;      /* resume execution. Return to main scope */
print w;     /* print result at main scope */

To make the change permanent, you must edit the function definition and redefine the module. Be sure to remove the PAUSE statement when you are finished debugging: If you run a program in batch mode that contains a PAUSE statement, the program will forever wait for input!

In conclusion, the PAUSE statement enables you to pause the execution of a program inside a module and interactively query and change the local variables in the module. This can help you to debug a function. After you finish investigating the state of the local variables, you can submit the RESUME statement, which tells the function to continue executing from the line after the PAUSE statement. (Or submit STOP to exit the function.) The PAUSE statement can be a useful alternative to inserting many PRINT statements inside a function during the debugging phase of program development.

The post A tip for debugging SAS/IML modules: The PAUSE statement appeared first on The DO Loop.

Nov 8, 2017
 

A SAS programmer wanted to display a table in which the rows have different formats. An example is shown below. The programmer wanted columns that represent statistics and rows that represent variables. She wanted to display formats (such as DOLLAR) for some variables—but only for certain statistics. For example, the number of nonmissing observations (N) should never use the format of the variable, whereas the minimum, maximum, and mean values should.

Format rows of a table

In SAS you can easily apply a format to a column, but not to a row. However, SAS provides two techniques that enable you to control the appearance of cells in tables:

  • The TEMPLATE procedure, which enables you to define a custom template for an ODS table.
  • The REPORT procedure, which enables you to control the contents and style of the cells in a report.

Both PROC TEMPLATE and PROC REPORT have a learning curve. If you are in a hurry, an alternative solution is to use the DATA step to create a factoid. Recall that a factoid is an ODS table that contains character values and (formatted) numbers in the same column. A factoid can display mixed-type data by converting every value to a string. This conversion process gives you complete control over the format of each cell in the table. Consequently, you can use context to format some numbers differently from other numbers.

Creating a custom factoid

Suppose that you want to generate the table shown at the top of this article. You can obtain the statistics from PROC MEANS and then transpose the data. The following statements produce an output data set that contains descriptive statistics for three variables in the Sashelp.Cars data:

proc means data=sashelp.cars noprint;
var Cylinders EngineSize MSRP;
output out=stats(drop=_TYPE_ _FREQ_
                 rename=(_STAT_=Label1)
                 where=(Label1 ^= "STD") );
run;
 
proc print data=stats noobs;
run;
Descriptive statistics with formats

I have highlighted the first row of the output data to indicate why this output is not suitable for a report. The output variables retain the format of the original variables. However, the 'N' statistic is the sample size, and a sample of size '$428' does not make sense! Even '428.000' is a strange value to see for a sample size. It would be great to format the first row with an integer format such as 8.0.

The next DATA step creates three character variables (CVALUE1-CVALUE3) that contain formatted values of the three numerical variables in the data. The SELECT-WHEN statement does the following:

  • For each observation and for each of the three numerical variables:
    • If the LABEL1 variable is 'N', then apply the 8.0 format to the value of the numerical variable.
    • Otherwise use the VVALUE function to get the formatted value for the numerical variable.
data Factoid;
set stats;
array nValue[3] Cylinders EngineSize MSRP;      /* numerical variables */
array cValue[3] $12.;                           /* cValue[i] is formatted version of nValue[i] */
/* with a little effort you could assign labels dynamically */
label cValue1="Cylinders" cValue2="EngineSize" cValue3="MSRP";
 
do i = 1 to dim(nValue);
   select (Label1);
      when ('N')    cValue[i] = put(nvalue[i], 8.0);
      otherwise     cValue[i] = vvalue(nvalue[i]);
   end;
end;
run;
 
proc print data=Factoid label noobs;
   var Label1 cValue:;
run;

The output displays the character columns, which contain formatted numbers. With additional logic, you could display fewer decimal values for the MEAN row.
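The look-at-the-label-then-format logic is easy to sketch in other languages, too. Here is an illustrative Python version (the labels mirror the SELECT-WHEN logic in the DATA step; the values are illustrative):

```python
def format_cell(label, value):
    """Choose a format from the row label, like the SELECT-WHEN logic above."""
    if label == "N":
        return f"{value:.0f}"       # sample size: integer format
    if label in ("MIN", "MAX", "MEAN"):
        return f"${value:,.2f}"     # money-like rows: DOLLAR-style format
    return str(value)

rows = [("N", 428.0), ("MIN", 10280.0), ("MAX", 192465.0), ("MEAN", 32774.86)]
formatted = {lab: format_cell(lab, val) for lab, val in rows}
```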

Transpose the factoid

The previous table contains the information that the programmer wants. However, the programmer asked to display the transpose of the previous table. For completeness, here are SAS statements that transpose the table:

proc transpose data=Factoid out=TFactoid;
   var cValue:;
   ID Label1;
   IDLABEL Label1;
run;
 
proc print data=TFactoid(drop=_NAME_) noobs label;
   label _LABEL_ = "Variable";
run;

The final table is displayed at the top of the article.

Summary

In summary, this article shows one way to display a column of data in which each cell potentially has a different format. The trick is to create a factoid, which means that you create a character copy of numeric data. As you create the copy, you can use the PUT function to apply a custom format, or you can use the VVALUE function to get the existing formatted value.

In general, you should try to avoid creating copies of data, so use PROC TEMPLATE and PROC REPORT if you know how. However, if you don't know how to use those tools, the factoid provides an alternative way to control the formatted values that appear in a table. Although not shown in this article, a factoid also enables you to display character and numeric values in the same column.

The post How to format rows of a table in SAS appeared first on The DO Loop.


Nov 6, 2017
 
A factoid in SAS is a table that displays numeric and character values in a single column

Have you ever seen the "Fit Summary" table from PROC LOESS, as shown to the right? Or maybe you've seen the "Model Information" table that is displayed by some SAS analytical procedures? These tables provide brief interesting facts about a statistical procedure, hence they are called factoids.

In SAS, a "factoid" has a technical meaning:

  • A factoid is an ODS table that can display numerical and character values in the same column.
  • A factoid displays row headers in a column. In most ODS styles, the row headers are displayed in the first column by using a special style, such as a light-blue background color in the HTMLBlue style.
  • A factoid does not display column headers because the columns display disparate and potentially unrelated information.

I want to emphasize the first item in the list. Since variables in a SAS data set must be either character or numeric, you might wonder how to access the data that underlies a factoid. You can use the ODS OUTPUT statement to look at the data object behind any SAS table, as shown below:

proc loess data=sashelp.cars plots=none;
   model mpg_city = weight;
   ods output FitSummary=Fit(drop=SmoothingParameter);
run;
 
proc print data=Fit noobs;
run;
The data object that underlies a factoid in SAS

The PROC PRINT output shows how the factoid manages to display characters and numbers in the same column. The underlying data object contains three columns. The LABEL1 column contains the row headers, which identify each row. The CVALUE1 column is the column that is displayed in the factoid. It is a character column that contains character strings and formatted copies of the numbers in the NVALUE1 column. The NVALUE1 column contains the raw numeric value of every number in the table. Missing values represent rows for which the table displays a character value.

All factoids use the same naming scheme and display the LABEL1 and CVALUE1 columns. The form of the data is important when you want to use the numbers from a factoid in a SAS program. Do not use the CVALUE1 character variable to get numbers! Those values are formatted and possibly truncated, as you can see by looking at the "Smoothing Parameter" row. Instead, read the numbers from the NVALUE1 variable, which stores the full double-precision number.

For example, if you want to use the AICC statistic (the last row) in a program, read it from the NVALUE1 column, as follows:

data _NULL_;
   set Fit(where=( Label1="AICC" ));  /* get row with AICC value */
   call symputx("aicc", NValue1);     /* read value of NValue1 variable into a macro */
run;
%put &=aicc;                          /* display value in log */
AICC=3.196483775
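The round-trip hazard is easy to demonstrate outside SAS as well. The following Python sketch (the value is hypothetical, chosen only for illustration) shows that a number recovered from a formatted string no longer equals the full-precision original, which is exactly why you should read numbers from NVALUE1 rather than CVALUE1:

```python
# Illustration (in Python, not SAS) of why formatted character copies of
# numbers should not be used for computation: formatting truncates precision.
# NVALUE1 plays the role of 'raw'; CVALUE1 plays the role of 'formatted'.
raw = 0.22766285177           # a full-precision value, as stored in NVALUE1
formatted = f"{raw:.5f}"      # what a display column such as CVALUE1 contains
recovered = float(formatted)  # reading the number back from the string

print(formatted)              # '0.22766'
print(recovered == raw)       # False: precision was lost in the round trip
```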

Some procedures produce factoids that display multiple columns. For example, PROC CONTENTS creates the "Attributes" table, which is a factoid that displays four columns. The "Attributes" table displays two columns of labels and two columns of values. When you use the ODS OUTPUT statement to create a data set, the variables for the first two columns are LABEL1, CVALUE1, and NVALUE1. The variables for the second two columns are LABEL2, CVALUE2, and NVALUE2.

Be aware that the values in the LABEL1 (and LABEL2) columns depend on the LOCALE= option for your SAS session. This means that some values in the LABEL1 column might be translated into French, German, Korean, and so forth. When you use a WHERE clause to extract a value, be aware that the WHERE clause might be invalid in other locales. If you suspect that your program will be run under multiple locales, use the _N_ automatic variable, such as if _N_=14 then call symputx("aicc", NValue1);. Compared with the WHERE clause, using the _N_ variable is less readable but more portable.

Now that you know what a factoid is, you will undoubtedly notice them everywhere in your SAS output. Remember that if you need to obtain numerical values from a factoid, use the ODS OUTPUT statement to create a data set. The NVALUE1 variable contains the full double-precision numbers in the factoid. The CVALUE1 variable contains character values and formatted versions of the numbers.

The post What is a factoid in SAS? appeared first on The DO Loop.

November 1, 2017
 

A SAS/IML programmer asked whether you can pass the name of a function as an argument to a SAS/IML module and have the module call the function that is passed in. The answer is "yes." The basic idea is to create a string that represents the function call and then use CALL EXECUTE to evaluate the function.

Suppose that you want to create a "super function" (call it EvalFunc) that can compute several univariate statistics such as the mean, the median, the standard deviation, and so forth. Each of these descriptive statistics is calculated by a SAS/IML function that takes one argument. You can therefore define the EvalFunc function to take two arguments: the name of a SAS/IML function and a vector of data. For example, the syntax m = EvalFunc("mean", x) would evaluate mean(x) whereas the syntax s = EvalFunc("std", x) would evaluate std(x). The following SAS/IML function creates a string of the form "z=func(x);" and uses CALL EXECUTE to run that statement:

proc iml;
start EvalFunc(func, x);
   cmd = "z = " + func + "(x);";    /* z = func(x); */
   call execute( cmd );
   return z;
finish;

You can test the EvalFunc module by passing the names of some SAS/IML functions that compute descriptive statistics:

y = {1,2,3,4,5,6,7,8,20};
stats = {"mean" "median" "std" "var" "skewness" "kurtosis"};
results = j(1, ncol(stats), .);
do i = 1 to ncol(stats);
   results[i] = EvalFunc(stats[i], y);
end;
 
print results[colname=stats];

In the example, there is no error checking. It assumes that you pass in a valid function name that takes one argument. Notice that CALL EXECUTE runs immediately in SAS/IML and runs in the local scope. This is in contrast to the way that CALL EXECUTE works in the SAS DATA step. The DATA step pushes statements onto a queue and executes the queue after the DATA step completes.
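For readers who know Python, the dispatch-by-name idea has a direct analog: look up a callable by its string name and apply it to the data. This sketch (not SAS code) uses the standard statistics module to play the role of the SAS/IML functions:

```python
# A Python analog of the EvalFunc idea: look up a function by its name and
# apply it. The getattr lookup plays the role that CALL EXECUTE plays in
# the SAS/IML module. As in the SAS example, there is no error checking.
import statistics

def eval_func(func_name, x):
    """Evaluate the named one-argument function on x."""
    func = getattr(statistics, func_name)   # e.g., "mean" -> statistics.mean
    return func(x)

y = [1, 2, 3, 4, 5, 6, 7, 8, 20]
print(eval_func("mean", y))      # 6.222...
print(eval_func("median", y))    # 5
```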

If you add some IF-THEN/ELSE logic, you could support alternate values of the func argument. For example, if upcase(func)="Q1", you could define cmd = "call qntl(z, x, 0.25);". When you pass that string to CALL EXECUTE, it calls the QNTL function to compute the first quantile (25th percentile).

If you use the SAS/IML Studio application, you can use the ALIAS statement to directly call a function based on the name and signature of that function. The ALIAS statement is not supported in PROC IML.

In summary, this article shows that you can pass the name of a function to a SAS/IML module and use CALL EXECUTE to evaluate the function.

The post Evaluate a function by using the function name in SAS/IML appeared first on The DO Loop.

October 30, 2017
 

This article demonstrates a SAS programming technique that I call Kuhfeld's template modification technique. The technique enables you to dynamically modify an ODS template and immediately call the modified template to produce a new graph or table. By following the five steps in this article, you can implement the technique in less than 20 lines of SAS code.

Kuhfeld's Template Modification Technique (TMT)

The template modification technique (TMT) is one of several advanced techniques that Warren Kuhfeld created and promoted in a series of articles that culminated with Kuhfeld (2016). (Slides are also available.) The technique that I call the TMT is a DATA _NULL_ program that modifies a SAS-supplied template, uses CALL EXECUTE to compile it "on the fly," and immediately calls the modified template to render a new graph (pp. 12ff in the paper; p. 14 of his slides).

When I attended Warren's 2016 presentation, I felt a bit overwhelmed by the volume of information. It was difficult for me to see the forest because of all the interconnected "trees." Graph customization is easier to understand if you restrict your attention only to GTL templates. You can make simple modifications to a template (for example, changing the value of an option) by using the following five steps:

  1. Prepend a private item store to the ODS search path. The private item store will store the modified template.
  2. Write the existing template to a temporary text file.
  3. Run a DATA step that reads the existing template one line at a time. Check each line to see if it contains the option that you want to change. If so, modify the line to change the template's behavior.
  4. Compile the modified template into the private item store.
  5. Call PROC SGRENDER to render the graph by using the new template.

This article is intentionally brief. For more information see the chapter "ODS Graphics Template Modification" in the SAS/STAT documentation.

Most of Kuhfeld's examples use the TMT to modify ODS templates that are distributed with SAS. These templates can be complex and often contain conditional logic. In an effort to produce a simpler example, the next section creates a simple template that uses a heat map to display a response variable on a uniform grid.

Example: A template that specifies a color ramp

Programmers who use the GTL to create graphs have probably encountered dynamic variables and run-time macro variables. You can use these to specify certain options at run time, rather than hard-coding information into a template. However, the documentation states that "dynamic variables ... cannot resolve to statement or option keywords." In particular, the GTL does not support changing the color ramp at run time: whatever color ramp is hard-coded in the template is the color ramp that is used. (Note: If you have SAS 9.4M3, the HEATMAPPARM statement in PROC SGPLOT supports specifying a color ramp at run time.)

However, you can use Kuhfeld's TMT to modify the color ramp at run time. The following GTL defines a template named "HEATMAP." The template uses the HEATMAPPARM statement and hard-codes the COLORMODEL= option to be the "ThreeColorRamp" in the current ODS style. The template is adapted from the article "Creating a basic heat map in SAS."

/* Construct example template: a basic heat map with continuous color ramp */
proc template; 
define statgraph HEATMAP;
dynamic _X _Y _Z;           /* dynamic variables */
 begingraph;
 layout overlay;
    heatmapparm x=_X y=_Y colorresponse=_Z /  /* specify variables at run time */
       name="heatmap" primary=true xbinaxis=false ybinaxis=false
       colormodel = THREECOLORRAMP;           /* <== hard-coded color ramp */
    continuouslegend "heatmap";
  endlayout;
endgraph;
end;
run;

When the PROC TEMPLATE step runs, the HEATMAP template is stored in a location that is determined by your ODS search path. You can run ods path show; to display the search path. You can discover where the HEATMAP template is stored by running proc template; list HEATMAP; run;.

The following DATA step defines (x,y) data on a uniform grid and defines z to be a cubic function of x and y. The heat map shows an approximate contour plot of the cubic function:

/* Construct example data: A cubic function z = f(x,y) on regular grid */
data Cubic;
do x = -1 to 1 by 0.05;
   do y = -1 to 1 by 0.05;
      z = x**3 - y**2 - x + 0.5;
      output;
   end;
end;
run;
 
/* visualize the data using the built-in color ramp */
proc sgrender data=Cubic template=Heatmap; 
   dynamic _X='X' _Y='Y' _Z='Z';
run;
Heat map rendered with original template

The remainder of this article shows how to use Kuhfeld's TMT to create a new template which is identical to the HEATMAP template except that it uses a different color ramp. Kuhfeld's technique copies the existing template line by line but replaces the text "THREECOLORRAMP" with text that defines a new custom color ramp.

In the following program, capital letters are used for the name of the existing template, the name of the new template, and the value of the new color ramp. Most of the uncapitalized terms are "boilerplate" for the TMT, although each application will require unique logic in the DATA _NULL_ step.

Step 1: Prepend a private item store to the ODS search path

PROC TEMPLATE writes templates to the first (writable) item store on the ODS search path. PROC SGRENDER looks through the ODS search path until it locates a template with a given name. The first step is to prepend a private item store to the search path, as shown by the following statement:

/* 1. Modify the search path for ODS. Prepend an item store in WORK */
ods path (prepend) work.modtemp(update);

The ODS PATH statement ensures that the modified template will be stored in WORK.MODTEMP. You can provide any name you want. The name WORK.TEMPLAT is a convention that is used in the SAS documentation. Because you might already have WORK.TEMPLAT on your ODS path, I used a different name. (Note: Any item stores in WORK are deleted when SAS exits.) The SAS/STAT documentation provides more information about the template search path.

Step 2: Write the existing template to a temporary text file

You need to write the source code for the HEATMAP template to a plain text file so that it can be read by the SAS DATA step. You can use the FILENAME statement to point to a hard-coded location (such as filename t "C:/temp/tmplt.txt";), but I prefer to use the SAS keyword TEMP, which is more portable. The TEMP keyword tells SAS to create a temporary file name in a location that is independent of the operating system and the file system:

/* 2. Write body of existing template to text file */
filename _tmplt TEMP;             /* create temporary file and name */
proc template;
   source HEATMAP / file=_tmplt;  /* write template source to text file */
run;

If you omit the FILE= option, the template source is written to the SAS log. The text file contains only the body of the HEATMAP template.

Steps 3 and 4: Create a new template from the old template

The next step reads and modifies the existing template line by line. There are many SAS character functions that enable you to discover whether some keyword exists in a line of text. Examples include the INDEX and FIND family of functions, the PRXCHANGE function (which uses regular expressions), and the TRANWRD function.

The goal of the current example is to replace the word "THREECOLORRAMP" by another string. The TRANWRD function replaces the string if it is found; the line is unchanged if the word is not found. Therefore it is safe to call the TRANWRD function on every line in the existing template.
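In generic terms, the per-line logic is a substitution that is a no-op when the keyword is absent. A short Python sketch of the same idea (outside SAS; the ramp string is the one used later in this article):

```python
# Sketch of the substitution step: replace the keyword when present, leave
# the line unchanged otherwise, so it is safe to apply to every line.
NEW_RAMP = "(blue cyan yellow red)"

def rewrite_line(line):
    """Replace the hard-coded color ramp keyword, if any, in one template line."""
    return line.replace("THREECOLORRAMP", NEW_RAMP)

template = [
    "   heatmapparm x=_X y=_Y colorresponse=_Z /",
    "      colormodel = THREECOLORRAMP;",
]
for line in template:
    print(rewrite_line(line))   # only the second line is modified
```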

You can use the CALL EXECUTE function to place each (possibly modified) line of the template onto a queue, which will be executed when the DATA step completes. However, the text file does not start with a PROC TEMPLATE statement nor end with a RUN statement. You must add those statements to the execution queue in order to compile the template, as follows:

/* one possible way to specify a new color ramp */
%let NEWCOLORRAMP = (blue cyan yellow red);
 
/* 3. Read old template; make modifications.
   4. Use CALL EXECUTE to compile (modified) template with a new name. */
data _null_;                       /* Modify graph template to replace default color ramp */
   infile _tmplt end=eof;          /* read from text file; last line sets EOF flag to true */
   input;                          /* read one line at a time */
   if _n_ = 1 then 
       call execute('proc template;');      /* add 'PROC TEMPLATE;' before */
 
   if findw(_infile_, 'define statgraph') then
      _infile_ = "define statgraph NEWHEATMAP;";  /* optional: assign new template name */
   /* use TRANWRD to make substitutions here */
   _infile_ = tranwrd(_infile_, 'THREECOLORRAMP', "&NEWCOLORRAMP"); /* replace COLORMODEL= option */
 
   call execute(_infile_);                  /* push line onto queue; compile after DATA step */
   if eof then call execute('run;');        /* add 'RUN;' at end */
run;

The DATA step copies the existing template source, but modifies the template name and the COLORMODEL= option. The CALL EXECUTE statements push each (potentially modified) statement onto a queue that is executed when the DATA step exits. The result is that PROC TEMPLATE compiles a new template and stores it in the WORK.MODTEMP item store. Recall that the TRANWRD function and other string-manipulation functions are case sensitive, so be sure to use the exact capitalization that exists in the template. In other words, this technique requires that you know details of the template that you wish to modify!

If you are using this technique to modify one of the built-in SAS templates that are used by a SAS procedure, do not modify the template name.

Inspect the modified template

Before you attempt to use the new template, you might want to confirm that it is correct. The following call to PROC TEMPLATE shows the location and creation date for the new template. The source for the template is displayed in the log so that you can verify that the substitutions were made correctly.

proc template;
   list NEWHEATMAP / stats=created; /* location of the modified template */
   source NEWHEATMAP;       /* confirm that substitutions were performed */
run;

Step 5: Render the graph by using the new template

The final step is to call PROC SGRENDER to use the newly created template to create the new graph:

/* 5. Render graph by using new template */
proc sgrender data=Cubic template=NEWHEATMAP; 
   dynamic _X='X' _Y='Y' _Z='Z';
run;
Render graph with modified template

The output demonstrates that the original color ramp ("ThreeColorRamp") has been replaced by a four-color ramp that uses the colors blue, cyan, yellow, and red.

When is this necessary?

You might wonder if the result is worth all this complexity and abstraction. Couldn't I have merely copied the template into a program editor and changed the color ramp? Yes, you can use a program editor to modify a template for your personal use. However, this programming technique enables you to write macros, tasks, and SAS/IML functions that other people can use. For example, you could encapsulate the program in this article into a %CreateHeatmap macro that accepts a color ramp and data set as an argument. Or in SAS/IML you can write modules such as the HEATMAPCONT and HEATMAPDISC subroutines, which enable SAS/IML programmers to specify any color ramp to visualize a matrix.

If you work in a regulated industry and are modifying templates that SAS distributes, this technique provides reproducibility. A reviewer or auditor can run a program to reproduce a graph.

Summary

Kuhfeld's template modification technique (TMT) enables you to write a SAS program that modifies an ODS template. The TMT combines many SAS programming techniques and can be difficult to comprehend when you first see it. This article breaks down the technique into five simpler steps. You can use the TMT to edit ODS templates that are distributed by SAS and used in SAS procedures, or (as shown in this article) to change options at run time for a template that you wrote yourself. You can download the SAS program for this article.

Happy Halloween to all my readers. I hope you'll agree that Kuhfeld's TRICKS are a real TREAT!

The post A SAS programming technique to modify ODS templates appeared first on The DO Loop.

October 25, 2017
 

This article describes the advantages and disadvantages of principal component regression (PCR). This article also presents alternative techniques to PCR.

In a previous article, I showed how to compute a principal component regression in SAS. Recall that principal component regression is a technique for handling near collinearities among the regression variables in a linear regression. The PCR algorithm in most statistical software is more correctly called "incomplete" PCR because it uses only a subset of the principal components. Incomplete PCR means that you compute the principal components for the explanatory variables, keep only the first k principal components (which explain most of the variance among the regressors), and regress the response variable onto those k components.

The principal components that are dropped correspond to the near collinearities. Consequently, the standard errors of the parameter estimates are reduced, although the tradeoff is that the estimates are biased, and "the bias increases as more [principal components] are dropped" (Jackson, p. 276).

Some arguments in this article are from J. E. Jackson's excellent book, A User's Guide to Principal Components, (1991, pp. 271–278). Jackson introduces PCR and then immediately cautions against using it (p. 271). He writes that PCR "is a widely used technique," but "it also has some serious drawbacks." Let's examine the advantages and disadvantages of principal component regression.

Advantages of principal component regression

Principal component regression is a popular and widely used method. Advantages of PCR include the following:

  • PCR can perform regression when the explanatory variables are highly correlated or even collinear.
  • PCR is intuitive: you replace the basis {X1, X2, ..., Xp} with an orthogonal basis of principal components, drop the components that do not explain much variance, and regress the response onto the remaining components.
  • PCR is automatic: The only decision you need to make is how many principal components to keep.
  • The principal components that are dropped give insight into which linear combinations of variables are responsible for the collinearities.
  • PCR has a discrete parameter, namely the number of components kept. This parameter is very interpretable in terms of geometry (linear dimensions kept) and in terms of linear algebra (low-rank approximations).
  • You can run PCR when there are more variables than observations (wide data).

Drawbacks of principal component regression

The algorithm that is currently known as PCR is actually a misinterpretation of the original ideas behind PCR (Jolliffe, 1982, p. 201). When Kendall and Hotelling first proposed PCR in the 1950s, they proposed "complete" PCR, which means replacing the original variables by all the principal components, thereby stabilizing the numerical computations. Which principal components are included in the final model is determined by looking at the significance of the parameter estimates. By the early 1980s, the term PCR had changed to mean "incomplete PCR."

The primary argument against using (incomplete) principal component regression can be summarized in a single sentence: Principal component regression does not consider the response variable when deciding which principal components to drop. The decision to drop components is based only on the magnitude of the variance of the components.

There is no a priori reason to believe that the principal components with the largest variance are the components that best predict the response. In fact, it is trivial to construct an artificial example in which the best predictor is the last component, which will surely be dropped from the analysis. (Just define the response to be the last principal component!) More damning, Jolliffe (1982, p. 302) presents four examples from published papers that advocate PCR, and he shows that some of the low-variance components (which were dropped) have greater predictive power than the high-variance components that were kept. Jolliffe concludes that "it is not necessary to find obscure or bizarre data in order for the last few principal components to be important in principal component regression. Rather it seems that such examples may be rather common in practice."
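The objection is easy to verify numerically. The following NumPy sketch (synthetic data, not drawn from the cited papers) defines the response to be the last principal component; the high-variance components that incomplete PCR keeps then have essentially zero correlation with the response, while the component that gets dropped first predicts it perfectly:

```python
# Demonstration: when the response IS the last principal component, the
# components that incomplete PCR keeps have no predictive power.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.diag([10.0, 3.0, 0.5])  # unequal variances
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                 # PC scores, ordered by decreasing variance
y = scores[:, -1]                  # response is the last (smallest) component

corr = np.corrcoef(np.column_stack([scores, y]), rowvar=False)[-1, :-1]
print(np.round(corr, 3))           # correlations of y with PC1, PC2, PC3
```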

There is a hybrid version of PCR that enables you to use cross validation and the predicted residual sum of squares (PRESS) criterion to select how many components to keep. (In SAS, the syntax is proc pls method=PCR cv=one cvtest(stat=press).) Although this partially addresses the issue by including the response variable in the selection of components, it is still the case that the first k components are selected and the last p – k are dropped. The method never keeps the first, third, and sixth components, for example.

Alternatives to principal component regression

Some alternatives to principal component regression include the following:

  • Ridge regression: In ridge regression, a diagonal matrix is added to the X`X matrix so that it becomes better conditioned. This results in biased parameter estimates. You can read an explanation of ridge regression and how to compute it by using PROC REG in SAS.
  • Complete PCR: As mentioned previously, use the PCs as the variables and keep the components whose parameter estimates are significant.
  • Complete PCR with variable selection: Use the PCs as the variables and use variable-selection techniques to decide which components to retain. However, if your primary goal is variable reduction, then use variable-selection techniques on the original variables.
  • Partial Least Squares (PLS): Partial least squares regression is similar to PCR in that both select components that explain the most variance in the model. The difference is that PLS incorporates the response variable. That is, the components that are produced are those that explain the most variance in the explanatory AND response variables. In SAS, you can compute a PLS regression by using PROC PLS with METHOD=PLS or METHOD=SIMPLS. You will probably also want to use the CV and CVTEST options.
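As a concrete illustration of the first alternative, here is a minimal NumPy sketch of ridge regression (the data and the ridge parameter lam are invented for illustration; in SAS you would use the RIDGE= option in PROC REG instead): adding a multiple of the identity to X`X makes the normal equations solvable even when the regressors are exactly collinear.

```python
# A NumPy sketch of ridge regression: solve (X'X + lam*I) b = X'y.
# The estimates are biased but numerically stable.
import numpy as np

def ridge(X, y, lam):
    """Return ridge-regression coefficients for penalty parameter lam > 0."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))
X[:, 2] = X[:, 0]                  # exactly collinear: X'X is singular
y = X[:, 0] + rng.normal(scale=0.1, size=30)
print(ridge(X, y, lam=1.0))        # solvable despite the collinearity
```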

Summary

In summary, principal component regression is a technique for computing regressions when the explanatory variables are highly correlated. It has several advantages, but the main drawback of PCR is that the decision about how many principal components to keep does not depend on the response variable. Consequently, some of the variables that you keep might not be strong predictors of the response, and some of the components that you drop might be excellent predictors. A good alternative is partial least squares regression, which I recommend. In SAS, you can run partial least squares regression by using PROC PLS with METHOD=PLS.

References

  • Jackson, J. E. (1991). A User's Guide to Principal Components. New York: Wiley.
  • Jolliffe, I. T. (1982). "A Note on the Use of Principal Components in Regression." Applied Statistics, 31(3).

The post Should you use principal component regression? appeared first on The DO Loop.