15.1

2月 102021
 

One of the first things I learned in SAS was how to use PROC PRINT to display parts of a data set. I usually do not need to see all the data, so my favorite way to use PROC PRINT is to use the OBS= data set option to display the first few rows. For example, I often display the first five rows of a SAS data set as follows:

proc print data=Sashelp.Class(obs=5);
    * VAR Weight Height Age;  /* optional: the VAR statement specifies variables */
run;

By using the OBS= data set option, you can display only a few observations. This enables you to take a quick peek at the values of your data. As shown in the comment, you can optionally use the VAR statement to display only certain columns. (Use the FIRSTOBS= option if you want more control over the rows that are printed.)

Display the rows of a data table in SAS/IML

In a SAS/IML program, data are either stored in a table or in a matrix. If the data are in a table, you can use the TABLEPRINT subroutine to display the data. The NUMOBS= option enables you to display only a few rows:

proc iml;
TblClass = TableCreateFromDataset("sashelp", "class");
run TablePrint(TblClass) numobs=5;

The TABLEPRINT subroutine supports many options for printing, including the VAR= option for specifying only certain columns.

Display the rows of a matrix in SAS/IML

How can you display only a portion of a SAS/IML matrix? I often write statements like this to print only the first few rows:

/* read numerical data into the X matrix */
use Sashelp.Class; read all var _NUM_ into X[c=varNames]; close;
print (X[1:5,]);    /* print only rows 1 through 5 */

This works, but there are a few things that I don't like. My primary complaint is that X[1:5,] is a temporary matrix and therefore has no name. The rows are printed, but there is no header that tells me where the data came from. My second complaint is that the output does not indicate which rows are being displayed. Consequently, sometimes I include information in the label and add row and column headers:

print (X[1:5,])[rowname=(1:5) colname=varNames label="Top of X"];

The output now includes the information that I want, but that is a LOT of typing, especially if I want to display similar information for other matrices. Even if I use the abbreviated version of the PRINT options (R=, C=, and L=), it is cumbersome to type.

By the way, this PRINT statement demonstrates a new feature of SAS/IML 15.1 (which was released with SAS 9.4M6), which is that the ROWNAME= and COLNAME= options on the PRINT statement support numerical vectors. If you have an earlier version of SAS/IML, you can use rowname=(char(1:5)).

HEAD: A module to print the top rows of a matrix

There's a saying among computer programmers: if you find yourself writing the same statements again and again, create a function to do it. So, let's write a SAS/IML module to print the top rows of a matrix. Because there is a UNIX command called 'head' that displays the top lines of a file, I will use the same name.

Many years ago, I blogged about how to write a HEAD subroutine, although my emphasis was on how to use default arguments in SAS/IML functions. The following routine is a richer version of the previous function:

/* Print the first n rows of a matrix. Optionally, display names of columns */
start Head(x, n=5, colname=);
   m = min(n, nrow(x));         /* make sure n isn't too big */
   idx = 1:m;                   /* the rows to print */
   name = parentname("x");      /* name of symbol in calling environment */
   if name=" " then name = "Temp";  /* the parent name of a temporary variable is " "*/
   labl = "head(" + name + ") rows=" + strip(char(m));  /* construct the label */
   if isSkipped(colname) then  /* print the top rows */
      print (x[idx,])[r=idx label=labl];
   else 
      print (x[idx,])[r=idx c=colname label=labl];
finish;
 
run Head(X) colname=varNames;  /* example: call the HEAD module */

The HEAD subroutine uses three features of user-defined modules that you might not know about:

The result is a short way to display the top few rows of a matrix.

TAIL: A module to print the bottom rows of a matrix

Although I usually want to print the top row of a matrix, it is easy to modify the HEAD module to display the last n rows of a matrix. The following module, called TAIL, is almost identical to the HEAD module.

/* Print the last n rows of a matrix. Optionally, display names of columns */
start Tail(x, n=5, colname=);
   m = min(n, nrow(x));         /* make sure n isn't too big */
   idx = (nrow(x)-m+1):nrow(x); /* the rows to print */
   name = parentname("x");      /* name of symbol in calling environment */
   if name=" " then name = "Temp";  /* the parent name of a temporary variable is " "*/
   labl = "tail(" + name + ") rows=" + strip(char(m));  /* construct the label */
   if isSkipped(colname) then  /* print the bottom rows */
      print (x[idx,])[r=idx label=labl];
   else 
      print (x[idx,])[r=idx c=colname label=labl];
finish;
 
run Tail(X) colname=varNames;

Summary

Most SAS programmers know how to use the OBS= option in PROC PRINT to display only a few rows of a SAS data set. When writing and debugging programs in the SAS/IML matrix language, you might want to print a few rows of a matrix. This article presents the HEAD module, which displays the top rows of a matrix. For completeness, the article also defines the TAIL module, which displays the bottom rows of a matrix. If you find these modules useful, you can incorporate them into your SAS/IML programs.

For more tips and techniques related to SAS/IML modules, see the article "Everything you wanted to know about writing SAS/IML modules."

The post Print the top rows of your SAS data appeared first on The DO Loop.

7月 312019
 

Sometimes a little thing can make a big difference. I am enjoying a new enhancement of SAS/IML 15.1, which enables you to use a numeric vector as the column header or row header when you print a SAS/IML matrix. Prior to SAS/IML 15.1, you had to first use the CHAR or PUTN function to manually apply a format to the numeric values. You could then use the formatted values as column or row headers. Now you can skip the formatting step.

A common situation in which I want to use numeric values as a column header is when I am using the TABULATE function to compute the frequencies for a discrete numerical variable. For example, the Cylinders variable in the Sashelp.Cars data set indicates the number of cylinders in each vehicle. You might want to know how many vehicles in the data are four-cylinder, six-cylinder, eight-cylinder, and so forth. In SAS/IML 15.1, you can use a numeric variable (such as the levels of the Cylinders variable) directly in the COLNAME= option of the PRINT statement, as follows:

proc iml;
use Sashelp.Cars;
read all var "Cylinders" into X;
close;
 
call tabulate(level, freq, X);      /* count number of obs for each level of X */
/* New in SAS/IML 15.1: use numeric values as COLNAME= option to PRINT statement */
print freq[colname=level];

Prior to SAS/IML 15.1, you had to convert numbers to character values by applying a format. This is commonly done by using the CHAR function, which applies a W.D format. This technique is still useful, but optional. You can also use the PUTN function to apply any format you want, such as COMMA., DATE., TIME., and so forth. For example, if you like Roman numerals, you could apply the ROMAN. format!

labels = char(level, 2);         /* apply w.d format */
print freq[colname=labels];
 
labels = putn(level, "ROMAN.");  /* second arg is name of format */
print freq[colname=labels];

You can also use numeric values for the COLNAME= option to obtain row headers in SAS/IML 15.1. Sure, it's a small enhancement, but I like it!

The post Use numeric values for column headers when printing a matrix appeared first on The DO Loop.

7月 172019
 

In the SAS/IML language, a matrix contains data of one type: numeric or character. If you want to create a SAS data set that contains mixed-type data (numeric and character), SAS/IML 15.1 provides support to write multiple matrices to a data set by using a single statement. Specifically, the CREATE FROM and APPEND FROM statements now support writing multiple matrices of any types. SAS/IML 15.1 was released as part of SAS 9.4m6.

Write mixed-type data from SAS/IML objects

With the new enhancements to the CREATE FROM and APPEND FROM statements, you now have four ways to write mixed type data to a SAS data set:

Write multiple matrices to a data set

In SAS/IML 15.1, you can specify multiple matrices on the CREATE FROM statement. The matrices can be any type. In the following example, X matrix is a numeric matrix and C is a character matrix:

/* read numeric and character vars in one call */
proc iml;
NumerVarNames = {'N' 'N2' 'N3'};
X = { 1  2  3,
      2  4  6,
      3  6  9,
      4  8 12};
charVarNames = {'Animal' 'Flower'};
C = {'Rat'   'Iris', 
     'Pig'   'Rose',
     'Goat'  'Daisy', 
     'Duck'  'Lily'};
 
/* SAS/IML 15.1: write multiple matrices of any type to a SAS data sets */
AllNames = NumerVarNames || CharVarNames;
create MyData from X C [colname=AllNames];  /* specify multiple matrices */
append from X C;                            /* repeat matrix names */
close;
QUIT;
 
proc print data=MyData noobs;
run;

Although the new enhancements to the CREATE FROM and APPEND FROM statements enable you to write mixed-type data to a SAS data set, you can also write multiple matrices regardless of the types. For example, you can use the same technique to write multiple numeric matrices.

Notice that if you want to specify the names of the data set variables, you use a single COLNAME= option at the end of the CREATE FROM statement.

The post Write numeric and character matrices to a data set from SAS/IML appeared first on The DO Loop.

2月 272019
 

Box plots are a great way to compare the distributions of several subpopulations of your data. For example, box plots are often used in clinical studies to visualize the response of patients in various cohorts. This article describes three techniques to visualize responses when the cohorts have a nested or hierarchical structure, such as experimental treatments nested inside of clinics. The techniques are:

  • Use PROC SGPLOT (or PROC SGPANEL): Use the VBOX statement to visualize the nested structure. You can use the CATEGORY= option to specify the "outer" variable and the GROUP= option to specify the "inner" (nested) variable.
  • Use PROC GLM: For a model that has exactly two categorical variables, one nested in the other, PROC GLM automatically creates a nested box plot.
  • Use PROC BOXPLOT: You can use PROC BOXPLOT to create a nested box plot. The procedure supports several options that can enhance the visualization.

An example of nested data: Leaves on plants

Did you know that turnip greens are an excellent source of calcium? The following example is from the PROC NESTED documentation and is based on data analyzed in Snedecor and Cochran (Statistical Methods, 6th ed., 1967, p. 286). The original data is from an experiment in which four random turnip green plants are selected, then three leaves are randomly selected on each plant. From each leaf, two 100-mg samples were selected and used to determine the amount of calcium. (The units for calcium are percentage of dry weight.) Because the box plot for a two-element sample is trivial, I made up four additional fake measurements of calcium for each leaf, as shown in the following DATA step:

/* PROC NESTED example. First two measurements are real data. The last four are fake. */
data Turnip;
do Plant=1 to 4;
   do Leaf=1 to 3;
      do Sample=1 to 6;
         input Calcium @@;  output;
      end;
   end;
end;
/* 
|--REAL--|  |----- FAKE ------|  */
datalines;
3.28 3.09   3.26 3.19 3.27 3.01 
3.52 3.48   3.53 3.47 3.50 3.49 
2.88 2.80   2.98 2.70 2.28 2.81 
2.46 2.44   2.58 2.30 2.61 2.48 
1.87 1.92   1.97 1.90 1.86 1.90 
2.19 2.19   2.21 2.17 2.23 2.19 
2.77 2.66   2.79 2.63 2.72 2.69 
3.74 3.44   3.75 3.45 3.73 3.34 
2.55 2.55   2.59 2.51 2.49 2.61 
3.78 3.87   3.79 3.81 3.88 3.76 
4.07 4.12   4.23 4.27 4.01 4.08 
3.31 3.31   3.30 3.34 3.33 3.30 
;
 
title 'Calcium Concentration in Turnip Leaves';
title2 'Leaves Nested Within Plants';
footnote J=L 'Based on Snedecor and Cochran (1967, p. 286)';
proc sgplot data=Turnip;
   vbox Calcium / Group=Leaf category=Plant;
   xaxis discreteorder=data;
run;
Nested boxplots created by PROC SGPLOT in SAS

This simple visualization uses the VBOX statement in PROC SGPLOT. As I explained previously, you can use the CATEGORY= and GROUP= options to display the distribution of calcium for the joint levels of the two categorical variables. The CATEGORY= option specifies the horizontal variable; the GROUP= option specifies the levels of a second variable. In this case, the levels of the LEAF variable are nested inside the levels of the PLANT variable. The result is shown. Colors are used to identify the level of the GROUP= variable.

If there were additional levels of nesting (for examples, multiple farms), you could use the SGPANEL procedure and include the additional variables on the PANELBY statement.

Visualize a nested ANOVA model

You can improve the previous graph by visually dividing the leaves in one plant from the leaves in another. You can use PROC GLM to automatically display the divisions when you analyze a model for which the explanatory categorical variables are nested. The following call to PROC GLM specifies a nested mode. The "NestPlot" graph is created automatically when you use ODS GRAPHICS:

ods graphics on;
proc glm data=Turnip;
   class Plant Leaf;
   model Calcium = Leaf(Plant);   /* Leaf nested in Plant */
quit;
Nested boxplots created by PROC GLM in SAS

As you can see, the graph is the same except that it contains vertical lines that divide one plant from another. I like this version of the graph better.

Box plots for independent units

If you think about it, the color in the previous plot is not necessary. You might even consider it misleading because the leaf values are unrelated across plants. "Leaf 1" for "Plant 1" has no relationship to "Leaf 1" for the other plants. To emphasize that fact, you could relabel the leaves by using the values 1 through 12, where leaves 1–3 are from Plant=1, leaves 4–6 are from Plant=2, and so forth.

Labeling the leaves that way is necessary if you want to use to BOXPLOT procedure to visualize the data. The BOXPLOT procedure supports nested categories and the syntax is similar to the GLM syntax:

data Turnip2;
set Turnip;
LeafID = Leaf + (Plant-1)*3;   /* label samples 1, 2, 3, ..., 12 */
run;
 
proc boxplot data=Turnip2;
   plot Calcium*LeafID(Plant) / odstitle=title odstitle2=title2 odsfootnote=footnote;
run;
Nested boxplots created by PROC BOXPLOT in SAS

The output from the BOXPLOT procedure uses column headers to indicate the plants. This is similar to the visualization that I used to label categories for tropical storms and hurricanes.

You can use three features of SAS/STAT 15.1 (SAS 9.4M6) to add vertical reference lines, to color the background, and to center the plant labels in the column headers. The options are the BLOCKREF option, the BLOCKREFFILL option, and the BLOCKVALUEPOS= option, respectively.

/* Options for extending the column headers into the plot region.
   These options require SAS/STAT 15.1 */
proc boxplot data=Turnip2;
   plot Calcium*LeafID(Plant) / odstitle=title odstitle2=title2 odsfootnote=footnote
                      blockref blockreffill blockvaluepos=center;
run;
Nested boxplots created by PROC BOXPLOT in SAS

I like this graph because it uses colors sparingly and does not require a legend. Also, if the number of samples is large (such as 30 or 50), PROC BOXPLOT will automatically produce a series of graphs, each displaying a portion of the data.

In summary, this article shows three ways to create box plots of responses for nested categorical data. The simplest is to use the VBOX statement in PROC SGPLOT, although if there are additional categorical variables in the model, you can create a lattice of box plots by using the PANELBY statement in PROC SGPANEL. The GLM procedure enables you to perform an ANOVA analysis for the data and also create a visualization. The BOXPLOT procedure provides nice headers and (in SAS/STAT 15.1) colored strips for each level of the "outer" categories.

The post 3 ways to create nested box plots in SAS appeared first on The DO Loop.

2月 272019
 

Box plots are a great way to compare the distributions of several subpopulations of your data. For example, box plots are often used in clinical studies to visualize the response of patients in various cohorts. This article describes three techniques to visualize responses when the cohorts have a nested or hierarchical structure, such as experimental treatments nested inside of clinics. The techniques are:

  • Use PROC SGPLOT (or PROC SGPANEL): Use the VBOX statement to visualize the nested structure. You can use the CATEGORY= option to specify the "outer" variable and the GROUP= option to specify the "inner" (nested) variable.
  • Use PROC GLM: For a model that has exactly two categorical variables, one nested in the other, PROC GLM automatically creates a nested box plot.
  • Use PROC BOXPLOT: You can use PROC BOXPLOT to create a nested box plot. The procedure supports several options that can enhance the visualization.

An example of nested data: Leaves on plants

Did you know that turnip greens are an excellent source of calcium? The following example is from the PROC NESTED documentation and is based on data analyzed in Snedecor and Cochran (Statistical Methods, 6th ed., 1967, p. 286). The original data is from an experiment in which four random turnip green plants are selected, then three leaves are randomly selected on each plant. From each leaf, two 100-mg samples were selected and used to determine the amount of calcium. (The units for calcium are percentage of dry weight.) Because the box plot for a two-element sample is trivial, I made up four additional fake measurements of calcium for each leaf, as shown in the following DATA step:

/* PROC NESTED example. First two measurements are real data. The last four are fake. */
data Turnip;
do Plant=1 to 4;
   do Leaf=1 to 3;
      do Sample=1 to 6;
         input Calcium @@;  output;
      end;
   end;
end;
/* 
|--REAL--|  |----- FAKE ------|  */
datalines;
3.28 3.09   3.26 3.19 3.27 3.01 
3.52 3.48   3.53 3.47 3.50 3.49 
2.88 2.80   2.98 2.70 2.28 2.81 
2.46 2.44   2.58 2.30 2.61 2.48 
1.87 1.92   1.97 1.90 1.86 1.90 
2.19 2.19   2.21 2.17 2.23 2.19 
2.77 2.66   2.79 2.63 2.72 2.69 
3.74 3.44   3.75 3.45 3.73 3.34 
2.55 2.55   2.59 2.51 2.49 2.61 
3.78 3.87   3.79 3.81 3.88 3.76 
4.07 4.12   4.23 4.27 4.01 4.08 
3.31 3.31   3.30 3.34 3.33 3.30 
;
 
title 'Calcium Concentration in Turnip Leaves';
title2 'Leaves Nested Within Plants';
footnote J=L 'Based on Snedecor and Cochran (1967, p. 286)';
proc sgplot data=Turnip;
   vbox Calcium / Group=Leaf category=Plant;
   xaxis discreteorder=data;
run;
Nested boxplots created by PROC SGPLOT in SAS

This simple visualization uses the VBOX statement in PROC SGPLOT. As I explained previously, you can use the CATEGORY= and GROUP= options to display the distribution of calcium for the joint levels of the two categorical variables. The CATEGORY= option specifies the horizontal variable; the GROUP= option specifies the levels of a second variable. In this case, the levels of the LEAF variable are nested inside the levels of the PLANT variable. The result is shown. Colors are used to identify the level of the GROUP= variable.

If there were additional levels of nesting (for examples, multiple farms), you could use the SGPANEL procedure and include the additional variables on the PANELBY statement.

Visualize a nested ANOVA model

You can improve the previous graph by visually dividing the leaves in one plant from the leaves in another. You can use PROC GLM to automatically display the divisions when you analyze a model for which the explanatory categorical variables are nested. The following call to PROC GLM specifies a nested mode. The "NestPlot" graph is created automatically when you use ODS GRAPHICS:

ods graphics on;
proc glm data=Turnip;
   class Plant Leaf;
   model Calcium = Leaf(Plant);   /* Leaf nested in Plant */
quit;
Nested boxplots created by PROC GLM in SAS

As you can see, the graph is the same except that it contains vertical lines that divide one plant from another. I like this version of the graph better.

Box plots for independent units

If you think about it, the color in the previous plot is not necessary. You might even consider it misleading because the leaf values are unrelated across plants. "Leaf 1" for "Plant 1" has no relationship to "Leaf 1" for the other plants. To emphasize that fact, you could relabel the leaves by using the values 1 through 12, where leaves 1–3 are from Plant=1, leaves 4–6 are from Plant=2, and so forth.

Labeling the leaves that way is necessary if you want to use to BOXPLOT procedure to visualize the data. The BOXPLOT procedure supports nested categories and the syntax is similar to the GLM syntax:

data Turnip2;
set Turnip;
LeafID = Leaf + (Plant-1)*3;   /* label samples 1, 2, 3, ..., 12 */
run;
 
proc boxplot data=Turnip2;
   plot Calcium*LeafID(Plant) / odstitle=title odstitle2=title2 odsfootnote=footnote;
run;
Nested boxplots created by PROC BOXPLOT in SAS

The output from the BOXPLOT procedure uses column headers to indicate the plants. This is similar to the visualization that I used to label categories for tropical storms and hurricanes.

You can use three features of SAS/STAT 15.1 (SAS 9.4M6) to add vertical reference lines, to color the background, and to center the plant labels in the column headers. The options are the BLOCKREF option, the BLOCKREFFILL option, and the BLOCKVALUEPOS= option, respectively.

/* Options for extending the column headers into the plot region.
   These options require SAS/STAT 15.1 */
proc boxplot data=Turnip2;
   plot Calcium*LeafID(Plant) / odstitle=title odstitle2=title2 odsfootnote=footnote
                      blockref blockreffill blockvaluepos=center;
run;
Nested boxplots created by PROC BOXPLOT in SAS

I like this graph because it uses colors sparingly and does not require a legend. Also, if the number of samples is large (such as 30 or 50), PROC BOXPLOT will automatically produce a series of graphs, each displaying a portion of the data.

In summary, this article shows three ways to create box plots of responses for nested categorical data. The simplest is to use the VBOX statement in PROC SGPLOT, although if there are additional categorical variables in the model, you can create a lattice of box plots by using the PANELBY statement in PROC SGPANEL. The GLM procedure enables you to perform an ANOVA analysis for the data and also create a visualization. The BOXPLOT procedure provides nice headers and (in SAS/STAT 15.1) colored strips for each level of the "outer" categories.

The post 3 ways to create nested box plots in SAS appeared first on The DO Loop.

2月 252019
 

When I run a bootstrap analysis, I create graphs to visualize the distribution of the bootstrap statistics. For example, in my article about how to bootstrap the difference of means in a two-sample t test, I included a histogram of the bootstrap distribution and added reference lines to indicate a confidence interval for the difference of means.

For t tests, the TTEST procedure supports the BOOTSTRAP statement, which automates the bootstrap process for standard one- and two-sample t tests. A new feature in SAS/STAT 15.1 (SAS 9.4M6) is that the TTEST procedure supports the PLOTS=BOOTSTRAP option, which automatically creates histograms, Q-Q plots, and scatter plots of various bootstrap distributions.

To demonstrate the new PLOTS=BOOTSTRAP option, I will use the same example that I used to demonstrate the BOOTSTRAP statement. The data are the sedans and SUVs in the Sashelp.Cars data. The research question is to estimate the difference in fuel efficiency, as measured by miles per gallon during city driving. The bootstrap analysis enables you to visualize the approximate sampling distribution of the difference-of-mean statistic and its standard deviation. The following statements create the data and run PROC TTEST to generate the analysis. The PLOTS=BOOTSTRAP option generates the bootstrap graphs. The BOOTSTRAP statement request 5,000 bootstrap resamples and 95% confidence intervals, based on the percentiles of the bootstrap distribution:

/* create data set that has two categories: 'Sedan' and 'SUV' */
data Sample;
set Sashelp.Cars(keep=Type MPG_City);
if Type in ('Sedan' 'SUV');
run;
 
ods trace on;
title "Bootstrap Estimates with Percentile CI";
proc ttest data=Sample plots=bootstrap;
   class Type;
   var MPG_City;
   bootstrap / seed=123 nsamples=5000 bootci=percentile;
run;

The TTEST procedure creates seven tables and eight graphs. The previous article displayed and discussed several tables, so here I display only two of the graphs.

Graph of the bootstrap distribution for the difference of means. The green region indicates the 95% percentile confidence interval, based on the bootstrap samples. Computed by using the PLOTS=BOOSTRAP option in PROC TTEST in SAS/STAT 15.1.

The first graph is a histogram of the bootstrap distribution of the difference between the sample means for each vehicle type. The distribution appears to be symmetric and approximately normal. The middle of the distribution is close to -5, which is the point estimate for the difference in MPG_City between the SUVs and the sedans in the data. How much should we trust that point estimate? The green region indicates that 95% of the bootstrap samples had differences that were in the green area underneath the histogram, which is approximately [-5.9, -4.1]. This is a 95% confidence interval for the difference.

Similar histograms (not shown) are displayed for two other statistics: the pooled standard deviation (of the difference between means) and the Satterthwaite standard deviation. The procedure also creates Q-Q plots of the bootstrap distributions.

Scatter plot of the joint bootstrap distribution for the difference of means and the standard deviation of the difference of means. Computed by using the PLOTS=BOOSTRAP option in PROC TTEST in SAS/STAT 15.1.

The second graph is a scatter plot that shows the joint bootstrap distribution of the mean difference and the pooled standard deviations of the difference, based on 5,000 bootstrap samples. You can see that the statistics are slightly correlated. A sample that has a large (absolute) mean difference also tends to have a relatively large standard deviation. The 95% prediction region for the joint distribution indicates how these two statistics co-vary among random samples.

In summary, the TTEST procedure in SAS/STAT 15.1 supports a new PLOTS=BOOTSTRAP option, which automatically creates many graphs that help you to visualize the bootstrap distributions of the statistics. If you are conducting a bootstrap analysis for a t test, I highly recommend using the plots to visualize the bootstrap distributions and results

The post Graphs of bootstrap statistics in PROC TTEST appeared first on The DO Loop.

2月 202019
 

Last year I published a series of blogs posts about how to create a calibration plot in SAS. A calibration plot is a way to assess the goodness of fit for a logistic model. It is a diagnostic graph that enables you to qualitatively compare a model's predicted probability of an event to the empirical probability. I am happy to report that in SAS/STAT 15.1 (SAS 9.4M6), you can create a calibration plot automatically by using the PLOTS=CALIBRATION option on the PROC LOGISTIC statement.

Calibration plots for a model of a binary response

To demonstrate how to create a calibration plot by using PROC LOGISTIC, consider the simulated data that I analyzed in "Calibration plots in SAS." The data contain a binary response variable, Y, which depends quadratically on a uniformly distributed explanatory variable, X. The following call to PROC LOGISTIC fits a quadratic the model to the data. The new GOF option requests an extensive set of goodness-of-fit statistics and the PLOTS=CALIBRATION option requests a calibration plot:

/* NEW in SAS/STAT 15.1 (SAS 9.4M6): PLOTS=CALIBRATION option in PROC LOGISTIC */
title "Calibration Plot for a Quadratic Model";
title2 "Created by PROC LOGISTIC";
proc logistic data=LogiSim plots=calibration(CLM ShowObs);
   model y(Event='1') = x x*x / GOF;      /* New in 15.1: More goodness-of-fit statistics */
run;
Calibration plot for a quadratic logistic model, created by PROC LOGISTIC in SAS

The calibration plot is shown. (Click to enlarge.) The plot contains a gray diagonal line, which represents perfect calibration. If most of the predicted responses agree with the observed responses, then the blue curve should be close to the diagonal line. That is the case in this example. The light blue band is a 95% confidence region for the loess fit and is created by using the CLM option.

Because I used the SHOWOBS option, the calibration plot displays tiny histograms along the top and bottom of the plot. The histograms indicate the distribution of the Y=0 and Y=1 responses. The article "Use a fringe plot to visualize binary data in logistic models" explains more about how fringe plots can add insight to graphs that involve a binary response variable.

The lower right corner of the calibration plot contains one of the many goodness-of-fit statistics that are computed when you use the GOF option on the MODEL statement. A small p-value would indicate a lack of fit. In this case, there is no reason to suspect a lack of fit. The following table shows other goodness-of-fit tests. None of the p-values are small, so none of the tests indicate lack of fit.

Goodness-of-fit statistics for a quadratic logistic model, created by PROC LOGISTIC in SAS

Calibration plots for a polytomous response

An exciting feature of the calibration plots in PROC LOGISTIC is that you can use them for a polytomous response model. Derr (2013) fits a proportional odds model that predicts the probability of the severity of black-lung disease from the length of exposure to coal dust in 371 coal miners. The response variable, Severity, has the levels 'Severe', 'Moderate', and 'Normal'. The following statement create the data and model and request calibration plots for the model.

/* Data, from McCullagh and Nelder (1989, p. 179), used in Derr (2013, p. 8-10).
   The severity of pneumoconiosis (black lung disease) in coal miners
   and the number of years of exposure.
*/
data Coal; 
input Severity $ @@; 
do i=1 to 8; 
   input Exposure freq @@; 
   log10Exposure=log10(Exposure); 
   output; 
end; 
datalines; 
Normal   5.8 98 15 51 21.5 34 27.5 35 33.5 32 39.5 23 46 12 51.5 4 
Moderate 5.8  0 15  2 21.5  6 27.5  5 33.5 10 39.5  7 46  6 51.5 2 
Severe   5.8  0 15  1 21.5  3 27.5  8 33.5  9 39.5  8 46 10 51.5 5 
;
 
title 'Severity of Black Lung vs Log10(Years Exposure)';
proc logistic data=Coal rorder=data plots=Calibration(CLM);
   freq freq; 
   model Severity(descending) = log10Exposure; 
   effectplot / noobs individual;
run;
Panel of calibration plots for a polytomous proportional-odd model, created by PROC LOGISTIC in SAS

Derr (2013) discusses the results of the analysis, which are not shown here. I've displayed only the calibration plot for the model. Notice that PROC LOGISTIC creates a panel of three calibration plots, one for each response level. The calibration curves all lie close to the diagonal, so the diagnostic plots do not indicate a lack of calibration for any part of the model.

Summary

In summary, the PLOTS=CALIBRATION option in SAS/STAT 15.1 enables you to automatically create a calibration plot. The calibration plot is a diagnostic plot that qualitatively compares a model's predicted and empirical probabilities. You can use the PLOTS=CALIBRATION option on the PROC LOGISTIC statement to create a calibration plot. The CALIBRATION option supports several suboptions, which you can read about in the documentation for the PROC LOGISTIC statement.

You can download the SAS code used in this article, which includes SAS code that demonstrates how to create a calibration plot manually.

The post An easier way to create a calibration plot in SAS appeared first on The DO Loop.

2月 182019
 
Maybe if we think and wish and hope and pray
It might come true.
Oh, wouldn't it be nice?

The Beach Boys

Months ago, I wrote about how to use the EFFECT statement in SAS to perform regression with restricted cubic splines. This is the modern way to use splines in a regression analysis in SAS, and it replaces the need to use older macros such as Frank Harrell's %RCSPLINE macro. I shared my blog post with a colleague at SAS and mentioned that the process could be simplified. In order to specify the placement of the knots as suggested by Harrell (Regression Modeling Strategies, 2010 and 2015), I had to use PROC UNIVARIATE to get the percentiles of the explanatory variable. "Wouldn't it be nice," I said, "if the EFFECT statement could perform that computation automatically?"

I am happy to report that the 15.1 release of SAS/STAT (SAS 9.4M6) includes a new option that makes it easy to place internal knots at percentiles of the data. You can now use the KNOTMETHOD=PERCENTILELIST option on the EFFECT statement to place knots. For example, the following statement places five internal knots at percentiles that are recommended in Harrell's book:
EFFECT spl = spline(x / knotmethod=percentilelist(5 27.5 50 72.5 95));

An example of using restricted cubic in regression in SAS

Restricted cubic splines are also called "natural cubic splines." This section shows how to perform a regression fit by using restricted cubic splines in SAS.

For the example, I use the same Sashelp.Cars data that I used in the previous article. For clarity, the following SAS DATA step renames the Weight and MPG_City variables to X and Y, respectively. If you want to graph the regression curve, you can sort the data by the X variable, but this step is not required to perform the regression.

/* create (X,Y) data from the Sashelp.Cars data. Sort by X for easy graphing. */
data Have;
   set sashelp.cars;
   rename mpg_city = Y  weight = X  model = ID;
run;
 
proc sort data=Have;  by X;  run;

The following call to PROC GLMSELECT includes an EFFECT statement that generates a natural cubic spline basis using internal knots placed at specified percentiles of the data. The MODEL statement fits the regression model and the OUTPUT statement writes an output data set that contains the predicted values. The SGPLOT procedure displays a graph of the regression curve overlaid on the data:

/* fit data by using restricted cubic splines using SAS/STAT 15.1 (SAS 9.4M6) */
ods select ANOVA ParameterEstimates SplineKnots;
proc glmselect data=Have;
   effect spl = spline(X/ details naturalcubic basis=tpf(noint)
             knotmethod=percentilelist(5 27.5 50 72.5 95); /* new in SAS/STAT 15.1 (SAS 9.4M6)  */
   model Y = spl / selection=none;       /* fit model by using spline effects */
   output out=SplineOut predicted=Fit;   /* output predicted values */
quit;
 
title "Restricted Cubic Spline Regression";
title2 "Five Knots Placed at Percentiles";
proc sgplot data=SplineOut noautolegend;
   scatter x=X y=Y;
   series x=X y=Fit / lineattrs=(thickness=3 color=red);
run;

In summary, the new KNOTMETHOD=PERCENTILELIST option on the EFFECT statement simplifies the process of using percentiles of a variable to place internal knots for a spline basis. The example shows knots placed at the 5th, 27.5th, 50th, 72.5th, and 95th percentiles of an explanatory variable. These heuristic values are recommended in Harrell's book. For more details about the EFFECT statement and how the location of knots affects the regression fit, see my previous article "Regression with restricted cubic splines in SAS."

You can download the complete SAS program that generates this example, which requires SAS/STAT 15.1 (SAS 9.4M6). If you have an earlier release of SAS, the program also shows how to perform the same computations by calling PROC UNIVARIATE to obtain the location of the knots.

The post An easier way to perform regression with restricted cubic splines in SAS appeared first on The DO Loop.