Tips and Techniques

12月 032018
 

When a graph includes several markers or line styles, it is often useful to create a legend that explains the relationship between the data and the symbols, color, and line styles in the graph. The SGPLOT procedure does a good job of automatically creating and placing a legend for most graphs. However, sometimes it is useful to override the procedure's default choices. This article describes five tips that you can use to customize the content and placement of legends. The tips are:

  1. Suppress the legend by using the NOAUTOLEGEND option.
  2. Choose which components of the graph appear in the legend by using a KEYLEGEND statement and the NAME= option.
  3. Position the legend by using the LOCATION= and POSITION= option on the KEYLEGEND statement.
  4. Exclude one or more items from a legend by using the EXCLUDE= option on the KEYLEGEND statement (requires SAS 9.4M3).
  5. Consolidate one or more items by using the LEGENDITEM statement (requires SAS 9.4M5).

1. Suppress the legend

By default, the SGPLOT procedure displays a legend when there are multiple plots that are overlaid in the graph. This can be caused by multiple statements or by using the GROUP= option on a statement. If the information in the default legend is redundant, and you might want to suppress it. For example, the following legend is unnecessary because the title explains the data and the regression line. You can uncomment the NOAUTOLEGEND option to suppress the legend.

title "Linear Regression for Weight and Height";
title2 "The legend is unnecessary";
proc sgplot data=Sashelp.Class /* NOAUTOLEGEND */;
   scatter x=Height y=Weight;
   reg x=Height y=Weight / nomarkers;
   footnote J=L "Use the NOAUTOLEGEND option to suppress the legend";
run;
footnote;
You can use the NOAUTOLEGEND option on the PROC SGPLOT statement in SAS to suppress a legend

2. Choose which components appear in the legend

In some graphs that overlay multiple components, some components are self -explanatory and do not need to appear in the legend. You can choose which components appear in the legend by using the NAME= option on the statements and using the KEYLEGEND statement to specify the contents of the legend. For example, the following statements create a graph that consists of a scatter plot, a confidence ellipse, and a regression line. If you only want the confidence ellipse and regression line to appear in the legend, use the NAME= option to identify each component and use the KEYLEGEND statement to specify the contents of the legend:

title "Weight versus Height";
title2 "Overlay Least Squares Fit and Confidence Ellipse";
proc sgplot data=Sashelp.Class;
   scatter x=Height y=Weight / name="scatter";
   ellipse x=Height y=Weight / name="ellipse";
   reg x=Height y=Weight     / name="reg" nomarkers lineattrs=GraphData2;
   keylegend "reg" "ellipse"; /* list item in the order you want them */
run;
In the SGPLOT procedure in SAS, you can use the NAME= option to name components and the KEYLEGEND statement to specify which components should appear in a legend

3. Position the legend

The KEYLEGEND statement supports the LOCATION= and POSTITION= options, which enable you to place the legend almost anywhere in the graph. The LOCATION= option controls whether the legend appears inside or outside of the graph area. The POSITION= option controls the placement of the legend on the graph (left, right, top, bottom,...). However, I can never remember which option controls which attribute! Therefore, I created a mnemonic, which I hope will help you remember, too:

  • The LOCATION= option contains the substring 'CAT'. A CAT likes to go INSIDE and OUTSIDE the house. Therefore, the valid keywords for the LOCATION= option are INSIDE and OUTSIDE.
  • The POSITION= option contains the substring 'SIT'. You can SIT on the LEFT or RIGHT side of a couch. (Also, "position" can be used as a verb to mean "place on a page.") Therefore, the valid keywords for the POSITION= option are BOTTOM, BOTTOMLEFT, BOTTOMRIGHT, LEFT, RIGHT, TOP, TOPLEFT, and TOPRIGHT. (Some other graphical elements support a CENTER position, but not the legend.)

The following graph is the same as in the previous example, except that the location of the legend is inside the graph area and the position of the legend is in the lower-right corner. When you move the legend to the left or right side of the graph, it is often useful to use the ACROSS=1 option to force the legend to list the items vertically. Similarly, if you position the legend at the top or bottom of a graph, you might want to use the DOWN=1 option to list the items horizontally.

   keylegend "reg" "ellipse" / location=inside position=bottomright across=1;
In the SGPLOT procedure in SAS, you can use the LOCATION= and POSITION= options to position a legend

4. Exclude items from a legend

When you use the GROUP= option to display groups, you might want to exclude some of the group categories from the legend. The KEYLEGEND statement supports the EXCLUDE= option that you can use to exclude certain items. Three situations come to mind:

  • The group levels contain missing values. You might want to exclude the missing values from the legend by using KEYLEGEND / EXCLUDE=(" ");.
  • The purpose of the graph is to focus on one or two subgroups. If so, it might make sense to label only those groups. For example, if the purpose of a graph is to show income disparity between blacks and whites, you might decide not to include Asians or Hispanics in the legend: EXCLUDE=("Asian" "Hispanic");.
  • The group is binary. If a graph shows the results of a clinical trial and the legend includes the marker shape for the patients who died, it should be clear that the other marker shape represents patients who survived: EXCLUDE=("Alive");. An example is shown below
ods graphics / attrpriority=none;
title "Patient Status";
proc sgplot data=Sashelp.Heart(obs=200 where=(Systolic<=200));
   styleattrs datasymbols=(X CircleFilled);
   scatter x=Systolic y=Diastolic / group=Status;
   keylegend / exclude=("Alive");
run;
In the SGPLOT procedure in SAS, you can use the EXCLUDE= option on the KEYLEGEND statement to suppress a legend item, such as a category in a group

5. Customize items in a legend

The previous section shows how to exclude one or more levels in a categorical variable that is specified on the GROUP= option. You also might want to customize the items that appear in the legend in order to combine, for example, marker and line attributes. A situation where this comes up is when you want to overlay a group of curves on a scatter plot.

The LEGENDITEM statement (supported in SAS 9.4M5) enables you to specify what combination of markers and line patterns you want to appear for every item in a legend. It is a "super customization" statement that gives you complete control over the legend items.

The following statements show how to use the LEGENDITEM statement to create a customized legend. By default, if you use the REG statement with the GROUP= option, the legend will show only the colors and line patterns for the regression lines. In the following example, I have used the ATTRPRIORITY=NONE option to force the marker symbols to differ between groups. I want the legend to show not only the colors and patterns of the regression lines but also the marker symbols for each group:

/* ensure order of BP_Status is High, Normal, Optimal */
proc sort data=Sashelp.Heart(obs=200 where=(Systolic<=200)) out=Heart;
   by BP_Status;
run;
 
ods graphics / attrpriority=none;
title "Patients by Blood Pressure Status";
proc sgplot data=Heart;
   styleattrs datalinepatterns=(solid) ;
   reg x=Systolic y=Diastolic / group=BP_Status;   
   legenditem type=markerline name="H" /
      label="High" lineattrs=GraphData1 markerattrs=GraphData1; 
   legenditem type=markerline name="N" /
      label="Normal" lineattrs=GraphData2 markerattrs=GraphData2; 
   legenditem type=markerline name="O" /
      label="Optimal" lineattrs=GraphData3 markerattrs=GraphData3; 
   keylegend "O" "N" "H" / title="BP Status"; 
run;
In the SGPLOT procedure in SAS, you can use the LEGENDITEM statement to create customized items in a legend

In summary, PROC SGPLOT in SAS supports several ways to create, suppress, position, and customize the items in a legend. Do you have a favorite way to customize a legend in PROC SGPLOT? Leave a comment!

The post 5 tips for customizing legends in PROC SGPLOT in SAS appeared first on The DO Loop.

11月 192018
 

You might know that you can use the ODS SELECT statement to display only some of the tables and graphs that are created by a SAS procedure. But did you know that you can use a WHERE clause on the ODS SELECT statement to display tables that match a pattern? This article shows how to use wildcards, regular expressions, and pattern matching to select ODS tables in SAS.

ODS SELECT: Filter the output from SAS procedures

A SAS procedure might produce a dozen or more tables. You might be interested in displaying a subset of those tables. Recall that you can use the ODS TRACE ON statement to obtain a list of all the tables and graphs that a procedure creates. You can then use the ODS SELECT or the ODS EXCLUDE statement to control which tables and graphs are displayed.

Here's an example from the SAS/STAT documentation. The following PROC LOGISTIC call creates 27 tables and graphs, most of which are related to ROC curves. The ODS TRACE ON statement displays the names of each output object in the SAS log:

data roc;
   input alb tp totscore popind @@;
   totscore = 10 - totscore;
   datalines;
3.0 5.8 10 0   3.2 6.3  5 1   3.9 6.8  3 1   2.8 4.8  6 0
3.2 5.8  3 1   0.9 4.0  5 0   2.5 5.7  8 0   1.6 5.6  5 1
3.8 5.7  5 1   3.7 6.7  6 1   3.2 5.4  4 1   3.8 6.6  6 1
4.1 6.6  5 1   3.6 5.7  5 1   4.3 7.0  4 1   3.6 6.7  4 0
2.3 4.4  6 1   4.2 7.6  4 0   4.0 6.6  6 0   3.5 5.8  6 1
3.8 6.8  7 1   3.0 4.7  8 0   4.5 7.4  5 1   3.7 7.4  5 1
3.1 6.6  6 1   4.1 8.2  6 1   4.3 7.0  5 1   4.3 6.5  4 1
3.2 5.1  5 1   2.6 4.7  6 1   3.3 6.8  6 0   1.7 4.0  7 0
3.7 6.1  5 1   3.3 6.3  7 1   4.2 7.7  6 1   3.5 6.2  5 1
2.9 5.7  9 0   2.1 4.8  7 1   2.8 6.2  8 0   4.0 7.0  7 1
3.3 5.7  6 1   3.7 6.9  5 1   3.6 6.6  5 1
;
 
ods graphics on;
ods trace on;
proc logistic data=roc;
   model popind(event='0') = alb tp totscore / nofit;
   roc 'Albumin' alb;
   roc 'K-G Score' totscore;
   roc 'Total Protein' tp;
   roccontrast reference('K-G Score') / estimate e;
run;

The SAS log displays the names of the tables and graphs. A portion of log is shown below:

Output Added:
-------------
Name:       OddsRatios
Label:      Odds Ratios
Template:   Stat.Logistic.OddsRatios
Path:       Logistic.ROC3.OddsRatios
-------------

Output Added:
-------------
Name:       ROCCurve
Label:      ROC Curve
Template:   Stat.Logistic.Graphics.ROC
Path:       Logistic.ROC3.ROCCurve
-------------

Output Added:
-------------
Name:       ROCOverlay
Label:      ROC Curves
Template:   Stat.Logistic.Graphics.ROCOverlay
Path:       Logistic.ROCComparisons.ROCOverlay
-------------

Output Added:
-------------
Name:       ROCAssociation
Label:      ROC Association Statistics
Template:   Stat.Logistic.ROCAssociation
Path:       Logistic.ROCComparisons.ROCAssociation
-------------

Output Added:
-------------
Name:       ROCContrastCoeff
Label:      ROC Contrast Coefficients
Template:   Stat.Logistic.ROCContrastCoeff
Path:       Logistic.ROCComparisons.ROCContrastCoeff
-------------

Only a few of the 27 ODS objects are shown here. Notice that each ODS object has four properties: a name, a label, a template, and a path. Most of the time, the name is used on the ODS SELECT statement to filter the output. For example, if you want to display only the ROC curves and the overlay of the ROC curves, you can put the following statement prior to the RUN statement in the procedure:

   ods select ROCCurve ROCOverlay;     /* specify the names literally */

Use a WHERE clause in the ODS SELECT statement

Often the ODS objects that you want to display are related to each other. In the LOGISTIC example, you might want to display all the information about ROC curves. Fortunately, the SAS developers often use a common prefix or suffix, such as 'ROC', in the names of the ODS objects. That means that you can display all ROC-related tables and graphs be selecting the ODS objects whose name (or path) contains 'ROC' as a substring.

You can use the WHERE clause to select ODS objects whose name (or label or path) matches a particular pattern. The object's name is available in a special variable named _NAME_. Similarly, the object's label and path are available in variables named _LABEL_ and _PATH_, respectively. You cannot match patterns in the template string; there is no _TEMPLATE_ variable.

In SAS, the following operators and functions are useful for matching strings:

  • The CONTAINS keyword matches strings that contains a specified substring. The question mark (?) is an equivalent way to specify the CONTAINS operator.
  • The LIKE keyword matches strings to a pattern. The underscore (_) is a wildcard that matches any character. The percent sign (%) is a wildcard that matches one or more characters.
  • The "begins with" operators (=: and in:) match strings that begin with a certain pattern or set of patterns, respectively.
  • SAS functions such as FIND, INDEX, and SUBSTR can be used to match patterns.
  • The SAS PRXMATCH function, which enables you to use Perl regular expressions to match patterns.

For example, the following statements select ODS tables and graphs from the previous PROC LOGISTIC call. You can put one of these statements before the RUN statement in the procedure:

   /* use any one of the following statements inside the PROC LOGISTIC call */
   ods select where=(_name_ =: 'ROC');                 /* name starts with 'ROC' */
   ods select where=(_name_ like 'ROC%');              /* name starts with 'ROC' */
   ods select where=(_path_ ? 'ROC');                  /* path contains 'ROC' */
   ods select where=(_label_ ? 'ROC');                 /* label contains 'ROC' */
   ods select where=(_name_ in: ('Odds', 'ROC'));      /* name starts with 'Odds' or 'ROC' */
   ods select where=(substr(_name_,4,8)='Contrast');   /* name has subtring 'Contrast' at position 4 */

For additional examples of using pattern matching to select ODS objects, see Warren Kuhfeld's graphics-focused blog post and the section of the SAS/STAT User's Guide that discusses selecting ODS graphics.

Use PRXMATCH to match regular expressions

Although the CONTAIN and LIKE operators are often sufficient for selecting a table, SAS provides the powerful PRXMATCH function for more complex pattern-matching tasks. The PRXMATCH function uses Perl regular expressions to match strings. SAS provides a Perl Regular Expression "cheat sheet" that summarizes the syntax and commons search queries for the PRXMATCH function.

You can put any of the following statements inside the PROC LOGISTIC call:

   /* use any one of the following PRXMATCH expressions inside the PROC LOGISTIC call */
   ods select where=(prxmatch('/ROC/',       _name_)); /* name contains 'ROC' anywhere */
   ods select where=(prxmatch('/^ROC/',      _name_)); /* name starts with 'ROC' */
   ods select where=(prxmatch('/Odds|^ROC/', _name_)); /* name contains 'Odds' anywhere or 'ROC' at the beginning */
   ods select where=(prxmatch('/ROC/',      _name_)=0);     /* name does NOT contain 'ROC' anywhere */
   ods select where=(prxmatch('/Logistic\.ROC2/', _path_)); /* escape special wildcard character '.' */

In summary, the WHERE= option on the ODS SELECT (and ODS EXCLUDE) statement is quite powerful. Many SAS programmers know how to list the names of tables and graphs on the ODS SELECT statement to display only a subset of the output. However, the WHERE= option enables you to use wildcards and regular expressions to select objects whose names or paths match a certain pattern. This can be a quick and efficient way to select tables that are related to each other and share a common prefix or suffix in their name.

The post Select ODS tables by using wildcards and regular expressions in SAS appeared first on The DO Loop.

10月 222018
 

A SAS programmer asked how to rearrange elements of a matrix. The rearrangement he wanted was rather complicated: certain blocks of data needed to move relative to other blocks, but the values within each block were to remain unchanged. It turned out that the mathematical operation he needed is called a block transpose. The BTRAN function in SAS/IML performs block-transpose operations, so the complicated rearrangement was easy to implement with a judicious call to the BTRAN function.

This article discusses the block-transpose operation and gives an example. It also shows how a block transpose can conveniently transform wide data into long data and vice versa.

The block-transpose operation

In general, a block matrix can contain blocks of various sizes. However, the BTRAN function requires that all blocks be the same size. Specifically, suppose A is a block matrix where each block is an n x m matrix. That means that A is an (r n) x (c m) matrix for some whole numbers r and c. The BTRAN function transposes the blocks to create a (c n) x (r m) block matrix.

The idea is more easily explained by a picture. The following image shows a matrix A that is composed of six blocks (of the same size). The block-transpose operation rearranges the blocks but leaves the elements within the blocks unchanged.

A block-transpose operation transposes the blocks but leaves the contents of each block unchanged

The BTRAN function in SAS/IML

Let's see how the block transpose works on an example in the SAS/IML language. The following statements create a 4 x 12 matrix. I have overlaid some grid lines to help you visualize this matrix as a 2 x 3 block matrix, where each block is 2 x 4.

proc iml;
x = repeat(1:12,2);
x = x // (10+x);
print x;
A 4x12 matrix visualized as a 2x3 block matrix. Each block is 2x4.

You can use the BTRAN function to apply a block transpose. The first argument is the matrix that contains the data. The second and third arguments specify the size of the blocks:

y = btran(x, 2, 4);    /* block size is 2x4 */
print y;
A 6x6 matrix visualized as a 3x2 block matrix. Each block is 2x4.

The output is a 6 x 6 matrix, visualized as a 3 x 2 block matrix.

Block transpose for wide-to-long transforms

If the number of rows in the block equals the number of rows in the data, then the BTRAN function stacks columns. In particular, if the block size is n x 1, the BTRAN function stacks multiple variables into a single column, just like the SHAPECOL function. For example, the Sashelp.Iris data contains four variables named SepalLength, SepalWidth, PetalLength, and PetalWidth. The following SAS/IML statements stack the four variables into a single column and create a new column that identifies the name of the original variable:

proc iml;
/* wide to long */
varNames = {"SepalLength" "SepalWidth" "PetalLength" "PetalWidth"};
use Sashelp.Iris;
   read all var "Species";        /* 150 x 3 */
   read all var varNames into X;  /* 150 x 4 */
close;
n = nrow(X); m = ncol(X);         /* n = 150; m = 4 */
 
ID = repeat(Species, m);          /* repeat the vars that are not transposed */
Var = shapecol(repeat(varNames, n), n*m); /* create column that contains variable names */
Value = btran(X, n, 1);           /* vector with 150*4 rows */ 
 
create test var {ID "Var" "Value"}; append; close;

Block transpose for wide-to-long transforms

In a similar way, you can transpose balanced data from long form to wide form. (Recall that "balanced data" means that each group has the same number of observations.) For example, the Sashelp.Iris data contains a Species variable. The first 50 observations have the value 'Setosa', the next 50 have the value 'Versicolor', and the last 50 have the value 'Virginica'. You can use a WHERE clause or a BY statement to analyze each species separately, but suppose you want to create a new data set that contains only 50 observations and has variables named SepalLengthSetosa, SepalLengthVersicolor, SepalLengthVirginica, and so forth for the other variables. The BTRAN function makes this easy: just specify a block size of 50 x 4, as follows.

/* create pairwise combinations of var names and group levels */
BlockRows = 50;
w = btran(X, BlockRows, m);                  /* 50 x 12 matrix */
 
s = Species[ do(1, n, BlockRows) ];          /* s = {'Setosa' 'Versicolor', 'Virginica'} */
vNames = rowcatc(expandgrid(varNames, s));   /* combine var names and species */
print vNames, w[c=vNames L="Measurenents (cm)"];

In summary, the BTRAN function is a useful function when you need to rearrange blocks of data without changing the values in a block. For balanced designs, you can use the BTRAN function to convert data between wide and long formats.

The post Transpose blocks to reshape data appeared first on The DO Loop.

10月 082018
 

This article compares several ways to find the elements that are common to multiple sets. I test which method is the fastest in the SAS/IML language. However, all algorithms are intrinsically fast, which raises an important question: when is it worth the time and effort to optimize an algorithm?

The idea for this topic came from reading a blog post by 'mikefc' at the "Cool but Useless" blog about the relative performance of various ways to intersect vectors in R. Mikefc used vectors that contained between 10,000 and 100,000 elements and whose intersection was only a few dozen elements. In this article, I increase the sizes of the vectors by an order of magnitude and time the performance of the intersection function in SAS/IML. I also ensure that the sets share a large set of common elements so that the intersection is sizeable.

The intersection of large sets

In general, finding the intersection between sets is a fast operation. In languages such as R, MATLAB, and SAS/IML, you can store the sets in vectors and use built-in functions to form the intersection. Because the intersection of two sets is always smaller than (or equal to) the size of the original sets, computing the "intersection of intersections" is usually faster than computing the intersection of the original (larger) sets.

The following SAS/IML program creates eight SAS/IML vectors that have between 200,000 and 550,000 elements. The elements in the vectors are positive integers, although they could also be character strings or non-integers. The vectors contain a large subset of common elements so that the intersection is sizeable. The vectors A, B, ..., H are created as follows:

proc iml;
call randseed(123);
NIntersect = 5e4; 
common = sample(1:NIntersect, NIntersect, "replace"); /* elements common to all sets */
 
source = 1:1e6;        /* elements are positive integers less than 1 million */
BaseN = 5e5;           /* approx magnitude of vectors (sets) */
A = sample(source,     BaseN, "replace") || common; /* include common elements */
B = sample(source, 0.9*BaseN, "replace") || common;
C = sample(source, 0.8*BaseN, "replace") || common;
D = sample(source, 0.7*BaseN, "replace") || common;
E = sample(source, 0.6*BaseN, "replace") || common;
F = sample(source, 0.5*BaseN, "replace") || common;
G = sample(source, 0.4*BaseN, "replace") || common;
H = sample(source, 0.3*BaseN, "replace") || common;

The intersection of these vectors contains about 31,600 elements. You can implement the following methods that find the intersection of these vectors:

  1. The XSECT function in SAS/IML accepts up to 15 arguments. You can therefore make a single call to find the intersection.
  2. If you have dozens or hundreds of sets to intersect, you can loop over the sets and call the XSECT function multiple times. (Recall that the VALUE function enables you to loop over the names of the sets and access the vectors by names.) For example, you can let W1 be the intersection of the first two sets, let W2 be the intersection of W1 with the third set, and so forth. Because the size of the intersection is typically smaller than the size of the sets, later intersections require less time to compute than earlier intersections. (Although I only implement pairwise intersections, you could conceivably intersect more than two sets at a time.)
  3. As 'mikefc' mentions, it is quicker to intersect small sets than to intersect large sets. Thus you can preprocess the names of the sets and sort them by size. This heuristic might make a difference when the set sizes vary greatly.

The following SAS/IML statements implement these three methods and compute the time required to intersect each:

/* 1. one call to the built-in method */
t0  = time();
   w = xsect(a,b,c,d,e,f,g,h);
tBuiltin = time() - t0;
 
/* 2. loop over variable names and form pairwise intersections */
varName = "a":"h";
t0  = time();
   w = value(varName[1]);
   do i = 2 to ncol(varName);
     w = xsect(w, value(varName[i]));  /* compute pairwise intersection */
   end;
tPairwise = time() - t0;
 
/* 3. Sort by size of sets, then loop */
varName = "a":"h";
t0  = time();
   len = j(ncol(varName), 1);    
   do i = 1 to ncol(varName);
     len[i] = ncol(value(varName[i])); /* number of elements in each set */
   end;
   call sortndx(idx, len);             /* sort smallest to largest */
   sortName = varName[,idx];
   w = value(sortName[1]);
   do i = 2 to ncol(sortName);
     w = xsect(w, value(sortName[i])); /* compute pairwise intersection */
   end;
tSort = time() - t0;
 
print tBuiltin tPairwise tSort;

For this example data, each method takes about 0.5 seconds to find the intersection of eight sets that contain a total of 3,000,000 elements. If you rerun the analysis, the times will vary by a few hundredths of a second, so the times are essentially equal. ('mikefc' also discusses a few other methods, including a "tally method." The tally method is fast, but it relies on the elements being positive integers.)

Whether to optimize or not

There is a quote that I like: "In theory, there is no difference between theory and practice. But, in practice, there is." (Published by Walter J. Savitch, who claims to have overheard it at a computer science conference.) For the intersection problem, you can argue theoretically that pre-sorting by length is an inexpensive way to potentially obtain a boost in performance. However, in practice (for this example data), pre-sorting improves the performance by less than 1%. Furthermore, the absolute times for this operation are short, so saving a 1% of 0.5 seconds might not be worth the effort.

Even if you never compute the intersection of sets in SAS, there are two take-aways from this exercise:

  • When you measure the performance of algorithms, use example data that is similar to the real data that you expect to encounter. The performance of this problem depends on the size of the sets, the number of sets, and the contents of the sets. The results that I show here are based on my choices for these parameters. Different parameters might yield different results.
  • Always consider the total time for an algorithm before you start to optimize it. If your analysis takes 30 seconds, there is no need to optimize a step in the analysis that takes 0.5 seconds. No matter how well you optimize that step, it is not going to substantially shrink the total run time of your analysis.

The post The intersection of multiple sets appeared first on The DO Loop.

9月 172018
 

The SAS/IML language and the MATLAB language are similar. Both provide a natural syntax for performing high-level computations on vectors and matrices, including basic linear algebra subroutines. Sometimes a SAS programmer will convert an algorithm from MATLAB into SAS/IML. Because the languages are not identical, I am sometimes asked, "what is the SAS/IML function that is equivalent to the XYZ function in MATLAB?" One function I am often asked about is the linspace function in MATLAB, which generates a row vector of n evenly spaced points in a closed interval. Although I have written about how to generate evenly spaced points in SAS/IML (and in the DATA step, too!), the name of the SAS/IML function that performs this operation (the DO function) is not very descriptive. Understandably, someone who browses the documentation might pass by the DO function without realizing that it is the function that generates a linearly spaced vector. This article shows how to construct a SAS/IML function that is equivalent to the MATLAB linspace function.

Generate equally spaced points in an interval

Syntactically, the main difference between the DO function in SAS/IML and the linspace function in MATLAB is that the third argument to the DO function is a step size (an increment), whereas the third function to the linspace function is the number of points to generate in an interval. But that's no problem: to generate n evenly spaced points on the interval [a, b], you can use a step size of (b – a)/(n – 1). Therefore, the following SAS/IML function is a drop-in replacement for the MATLAB linspace function:

proc iml;
/* generate n evenly spaced points (a linearly spaced vector) in the interval [a,b] */
start linspace(a, b, numPts=100);
   n = floor(numPts);               /* if n is not an integer, truncate */
   if n < 1 then return( {} );      /* return empty matrix */
   else if n=1 then return( b );    /* return upper endpoint */
   return( do(a, b, (b-a)/(n-1)) ); /* return n equally spaced points */
finish;

A typical use for the linspace function is to generate points in the domain of a function so that you can quickly visualize the function on an interval. For example, the following statements visualize the function exp( -x2 ) on the domain [-3, 3]:

x = linspace(-3, 3);                /* by default, 100 points in [-3,3] */
title "y = exp( -x^2 )";
call series(x, exp(-x##2));         /* graph the function */
Graph of y = exp( -x^2 )

Reminder: "10.0 times 0.1 is hardly ever 1.0"

This is a good time to remind everyone of the programmer's maxim (from Kernighan and Plauger, 1974, The Elements of Programming Style) that "10.0 times 0.1 is hardly ever 1.0." Similarly, "5 times 0.2 is hardly ever 1.0." The maxim holds because many finite decimal values in base 10 have a binary representation that is infinite and repeating. For example, 0.1 and 0.2 are represented by repeating decimals in base 2. Specifically, 0.210 = 0.00110011...2. Thus, just as 3 * (0.3333333) is not equal to 1 in base 10, so too is 5 * 0.00110011...2 not equal to 1 in base 2.

A consequence of this fact is that you should avoid testing floating-point values for equality. For example, if you generate evenly spaced points in the interval [-1, 1] with a step size of 0.2, do not expect that 0.0 is one of the points that are generated, as shown by the following statements:

z = do(-1, 1, 0.2);
/* find all points that are integers */
idx = loc( z = int(z) );               /* test for equality (bad idea) */
print (idx // z[,idx])[r={'idx', 'z[idx]'}];  /* oops! 0.0 is not there! */
 
print z;                               /* show that 0.0 is not one of the points */

When you query for all values for which z = int(z), only the values -1 and +1 are found. If you print out the values in the vector, you'll see that the middle value is an extremely tiny but nonzero value (-5.55e=17). This is not a bug but is a consequence of the fact that 0.2 is represented as a repeating value in binary.

So how can you find the points in a vector that "should be" integers (in exact arithmetic) but might be slightly different than an integer in floating-point arithmetic? The standard approach is to choose a small distance (such as 1e-12 or 1e-14) and look for floating-point numbers that are within that distance from an integer. In SAS, you can use the ROUND function or check the absolute value of the difference, as follows:

eps = 1e-12;
w = round(z, eps);                            /* Round to nearest eps */
idx = loc( int(w) = w);                       /* find points are within epsilon of integer */
print idx;                                    
 
idx = loc( abs(int(z) - z) < eps );           /* find points whose distance to integer is less than eps */
print (idx // z[,idx])[r={'idx', 'z[idx]'}];

In summary, this article shows how to define a SAS/IML function that is equivalent to the MATLAB linspace function. It also reminds us that some finite decimal values (such as 0.1 and 0.2) do not have finite binary representations. When these values are used to generate an arithmetic sequence, the resulting vector of values might be different from what you expect. A wise practice is to never test a floating-point value for equality, but instead to test whether a floating-point value is within a small distance from a target value.

The post Linearly spaced vectors in SAS appeared first on The DO Loop.

8月 082018
 
Use the GROUP= option in PROC SGPLOT to plot lines for TWO categorical variables

The SGPLOT procedure in SAS makes it easy to create graphs that overlay various groups in the data. Many statements support the GROUP= option, which specifies that the graph should overlay group information. For example, you can create side-by-side bar charts and box plots, and you can overlay multiple scatter plots and series plots in the same graph. However, the GROUP= option takes only a single grouping variable! What can you do if you need to visualize combinations of two (or even three!) categorical variables? This article shows how to construct a new group variable that combines the levels of two or more existing categorical variables.

For concreteness, I will show how to overlay multiple series plots (line plots), as shown to the right. (Click to enlarge.) By using this technique, you can overlay curves that describe combinations of gender and race. Or you can plot response curves for control-vs-experimental groups and the severity of a disease. You can use this technique for other plot types, such as creating box plots that visualize a two-way ANOVA.

What is the problem? Why can't you use the GROUP= option?

To motivate the discussion, let's first see why the GROUP= option in the SERIES statement does not work for overlaying two categorical variables. The following DATA step creates two categorical variables. The G1 variable has the values 1, 2, and 3. The G2 variable has the values 'A' and 'B'. For each of the six combinations of (G1, G2), the DATA step creates a curve (actually a line) of (X, Y) values. Let's see what happens if you try to plot the lines by using a SERIES statement with the GROUP=G1 option:

data TwoGroups;
do G1=1 to 3;
   do G2='A', 'B';
      do X = 1 to 10;
         Y = 10 + G1 + 0.5*(G2='A') + G1*X/20;
         output;
      end;
   end;
end;
run;
 
title "First Attempt: Does Not Work";
proc sgplot data=TwoGroups;
   series x=x y=y / group=G1 curvelabel;
run;

The attempt is a failure. The graph does not show six individual curves. Instead, it shows three curves (the number of categories in the G1 variables) and each "curve" is Z-shaped because the graph traces the curve for G2='A' on the range [1, 10] and then draws the curve for G2='B' without "picking up the pen." This happens because the data are sorted by G1, then by G2, then by X.

There are two ways to handle this situation. One way is to try to convert the data from "long form" to "wide form." For these data, which have identical X values for every curve, you can create two new variables Y_A and Y_B that contain the coordinates of the three curves for G2='A' and G2='B', respectively. You can then use two SERIES statements, each using the GROUP=G1 option. Each statement will draw three curves for a total of six. You can use the NOCYCLEATTRS option to make sure that each statement uses the same line colors and patterns. However, this approach becomes complicated if each curve is evaluated at a different set of X values. In that case, it is better to keep the data in long form.

Forming a new group variable by concatenation

The problem would be solved if there were one categorical variable that had six levels instead of two categorical variables that have six joint levels, so let's write some SAS code to make that happen. First sort the data by the categorical variables and then by the X variable. (For the current example, the data are already sorted correctly.) Then write a DATA step that does either of the following options:

  • Option 1: Use the CATT (or CATX) function to concatenate the values of the existing group variables. The new categorical variable will have values that derived from the original variables.
  • Option 2: Use the FIRST.variable syntax to create a new group variable that has the values 1–6. Then use PROC FORMAT to assign a meaningful value to each level of the new categorical variable.

The first option (the CATT function) is automated and less prone to error, but for the sake of completeness both options are shown below:

/* create a new group variable by concatenating the two existing variables */
proc sort data=TwoGroups;
   by G1 G2 X;
run;
 
data Make2Groups;
set TwoGroups;
by G1 G2;
/* Option 1: automatically create joint levels from original levels */
Label = catt("G1=", G1) || "; " || catt("G2=", G2);
/* Option 2: create new categorical variable and use PROC FORMAT to assign values */
if first.G2 then 
   GroupID + 1;        /* GroupID = 1, 2, 3, ... numGroups */
run;
 
/* For Option 2: use PROC FORMAT to form labels that encode the two groups. See
https://blogs.sas.com/content/iml/2017/04/24/two-way-anova-viz.html
*/
proc format;                 
  value GroupFmt 1 = "G1=1; G2='A'"   2 = "G1=1; G2='B'"   3 = "G1=2; G2='A'"
                 4 = "G1=2; G2='B'"   5 = "G1=3; G2='A'"   6 = "G1=3; G2='B'";
run;

The following call to PROC SGPLOT creates a series plot for the Label variable, which corresponds to the joint levels of the original grouping variables. The GROUPLC= option (supported in SAS 9.4M2) colors the lines according to the value of the G2 variable.

title "Two Groups, One SERIES Statement";
proc sgplot data=Make2Groups;
/* format GroupID GroupFmt.; */            /* for Option 2 */
series x=x y=y / group=Label grouplc=G2    /* or group=GroupID for Option 2 */
                 lineattrs=(pattern=solid) curvelabel;
run;

The graph is shown at the top of this article. The new categorical variable has values that correspond to joint levels of the original two variables. The GROUP= option creates six curves when you specify the Label variable. The GROUPLC= option sets the colors of the lines according to the values of the G2 variable.

This technique generalizes to three categorical variables, but I will leave the details to the reader. You might want to use the GROUPLP= option to set the line patterns according to the value of a third categorical variable. Beyond three variables the display will begin to resemble a spaghetti plot. For many categorical variables, you might want to use panels and BY groups to visualize the curves.

This technique also generalizes to other plot types, such as box plots and scatter plots.

The post Plot curves for levels of two categorical variables in SAS appeared first on The DO Loop.

1月 032018
 

I wrote more than 100 posts for The DO Loop blog in 2017. The most popular articles were about SAS programming tips, statistical data analysis, and simulation and bootstrap methods. Here are the most popular articles from 2017 in each category.

General SAS programming techniques

Statistics and Data Analysis

Observed and Expected proportions of M&M color (2017)
  • M&M Colors: It's no surprise that a statistical analysis of the color distribution of M&M candies was one of the most popular articles. Some people are content to know that the candies are delicious, but thousands wanted to read about whether blue and orange candies occur more often than brown.
  • Interpretation of Correlation: Correlation is one of the simplest multivariate statistics, but it can be interpreted in many ways: algebraic, geometric, in terms of regression, and more. This article describes seven ways to view correlation?
  • Winsorize Data: Before you ask "how can I Winsorize data" to eliminate outliers, you should ask "what is Winsorization" and "what are the pitfalls?" This article presents the advantages and disadvantages of Winsorizing data.

Simulation and Bootstrapping

Was you New Year's resolution to learn more about SAS? Did you miss any of these popular posts? Take a moment to read (or re-read!) one of these top 10 posts from the past year.

The post The top 10 posts from <em>The DO Loop</em> in 2017 appeared first on The DO Loop.

12月 182017
 

Slice, slice, baby! You've got to slice, slice, baby!

When you fit a regression model that has multiple explanatory variables, it is a challenge to effectively visualize the predicted values. This article describes how to visualize the regression model by slicing the explanatory variables. In SAS, you can use the SLICEFIT option in the EFFECTPLOT statement visualize a slice of a regression surface.

Why the naive visualization fails

For a regression model that contains one explanatory variable and (optionally) one classification variable, it is easy to visualize the predicted values. Most statistical software packages make it easy to create a "fit plot." For example, the following call to PROC GLM in SAS fits a model to some patients in a heart study:

data Heart;    /* create example data */
set sashelp.heart(obs=500);
where cholesterol < 400;
run;
 
ods graphics / attrpriority=none     /* groups determine symbols and line patterns */
               imagemap tipmax=1500; /* enable tool tips */
 
/* easy to visualize predicted values for 1 continuous and 1 categorical explanatory variable */
proc glm data=Heart plots=meanplot;  /* PLOTS= option supported in many procedures */
class Sex;
model Cholesterol = Sex Systolic;
quit;

The graph shows the observed responses versus the continuous explanatory variable and overlays two curves: one for the predicted values when Sex='Male' and the other when Sex='Female'. Creating this graph is easy because the procedure does all the work.

What happens if you add additional explanatory variables into the model and try to create the same graph? For reasons that will soon be apparent, the procedure will not automatically create the graph when there are additional variables in the model. However, you can use the OUTPUT statement to write the predicted values to a SAS data set and use PROC SGPLOT to create the graph. You will need to sort by the variable that you are plotting on the X axis, as follows:

proc glm data=Heart;
class Sex Smoking_status;
model Cholesterol = Sex Smoking_Status    /* two classification variables */
                    Systolic Weight;      /* two continuous variables */
output out=GLMOut p=Pred;                 /* output data set contains predicted values */
quit;
 
proc sort data=GLMOut; by Systolic Sex; run; /* sort by X variable for graphing */
 
title "Predicted Values";
proc sgplot data=GLMOut;
styleattrs datalinepatterns=(solid solid);
scatter x=Systolic y=Cholesterol / group=Sex transparency=0.75;
series  x=Systolic y=Pred / group=Sex tip=(Smoking_Status Weight); /* add tool tips */
yaxis min=180 max=300;    /* zoom in on predicted values */
footnote J=L "Jagged Lines Because Covariates Have Multiple Values";
run;
Visualize regression model: Graph of response versus explanatory variable. There are hidden explanatory variables. Markers are observed values. Jagged lines are the projections of the predicted values.

This graph looks strange. The regression model is linear, but a plot of the predicted values shows a jagged line for the predicted values. What is going on?

You can use the tool tips feature of the graph to understand why the curves are jagged. If you hover the cursor near a point on the jagged line, the values of the hidden explanatory variables (Weight and Smoking_Status) appear. The graph shows the tool tip at a point that corresponds to a male patient who weighs 160 pounds and who is a moderate smoker. By moving the cursor, you can discover that the previous point along the red line corresponds to a male patient who weighs 155 pounds and is a non-smoker. The subsequent point corresponds to a heavy smoker who weighs 151 pounds.

Because Weight and Smoking_Status were included in the model, the predicted values "jump" up or down as you move along the Systolic axis. Two observations that have similar Systolic values might have very different values for other (hidden) components. Geometrically, this graph displays the projection of the predicted values onto the two-dimensional (Systolic, Cholesterol) plane. To obtain a smooth curve, you must "slice" a response surface rather than project it.

Slice the response surfaces

The predicted values for this model form a set of 10 planes in the three-dimensional space (x, y, z) = (Systolic, Weight, Cholesterol). Each plane is the graph of predicted values for a combination of the 2 genders and 5 levels of smokers. There is one plane is for the ('Male', 'Non-smoker') patients, another for the ('Female', 'Light (1-5)') patients, and so on.

A "slice" through the response surfaces is accomplished by evaluating the model at a particular value of one of the continuous variables. This gives a two-dimensional plot that has 10 lines on it. Because 10 lines might overcrowd the display, it is common to pick a reference value for one of the classification variables and plot only the lines that are indexed by that value. For example, if you choose the reference value Smoking_Status = 'Non-smoker', the plot contains two lines that correspond to ('Male', 'Non-smoker') and ('Female', 'Non-smoker').

This might sound complicated, but SAS provides an easy implementation: the SLICEFIT option in the EFFECTPLOT statement, which is supported in several regression procedures, enables you to specify how you want to slice the surfaces and which combinations of levels you want to display.

By default, the EFFECTPLOT SLICEFIT statement creates a "sliced fit plot" that graphs the response variable versus the first continuous variable and shows the predicted values for each level of the first class variable. "First" is determined by the order in which the variables are listed on the MODEL statement. Other continuous variables are sliced (evaluated) at their mean value; other classification variables are evaluated at their last level.

PROC GLM does not support the EFFECTPLOT statement, but PROC GENMOD does. The following call to PROC GENMOD fits the same model and creates a "sliced fit plot" of the predicted values. The sliced fit plot will show the response variable (Cholesterol) versus the first continuous variable (Systolic) overlaid with predictions for males and females. The value of the Weight variable is set to 151.7, which is the mean value of the sample. The value of the Smoking_Status variable is set to 'Very Heavy (> 25)', which is the last level in alphanumeric order.

title; footnote;
ods graphics / attrpriority=none imagemap=off;
proc genmod data=Heart;
class Sex Smoking_status;
model Cholesterol = Sex Smoking_Status   /* classification variables */
                    Systolic Weight;     /* continuous variables */
/* Plot response vs first cont var for each level of first class var */  
/* Set other cont vars to MEAN; set other class vars to last level */
effectplot slicefit / obs;               /* add scatter plot of observations */
run;
Sliced fit plot for multivariate regression model. Created by the EFFECTPLOT statement in SAS.

The sliced fit plot shows smooth (not jagged) lines because the model is evaluated at constant values of the hidden variables. The values (Weight, Smoking_Status) = (151.7, 'Very Heavy (> 25)') are held constant while the model is evaluated over the range of the Systolic and Sex variables.

Other ways to slice the response surfaces

The SLICEFIT option in the EFFECTPLOT statement supports many suboptions that enable you to control the way that the model is sliced:

  • You can plot any two variables, one continuous and one categorical. Use the X= option to specify the continuous variable and the SLICEBY= option to specify the categorical variable.
  • You can specify the statistics that are used to slice the continuous covariates. By default the covariates are sliced at their mean values. You can use the AT option to specify the following keywords: MEAN (the default), MIN, MAX, MEDIAN, or MIDRANGE. (Recall that the midrange is the value (min+max)/2.) For class variables, the REF option specifies that the last level be used.
  • You can use the AT option to specify particular values for slicing the continuous covariates and class variables.
  • You can specify multiple values for the AT option. The EFFECTPLOT statement will create a panel of sliced fit plots, one for each joint combination of specified values.

The following four EFFECTPLOT statements correspond to the four items in the previous list:

proc genmod data=Heart;
class Sex Smoking_status;
model Cholesterol = Sex Smoking_Status    /* classification variables */
                    Systolic Weight;      /* continuous variables */
/* specify the X and categorical variables */
effectplot slicefit(X=weight sliceby=Smoking_status)  / obs;
 
/* specify statistics used to slice the covariates */
effectplot slicefit / at MIDRANGE      /* new default for continuous vars */ 
                         REF;          /* default for classification vars */
 
/* specify explicit values of the covariates */
effectplot slicefit / at(Weight=150
                         Smoking_Status='Non-smoker');
 
/* specify multiple values of the covariates to get a panel */
effectplot slicefit / at(Weight=150 200
			 Smoking_Status='Non-smoker' 'Heavy (16-25)');
quit;

To save space, only the last sliced fit plot (the panel) is shown below. I have linked to the other three plots: the plot of Weight and Smoking_Status, the plot at midrange, and the plot at specified values.

Panel of sliced fit plot created by EFFECTPLOT SLICEFIT / AT(Weight=150 200  Smoking_Status='Non-smoker' 'Heavy (16-25)'

In summary, you can use the SLICEFIT option in the EFFECTPLOT statement in SAS to visualize regression models that contain many explanatory variables. The AT option enables you to specify values for the covariates. The resulting graph displays a slice through the response surface.

The EFFECTPLOT statement is also available in PROC PLM. PROC PLM enables you to visualize a model that has been saved to an item store. The OBS option (which overlays the predicted values and a scatter plot) is not available in PROC PLM because the item store does not include the observations.

The post Visualize multivariate regression models by slicing continuous variables appeared first on The DO Loop.

12月 042017
 

Imputing missing data is the act of replacing missing data by nonmissing values. Mean imputation replaces missing data in a numerical variable by the mean value of the nonmissing values. This article shows how to perform mean imputation in SAS. It also presents three statistical drawbacks of mean imputation.

How to perform mean imputation in SAS

The easiest way to perform mean imputation in SAS is to use PROC STDIZE. PROC STDIZE supports the REPONLY and the METHOD=MEAN options, which tells it to replace missing values with the mean for the variables on the VAR statement. To demonstrate mean imputation, the following statements randomly add missing values to the Sashelp.Class data set. The call to PROC STDIZE then replaces the missing values and creates a data set called IMPUTED that contains the results:

/* Create "original data" by randomly inserting missing values for some heights */
data Have;
set sashelp.class;
call streaminit(12345);
Replaced = rand("Bernoulli", 0.4);  /* indicator variable is 1 about 40% of time */
if Replaced then Height = .;        
run;
 
/* Mean imputation: Use PROC STDIZE to replace missing values with mean */
proc stdize data=Have out=Imputed 
      oprefix=Orig_         /* prefix for original variables */
      reponly               /* only replace; do not standardize */
      method=MEAN;          /* or MEDIAN, MINIMUM, MIDRANGE, etc. */
   var Height;              /* you can list multiple variables to impute */
run;
 
proc print data=Imputed;
   format Orig_Height Height BESTD8.1;
   var Name Orig_Height Height Weight Replaced;
run;
Mean imputation in SAS

The output shows that the missing data (such as observations 6 and 8) are replaced by 61.5, which is the mean value of the observed heights. For a subsequent visualization, I have included a binary variable (Replaced) that indicates whether an observation was originally missing. The METHOD= option in PROC STDIZE supports several statistics. You can use METHOD=MEDIAN to replace missing values by the median, METHOD=MINIMUM to replace by the minimum value, and so forth.

Problems with mean imputation

Most software packages deal with missing data by using listwise deletion: observations that have missing data are dropped from the analysis. Throwing away hard-collected data is painful and can result in a substantial loss of power for statistical tests. Mean imputation, which is easy to implement, enables analysts to use every observation. However, mean imputation has three serious disadvantages that can lead to problems in your statistical analysis. Mean imputation is a univariate method that ignores the relationships between variables and makes no effort to represent the inherent variability in the data. In particular, when you replace missing data by a mean, you commit three statistical sins:

  • Mean imputation reduces the variance of the imputed variables.
  • Mean imputation shrinks standard errors, which invalidates most hypothesis tests and the calculation of confidence interval.
  • Mean imputation does not preserve relationships between variables such as correlations.

These problems are discussed further in my next blog post. Most experts agree that the drawbacks far outweigh the advantages, especially since most software supports modern alternatives to single imputation, such as multiple imputation. My advice: don't use mean imputation if you can use a more sophisticated alternative.

Epilogue

When I was in college, an actor friend smoked cigarettes. He knew that he should stop, but his addiction was too strong. When he lit up he would recite the following verse and dramatically punctuate the final phrase by blowing a smoke ring:

     If you don't smoke, don't start.
     If you do smoke, stop.
     If you do smoke and won't stop, smoke with style. (*blows smoke ring*)

I don't recommend mean imputation. It is bad for the health of your data. But I can't dissuade from using mean imputation, remember the following verse:

     If you don't use mean imputation, don't start.
     If you do use mean imputation, stop.
     If you do use mean imputation and won't stop, use PROC STDIZE.

The post Mean imputation in SAS appeared first on The DO Loop.

10月 232017
 

A common question on discussion forums is how to compute a principal component regression in SAS. One reason people give for wanting to run a principal component regression is that the explanatory variables in the model are highly correlated which each other, a condition known as multicollinearity. Although principal component regression (PCR) is a popular technique for dealing with almost collinear data, PCR is not a cure-all. This article shows how to compute a principal component regression in SAS; a subsequent article discusses the problems with PCR and presents alternative techniques.

Multicollinearity in regression

Near collinearity among the explanatory variables in a regression model requires special handling because:

  • The crossproduct matrix X`X is ill-conditioned (nearly singular), where X is the data matrix.
  • The standard errors of the parameter estimates are very large. The variance inflation factor (VIF), which is computed by PROC REG, is one way to measure how collinearities inflate the variances of the parameter estimates.
  • The model parameters are highly correlated, which makes interpretation of the parameters difficult.

Principal component regression keeps only the most important principal components and discards the others. This means that you compute the principal components for the explanatory variables and drop the components that correspond to the smallest eigenvalues of X`X. If you keep k principal components, then those components enable you to form a rank-k approximation to the crossproduct matrix. If you regress the response variable onto those k components, you obtain a PCR. Usually the parameter estimates are expressed in terms of the original variables, rather than in terms of the principal components.

In SAS there are two easy ways to compute principal component regression:

  • The PLS procedure supports the METHOD=PCR to perform principal component regression. You can use the NFAC= option to determine the number of principal components to keep.
  • The MODEL statement in PROC REG supports the PCOMIT= option. (This option is read as "PC omit.") The argument to the PCOMIT= option is the number of principal components to drop (omit) from the regression.

Notice that neither of these methods calls PROC PRINCOMP. You could call PROC PRINCOMP, but it would be more complicated than the previous methods. You would have to extract the first principal components (PCs), then use PROC REG to compute the regression coefficients for the PCs, then use matrix computations to convert the parameter estimates from the PCs to the original variables.

Principal component regression is also sometimes used for general dimension reduction. Instead of projecting the response variable onto a p-dimensional space of raw variables, PCR projects the response onto a k-dimensional space where k is less than p. For dimension reduction, you might want to consider another approach such as variable selection by using PROC GLMSELECT or PROC HPGENSELECT. The reason is that the PCR model retains all of the original variables whereas variable selection procedures result in models that have fewer variables.

Use PROC PLS for principal component regression

I recommend using the PLS procedure to compute a principal component regression in SAS. As mentioned previously, you need to use the METHOD=PCR and NFAC= options. The following data for 31 men at a fitness center is from the documentation for PROC REG. The goal of the study is to predict oxygen consumption from age, weight, and various physiological measurements before and during exercise. The following call to PROC PLS computes a PCR that keeps four principal components:

data fitness;
   input Age Weight Oxygen RunTime RestPulse RunPulse MaxPulse @@;
   datalines;
44 89.47 44.609 11.37 62 178 182   40 75.07 45.313 10.07 62 185 185
44 85.84 54.297  8.65 45 156 168   42 68.15 59.571  8.17 40 166 172
38 89.02 49.874  9.22 55 178 180   47 77.45 44.811 11.63 58 176 176
40 75.98 45.681 11.95 70 176 180   43 81.19 49.091 10.85 64 162 170
44 81.42 39.442 13.08 63 174 176   38 81.87 60.055  8.63 48 170 186
44 73.03 50.541 10.13 45 168 168   45 87.66 37.388 14.03 56 186 192
45 66.45 44.754 11.12 51 176 176   47 79.15 47.273 10.60 47 162 164
54 83.12 51.855 10.33 50 166 170   49 81.42 49.156  8.95 44 180 185
51 69.63 40.836 10.95 57 168 172   51 77.91 46.672 10.00 48 162 168
48 91.63 46.774 10.25 48 162 164   49 73.37 50.388 10.08 67 168 168
57 73.37 39.407 12.63 58 174 176   54 79.38 46.080 11.17 62 156 165
52 76.32 45.441  9.63 48 164 166   50 70.87 54.625  8.92 48 146 155
51 67.25 45.118 11.08 48 172 172   54 91.63 39.203 12.88 44 168 172
51 73.71 45.790 10.47 59 186 188   57 59.08 50.545  9.93 49 148 155
49 76.32 48.673  9.40 56 186 188   48 61.24 47.920 11.50 52 170 176
52 82.78 47.467 10.50 53 170 172
;
 
proc pls data=fitness method=PCR nfac=4;          /* PCR onto 4 factors */
   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse / solution;
run;

The output includes the parameter estimates table, which gives the estimates for the four-component regression in terms of the original variables. Another table (not shown) shows that the first four principal components explain 93% of the variation in the explanatory variables and 78% of the variation in the response variable.

For another example of using PROC PLS to combat collinearity, see Yu (2011), "Principal Component Regression as a Countermeasure against Collinearity."

Use PROC REG for principal component regression

I recommend PROC PLS for principal component regression, but you can also compute a PCR by using the PCOMIT= option on the MODEL statement in PROC REG. However, the parameter estimates are not displayed in any table but must be written to OUTEST= data set, as follows:

proc reg data=fitness plots=none outest=PE; /* write PCR estimates to PE data set */
   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
         / PCOmit=2;       /* omit 2 PCs ==> keep 6-2=4 PCs */
quit;
 
proc print data=PE(where=(_Type_="IPC")) noobs;
   var Intercept--MaxPulse;
run;

Notice that the PCOMIT=2 option specifies that two PCs should be dropped, which is equivalent to keeping four components in this six-variable model. The parameter estimates are written to the PE data set and are displayed by PROC PRINT. The estimates the same as those found by PROC PLS. In the PE data, the PCR estimates are indicated by the value "IPC" for the _TYPE_ variable, which stands for incomplete principal component regression. The word "incomplete" indicates that not all the principal components are used.

It is worth noting that even though the principal components themselves are based on centered and scaled data, the parameter estimates are reported for the original (raw) variables. It is also worth noting that you can use the OUTSEB option on the PROC REG statement to obtain standard errors for the parameter estimates.

Should you use principal component regression?

This article shows you how to perform principal component regression in SAS by using PROC PLS with METHOD=PCR. However, I must point out that there are statistical drawbacks to using principal component regression. The primary issue is that principal component regression does not use any information about the response variable when choosing the principal components. Before you decide to use PCR, I urge you to read my next post about the drawbacks with the technique. You can then make an informed decision about whether you want to use principal component regression for your data.

The post Principal component regression in SAS appeared first on The DO Loop.