Getting Started

June 10, 2019
 

Recoding variables can be tedious, but it is often a necessary part of data analysis. Almost every SAS programmer has written a DATA step that uses IF-THEN/ELSE logic or the SELECT-WHEN statements to recode variables. Although creating a new variable is effective, it is also inefficient because you have to create a new data set that contains the new variable. For large data sets, this is wasteful: most of the data remain the same; only the recoded variables are different.

There is an alternative approach: You can use PROC FORMAT in Base SAS to define a custom SAS format. When you use PROC FORMAT, the data are never changed, but all the SAS reports and analyses can display the formatted values instead of the raw data values. You can use the same format for multiple data sets. You can even define multiple formats to analyze the same variable in multiple ways.

An example of using a format to recode a variable

In the simplest situation, a recoding of a variable converts each raw value to an easier-to-interpret value. For example, suppose that the gender variable for patients is recorded as a binary 0/1 variable. This is a terrible choice because it is not clear whether 0 represents males or females. The following example shows the typical IF-THEN logic for recoding 0 as "Female" and 1 as "Male" by creating a new data set and a new variable:

/* original data: Gender is binary variable, which is hard to understand! */
data Have;              
input Gender @@;
datalines;
1 0 0 0 1 1 0 0 . 1 1 0 0 0 0 1 1 1 . 1 1 
;
 
/* Recode by using IF-THEN or SELECT-WHEN. This can be inefficient. */
data HaveRecode;
set Have;
/* use IF-THEN logic to recode gender */
length Gender_Recode $6;
if      Gender=0 then Gender_Recode = "Female";
else if Gender=1 then Gender_Recode = "Male";
else Gender_Recode = " ";
run;
 
proc freq data=HaveRecode;
   tables Gender_Recode Gender;
run;
[Figure: Recode variables]

The table for the Gender_Recode variable is shown. The data, which were originally coded as a binary indicator variable, have been duplicated by creating a character variable that contains the same information but is more understandable. Of course, now you have to use the new variable name to analyze the recoded data. If you have already written programs that refer to the Gender variable, you have to update the programs to use the new variable name. Yuck!

A more efficient choice is to use a custom-defined format. The beauty of using a format is that you do not have to change the data. Instead, you simply define a format that changes the way that the data are used and displayed in SAS procedures. (A data view is a third alternative, but formats have additional advantages.)

You can define the following format (called GenderFmt.), which displays the gender data as "Female" and "Male" without modifying the data set:

/* use a format to recode gender */
proc format;
value GenderFmt
      0 = "Female"
      1 = "Male" 
      other = " ";
run;
 
/* apply the format to original data; no need to create new data set */
proc freq data=Have;
   format Gender GenderFmt.;    /* the name of the format includes a period */
   tables Gender;
run;
[Figure: Use PROC FORMAT to recode variables in SAS]

Notice that the analysis is run on the original data and uses the original variable name. No additional data sets, views, or variables are created.

Use a format to recode a character variable

Did you know that you can use PROC FORMAT to define formats for character variables? Formats for character variables are used less often than formats for numeric variables, but the syntax is similar. The main difference is that the name of a character format starts with the '$' symbol.

In addition to recoding the values of a categorical variable, formats are useful because they enable you to merge or combine categories by defining a many-to-one mapping. For example, the following character format recodes values of the TYPE variable and also combines the 'SUV' and 'Wagon' categories into a single category. Although it is not needed for this example, notice that the format also includes an 'Other' category, which can be used to combine small groups. The 'Other' category will also handle invalid data.

/* Create sample data from Sashelp.Cars. Exclude hybrids. Optionally sort the data */
proc sort data=sashelp.cars(where=(Type^='Hybrid')) out=Cars; 
   by MPG_City; 
run;
 
proc format;
value $CarTypeFmt
      'Sedan' = 'Family Car'
      'Sports' = 'Sports Car'
      'SUV','Wagon' = 'Big Car'
      'Truck' = 'Truck'
      Other = 'Other';
run;
 
proc freq data=Cars;
   format Type $CarTypeFmt.;   /* the name of the format includes a period at the end */
   tables Type;
run;
[Figure: Use PROC FORMAT to recode variables in SAS]

Using a format enables you to analyze the original data (omit the FORMAT statement) or apply the format (include the FORMAT statement). You can even define multiple formats if you want to slice and dice the data in various ways.
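For example, the following sketch defines a second format for the same Type variable. (The format name $TruckFmt and its two categories are purely illustrative; they are not part of the original example.)

/* sketch: a second format that slices the same variable differently */
proc format;
value $TruckFmt
      'Truck' = 'Truck'
      Other   = 'Not a Truck';
run;
 
proc freq data=Cars;
   format Type $TruckFmt.;      /* apply the alternative format; the data are unchanged */
   tables Type;
run;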

Use a format to bin numeric variables

One of my favorite SAS tricks is to use a format to bin numeric variables into categories. In the following example, the MPG_City variable is used to group vehicles into four categories based on how fuel-efficient the vehicles are. You can use this format to perform any computation that requires a classification variable. The example shows a two-way frequency analysis of the two variables for which we have defined custom formats:

proc format;
value MPGFmt  
      low -<  15   = "Gas Guzzler"    /* < 15      */
       15 -<  20   = "Not Good"       /* [ 15, 20) */
       20 -<  25   = "Good"           /* [ 20, 25) */
       25 -   high = "Great";         /* >= 25     */
run;
 
proc freq data=Cars order=data;
   format MPG_City MPGFmt. Type $CarTypeFmt.;
   tables MPG_City * Type / nocol norow nopercent;
run;
[Figure: Use PROC FORMAT to recode variables in SAS]

Store and retrieve formats

Formats are stored in a catalog, which is stored separately from the data. By default, SAS stores the formats in a catalog named WORK.FORMATS. Like everything else stored in WORK, that catalog will vanish when you end the SAS session. Therefore, you need to store the formats in a permanent libref if you want to reuse the formats across SAS sessions.
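If you are curious which formats are currently in that catalog, the following minimal sketch uses the FMTLIB option to document them. (This only prints descriptions of the formats; it does not modify them.)

/* sketch: describe the formats stored in the WORK.FORMATS catalog */
proc format library=work fmtlib;
run;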

SAS supports several features that help you to maintain a permanent library of formats. Here are two facts about format catalogs:

  • You can use the LIBRARY= option on the PROC FORMAT statement to specify a libref in which to store the format catalog. By default, the catalog will be named FORMATS.
  • SAS maintains a list of librefs to search through to find formats. By default, it looks in WORK and a special libref named LIBRARY.

These facts imply that you can do two simple things to create a permanent library of formats. First, define a permanent libref named LIBRARY (the name is important!) that will contain your catalog of formats. Second, specify the LIBRARY=LIBRARY option when you define the format, as follows:

libname library "C:/MyFormats";   /* the libref 'LIBRARY' has special significance! */
proc format library=library;      /* adds format to the permanent catalog LIBRARY.FORMATS */
value $CarTypeFmt
      'Sedan' = 'Family Car'    'Sports' = 'Sports Car'
      'SUV','Wagon' = 'Big Car' 'Truck' = 'Truck'       Other = 'Other';
run;

When you start a new SAS session, you will need to define the LIBRARY libref again if you want to access the formats. For convenience, many people put the LIBNAME statement in their AutoExec.sas file. Because SAS searches for formats in the LIBRARY.FORMATS catalog, SAS will automatically find the $CarTypeFmt. format.
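For example, a later session might contain only the following statements. (This sketch assumes that the Cars data set is also available in the new session, for example, because you saved it to a permanent library; otherwise, substitute your own data.)

libname library "C:/MyFormats";   /* re-establish the format search location */
proc freq data=Cars;
   format Type $CarTypeFmt.;      /* SAS finds the format in LIBRARY.FORMATS */
   tables Type;
run;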

SAS provides many other options for storing formats and for specifying the search locations for formats. For details, see the SAS usage note "How can I permanently store and use formats that I have created?" or John Ladd's 2012 paper, "Yes, We Can... Save SAS Formats."

Summary

In summary, if you need to recode data, custom-defined formats provide an easy alternative to physically changing the data. This article discusses five advantages to using formats to recode data:

  • The data do not change. You can use the original variable names in the analyses.
  • You can apply formats to both character and numerical variables.
  • You can use formats to merge categories and to bin numeric variables.
  • You can apply a format to multiple variables in multiple data sets.
  • You can save formats in a permanent libref and use them across SAS sessions.

Do you maintain a library of SAS formats at your workplace? Leave a comment to share your experience and your best practices.

The post 5 reasons to use PROC FORMAT to recode variables in SAS appeared first on The DO Loop.

June 5, 2019
 

A family of curves is generated by an equation that has one or more parameters. To visualize the family, you might want to display a graph that overlays four or five curves that have different parameter values, as shown to the right. The graph shows members of a family of exponential transformations of the form
f(x; α) = (1 – exp(-α x)) / (1 – exp(-α))
for α > 0 and x ∈ [0, 1]. This graph enables you to see how the parameter affects the shape of the curve. For example, for small values of the parameter, α, the transformation is close to the identity transformation. For larger values of α, the nonlinear transformation stretches intervals near x=0 and compresses intervals near x=1.

Here's a tip for creating a graph like this in SAS. Generate the data in "long format" and use the GROUP= option on the SERIES statement in PROC SGPLOT to plot the curves and control their attributes. The long format and the GROUP= option make it easy to visualize the family of curves.

A family of exponential transformations

I recently read a technical article that used the exponential family given above. The authors introduced the family and stated that they would use α = 8 in their paper. Although I could determine in my head that the function is monotonically increasing on [0, 1] and f(0)=0 and f(1)=1, I had no idea what the transformation looked like for α = 8. However, it is easy to use SAS to generate members of the family for different values of α and overlay the curves:

data ExpTransform;
do alpha = 1 to 7 by 2;                             /* parameters in the outer loop */
   do x = 0 to 1 by 0.01;                           /* domain of function    */
      y = (1-exp(-alpha*x)) / (1 - exp(-alpha));    /* f(x; alpha) on domain */
      output;
   end;
end;
run;
 
/* Use ODS GRAPHICS / ATTRPRIORITY=NONE 
   if you want to force the line attributes to vary in the HTML destination. */
ods graphics / width=400px height=400px;
title "Exponential Family of Transformations";
proc sgplot data=ExpTransform;
   series x=x y=y / group=alpha lineattrs=(thickness=2);
   keylegend / location=inside position=E across=1 opaque sortorder=reverseauto;
   xaxis grid;  yaxis grid;
run;

The graph is shown at the top of this article. The best way to create this graph is to generate the points in the long-data format because:

  • The outer loop controls the values of the parameters and how many curves are drawn. You can use a DO loop to generate evenly spaced parameters or specify an arbitrary sequence of parameters by using the syntax
    DO alpha = 1, 3, 6, 10;
  • The domain of the curve might depend on the parameter value. As shown in the next section, you might want to use a different set of points for each curve.
  • You can use the GROUP= option and the KEYLEGEND statement in PROC SGPLOT to visualize the family of curves.

Visualize a two-parameter family of curves

You can use the same ideas and syntax to plot a two-parameter family of curves. For example, you might want to visualize the density of the Beta distribution for representative values of the shape parameters, a and b. The Wikipedia article about the Beta distribution uses five pairs of (a, b) values; I've used the same values in the following SAS program:

data BetaDist;
array alpha[5] _temporary_ (0.5 5 1 2 2);
array beta [5] _temporary_ (0.5 1 3 2 5);
do i = 1 to dim(alpha);                       /* parameters in the outer loop */
   a = alpha[i]; b = beta[i];
   Params = catt("a=", a, "; b=", b);         /* concatenate parameters */
   do x = 0 to 0.99 by 0.01;
      pdf = pdf("Beta", x, a, b);             /* evaluate the Beta(x; a, b) density */
      if pdf < 2.5 then output;               /* exclude large values */
   end;
end;
run;
 
ods graphics / reset;
title "Probability Density of the Beta(a, b) Distribution";
proc sgplot data=BetaDist;
   label pdf="Density";
   series x=x y=pdf / group=Params lineattrs=(thickness=2);
   keylegend / position=right;
   xaxis grid;  yaxis grid;
run;

The resulting graph gives a good overview of how the parameters in the Beta distribution affect the shape of the probability density function. The program uses a few tricks:

  • The parameters are stored in arrays. The program loops over the number of parameters.
  • A SAS concatenation function concatenates the parameters into a string that identifies each curve. The CAT, CATS, CATT, and CATX functions are powerful and useful!
  • For this family, several curves are unbounded. The program caps the maximum vertical value of the graph at 2.5.
  • Although it is not obvious, some of the curves are drawn by using 100 points whereas others use fewer points. This is an advantage of using the long format.

In summary, you can use PROC SGPLOT to visualize a family of curves. The task is easiest when you generate the points along each curve in the "long format." The long format is easier to work with than the "wide format" in which each curve is stored in a separate Y variable. When the curve values are in long form, you can use the GROUP= option on the SERIES statement to create an effective visualization by using a small number of statements.

The post Plot a family of curves in SAS appeared first on The DO Loop.

March 11, 2019
 

A SAS programmer posted an interesting question on a SAS discussion forum. The programmer wanted to iterate over hundreds of SAS data sets, read in all the character variables, and then do some analysis. However, not every data set contains character variables, and SAS complains when you ask it to read the character variables in a data set that contains only numeric variables.

The programmer wanted to use PROC IML to solve the problem, but the issue also occurs in the SAS DATA step. The following program creates three data sets. Two of them (AllChar and Mixed) contain at least one character variable. The third data set (AllNum) does not contain any character variables. For the third data set, an error occurs if you try to use the KEEP=_CHARACTER_ data set option, as shown in the following example:

data AllNum;
   x=1; y=2; z=3;
run;
data AllChar;
   A='ABC'; B='XYZW';
run;
data Mixed;
   name='Joe'; sex='M'; Height=1.8; Weight=81; treatment='Placebo'; 
run;
 
/* try to use KEEP=_CHARACTER_ to read only the character variables */
data KeepTheChar;
   set AllNum(keep=_CHARACTER_); /* ERROR when no character variables in the data set */
run;
   ERROR: The variable _CHARACTER_ in the DROP, KEEP, or RENAME list has never been referenced.

The same problem occurs in PROC IML if you try to read character variables when none exist:

proc iml;
use AllNum;
   read all var _CHAR_ into X; /* ERROR when no character variables in the data set */
close;
   ERROR: No character variables in the data set.

There are at least two ways to handle this situation:

  1. In both Base SAS and SAS/IML, you can use dictionary tables to determine in advance which data sets contain at least one character variable. You can then read only those data sets.
  2. In SAS/IML, you can read all variables into a table, then extract the character variables into a matrix for further processing.

Of course, the same ideas apply if you want to read only numeric variables and you encounter a data set that does not contain any numeric variables.

Use DICTIONARY tables to find information about your SAS session

If you have ever been to a SAS conference, you know that DICTIONARY tables are a favorite topic for SAS programmers. DICTIONARY tables are read-only tables that provide information about the state of the SAS session, including libraries, data sets, variables, and system options. You can access them directly by using PROC SQL. If you want to access the information in the DATA step or in other procedures (like PROC IML), you can use special data views in SASHELP. In particular, the Sashelp.VColumn view provides information about variables in SAS data sets and is often used to find data sets that contain certain variable names. (See the references at the end of this article for more information about DICTIONARY tables.)
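For example, the following sketch is a Base SAS version of the query that the next section performs in SAS/IML: it uses PROC SQL and DICTIONARY.COLUMNS to list the WORK data sets that contain at least one character variable.

/* sketch: list WORK data sets that have at least one character variable */
proc sql;
   select distinct memname
   from dictionary.columns
   where libname='WORK' and memtype='DATA' and type='char';
quit;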

The following SAS/IML program uses the Sashelp.VColumn view to find out which data sets contain at least one character variable:

proc iml;
/* Solution 1: Use dictionary table sashelp.vcolumn */
/* Find data sets in WORK that have AT LEAST ONE character variable */
use sashelp.vcolumn(where=(libname="WORK" & memtype='DATA' & type='char'));  /* read only CHAR variables */
   read all var {memname name};  /* memname=data set name; name=name of character variable */
close;
 
/* loop over data sets. If a set contains at least one character variable, process it */
dsName = {'AllNum' 'AllChar' 'Mixed'};       /* names of potential data sets */
do i = 1 to ncol(dsName);
   idx = loc(memname = upcase(dsName[i]));   /* is data set on the has-character-variable list? */
   /* for demo, print whether data set has character variables */
   msg = "The data set " + (dsName[i]) + " contains " + 
          char(ncol(idx)) + " character variables.";
   print msg;
 
   if ncol(idx)>0 then do;            /* the data set exists and has character vars */
      charVars = name[idx];           /* get the names of the character vars */
      use (dsName[i]);                /* open the data set for reading */
      read all var charVars into X;   /* read character variables (always succeeds) */
      close;
      /* process the data */
   end;
end;

The output shows that you can use the DICTIONARY tables to determine which data sets have at least one character variable. You can then use the USE/READ statements in PROC IML to read the character variables and process the data however you wish. As mentioned previously, this technique can also be used in PROC SQL and the DATA step.

Use SAS/IML tables to find character variables

The previous section is very efficient because only character variables are ever read into SAS/IML matrices. However, there might be situations when you want to process character variables (if they exist) and then later process numerical variables (if they exist). Although a SAS/IML matrix contains only one data type (either all numeric or all character), you can read mixed-type data into a SAS/IML table, which supports both numeric and character variables. You can then use the TableIsVarNumeric function to generate a binary indicator variable that tells you which variables in the data are numeric and which are character, as follows:

/* Solution 2: Read all data into a table. Use the TableIsVarNumeric function to determine
   which variables are numeric and which are character. */
dsName = {'AllNum' 'AllChar' 'Mixed'};             /* names of potential data sets */
do i = 1 to ncol(dsName);                          /* for each data set... */
   T = TableCreateFromDataset("WORK", dsName[i]);  /* read all variables into a table */
   numerInd = TableIsVarNumeric(T);                /* binary indicator vector for numeric vars */
   charInd = ^numerInd;                            /* binary indicator vector for character vars */
   numCharVars = sum(charInd);                     /* count of character variables in this data set */
   msg = "The data set " + (dsName[i]) + " contains " + 
         char(numCharVars) + " character variables.";
   print msg;
   if numCharVars > 0 then do;
      X = TableGetVarData(T, loc(charInd));        /* extract the character variables into X */
      /* process the data */
   end;
   /* optionally process the numeric data */
   numNumerVars = sum(numerInd);                   /* count of numeric variables in this data set */
   /* etc */
end;

The output is identical to the output in the previous section.

Summary

In summary, this article discusses a programmer who wants to iterate over many SAS data sets and process only character variables. However, some of the data sets do not have any character variables! This article shows two methods for dealing with this situation: DICTIONARY tables (available through Sashelp views) or SAS/IML tables. The first method is also available in Base SAS.

Of course, you can also use this trick to read all numeric variables when some of the data sets might not have any numeric variable. I've previously written about how to read all numeric variables into a SAS/IML matrix by using the _ALL_ keyword. If the data set contains both numeric and character variables, then only the numeric variables are read.
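The following minimal sketch illustrates that behavior by using the Mixed data set from the earlier example:

proc iml;
use Mixed;
   read all var _ALL_ into Y;   /* only the numeric variables (Height, Weight) are read */
close;
print Y;
quit;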

References

The following resources provide more information about DICTIONARY tables in SAS:

The post How to detect SAS data sets that contain (or do not contain) character variables appeared first on The DO Loop.

November 14, 2018
 

An ROC curve graphically summarizes the tradeoff between true positives and true negatives for a rule or model that predicts a binary response variable. An ROC curve is a parametric curve that is constructed by varying the cutpoint value at which estimated probabilities are considered to predict the binary event. Most SAS data analysts know that you can fit a logistic model in PROC LOGISTIC and create an ROC curve for that model, but did you know that PROC LOGISTIC enables you to create and compare ROC curves for ANY vector of predicted probabilities regardless of where the predictions came from? This article shows how!

If you want to review the basic constructions of an ROC curve, you can see a previous article that constructs an empirical ROC curve from first principles. The PROC LOGISTIC documentation provides formulas used for constructing an ROC curve.

Produce an ROC plot by using PROC LOGISTIC

Before discussing how to create an ROC plot from an arbitrary vector of predicted probabilities, let's review how to create an ROC curve from a model that is fit by using PROC LOGISTIC. The following data and model are taken from the PROC LOGISTIC documentation. The data are for 43 cancer patients who also had an intestinal obstruction. The response variable popInd is a postoperative indicator variable: popInd = 1 for patients who died within two months after surgery. The explanatory variables are three pre-operative screening tests. The goal of the study is to determine patients who might benefit from surgery, where "benefit" is measured by postoperative survival of at least two months.

data roc;
   input alb tp totscore popind @@;
   totscore = 10 - totscore;
   datalines;
3.0 5.8 10 0   3.2 6.3  5 1   3.9 6.8  3 1   2.8 4.8  6 0
3.2 5.8  3 1   0.9 4.0  5 0   2.5 5.7  8 0   1.6 5.6  5 1
3.8 5.7  5 1   3.7 6.7  6 1   3.2 5.4  4 1   3.8 6.6  6 1
4.1 6.6  5 1   3.6 5.7  5 1   4.3 7.0  4 1   3.6 6.7  4 0
2.3 4.4  6 1   4.2 7.6  4 0   4.0 6.6  6 0   3.5 5.8  6 1
3.8 6.8  7 1   3.0 4.7  8 0   4.5 7.4  5 1   3.7 7.4  5 1
3.1 6.6  6 1   4.1 8.2  6 1   4.3 7.0  5 1   4.3 6.5  4 1
3.2 5.1  5 1   2.6 4.7  6 1   3.3 6.8  6 0   1.7 4.0  7 0
3.7 6.1  5 1   3.3 6.3  7 1   4.2 7.7  6 1   3.5 6.2  5 1
2.9 5.7  9 0   2.1 4.8  7 1   2.8 6.2  8 0   4.0 7.0  7 1
3.3 5.7  6 1   3.7 6.9  5 1   3.6 6.6  5 1
;
 
ods graphics on;
proc logistic data=roc plots(only)=roc;
   LogisticModel: model popind(event='0') = alb tp totscore;
   output out=LogiOut predicted=LogiPred;       /* output predicted value, to be used later */
run;
[Figure: ROC curve for the linear logistic model fitted in PROC LOGISTIC in SAS]

You can see the documentation for details about how to interpret the output from PROC LOGISTIC, but the example shows that you can use the PLOTS=ROC option (or the ROC statement) to create an ROC curve for a model that is fit by PROC LOGISTIC. For this model, the area under the ROC curve is 0.77. Because a random "coin flip" prediction has an expected area of 0.5, this model predicts the survival of surgery patients better than random chance.

Create an ROC curve for any prediction rule

A logistic model is not the only way to predict a binary response. You could also use a decision tree, a generalized mixed model, a nonparametric regression model, or even ask a human expert for her opinion. An ROC curve only requires two quantities: for each observation, you need the observed binary response and a predicted probability. In fact, if you carefully read the PROC LOGISTIC documentation, you will find these sentences:

  • In the "Details" section: "ROC curves can be created ... from the specified model in the MODEL statement, from specified models in ROC statements, or from input variables which act as [predicted probabilities]." (Emphasis added.)
  • In the documentation of the ROC statement: "The PRED= option enables you to input a criterion produced outside PROC LOGISTIC; for example, you can fit a random-intercept model by using PROC GLIMMIX or use survey weights in PROC SURVEYLOGISTIC, then use the predicted values from those models to produce an ROC curve for the comparisons."

In other words, you can use PROC LOGISTIC to create an ROC curve regardless of how the predicted probabilities are obtained! For argument's sake, let's suppose that you ask a human expert to predict the probability of each patient surviving for at least two months after surgery. (Notice that there is no statistical model here, only a probability for each patient.) The following SAS DATA step defines the predicted probabilities, which are then merged with the output from the earlier PROC LOGISTIC call:

data ExpertPred;
   input ExpertPred @@;
   datalines;
0.95 0.2  0.05 0.3  0.1  0.6  0.8  0.5 
0.1  0.25 0.1  0.2  0.05 0.1  0.05 0.1 
0.4  0.1  0.2  0.25 0.4  0.7  0.1  0.1 
0.3  0.2  0.1  0.05 0.1  0.4  0.4  0.7
0.2  0.4  0.1  0.1  0.9  0.7  0.8  0.25
0.3  0.1  0.1 
;
data Survival;
   merge LogiOut ExpertPred;
run;
 
/* create ROC curve from a variable that contains predicted values */
proc logistic data=Survival;
   model popind(event='0') = ExpertPred / nofit;
   roc 'Expert Predictions' pred=ExpertPred;
   ods select ROCcurve;
run;
[Figure: ROC curve created from external predictions with PROC LOGISTIC in SAS]

Notice that you only need to supply two variables on the MODEL statement: the observed responses and the variable that contains the predicted values. On the ROC statement, I've used the PRED= option to indicate that the ExpertPred variable is not being fitted by the procedure. Although PROC LOGISTIC creates many tables, I've used the ODS SELECT statement to suppress all output except for the ROC curve.

Overlay and compare ROC curves from different models or rules

You might want to overlay and compare ROC curves from multiple predictive models (either from PROC LOGISTIC or from other sources). PROC LOGISTIC can do that as well. You just need to merge the various predicted probabilities into a single SAS data set and then specify multiple ROC statements, as follows:

/* overlay two or more ROC curves by using variables of predicted values */
proc logistic data=Survival;
   model popind(event='0') = LogiPred ExpertPred / nofit;
   roc 'Logistic' pred=LogiPred;
   roc 'Expert'   pred=ExpertPred;
   ods select ROCOverlay;
   /* optional: for a statistical comparison, use ROCCONTRAST stmt and remove the ODS SELECT stmt */
   *roccontrast reference('Expert') / estimate e;
run;
[Figure: Overlay of ROC curves compared by using PROC LOGISTIC in SAS]

This ROC overlay shows that the "expert" prediction is almost always superior or equivalent to the logistic model in terms of true and false classification rates. As noted in the comments of the previous call to PROC LOGISTIC, you can use the ROCCONTRAST statement to obtain a statistical analysis of the difference between the areas under the curves (AUC).

In summary, you can use the ROC statement in PROC LOGISTIC to generate ROC curves for models that were computed outside of PROC LOGISTIC. All you need are the predicted probabilities and observed response for each observation. You can also overlay and compare two or more ROC curves and use the ROCCONTRAST statement to analyze the difference between areas under the curves.

The post Create and compare ROC curves for any predictive model appeared first on The DO Loop.

August 27, 2018
 

A frequent topic on SAS discussion forums is how to check the assumptions of an ordinary least squares linear regression model. Some posts indicate misconceptions about the assumptions of linear regression. In particular, I see incorrect statements such as the following:

  • Help! A histogram of my variables shows that they are not normal! I need to apply a normalizing transformation before I can run a regression....
  • Before I run a linear regression, I need to test that my response variable is normal....

Let me be perfectly clear: The variables in a least squares regression model do not have to be normally distributed. I'm not sure where this misconception came from, but perhaps people are (mis)remembering an assumption about the errors in an ordinary least squares (OLS) regression model. If the errors are normally distributed, you can prove theorems about inferential statistics such as confidence intervals and hypothesis tests for the regression coefficients. However, the normality-of-errors assumption is not required for the validity of the parameter estimates in OLS. For the details, the Wikipedia article on ordinary least squares regression lists four required assumptions; the normality of errors is listed as an optional fifth assumption.

In practice, analysts often "check the assumptions" by running the regression and then examining diagnostic plots and statistics. Diagnostic plots help you to determine whether the data reveal any deviations from the assumptions for linear regression. Consequently, this article provides a "getting started" example that demonstrates the following:

  1. The variables in a linear regression do not need to be normal for the regression to be valid.
  2. You can use the diagnostic plots that are produced automatically by PROC REG in SAS to check whether the data seem to satisfy some of the linear regression assumptions.

By the way, don't feel too bad if you misremember some of the assumptions of linear regression. Williams, Grajales, and Kurkiewicz (2013) point out that even professional statisticians sometimes get confused.

An example of nonnormal data in regression

Consider this thought experiment: Take any explanatory variable, X, and define Y = X. A linear regression model perfectly fits the data with zero error. The fit does not depend on the distribution of X or Y, which demonstrates that normality is not a requirement for linear regression.
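The following minimal sketch carries out the thought experiment. The choice of an exponential distribution for X is purely illustrative; any nonnormal distribution makes the same point.

/* sketch: X is far from normal, yet the model Y = X fits exactly */
data Ident;
call streaminit(1);
do i = 1 to 100;
   X = rand("Exponential");     /* skewed, nonnormal explanatory variable */
   Y = X;                       /* exact linear relationship, zero error */
   output;
end;
run;
 
proc reg data=Ident plots=none;
   model Y = X;                 /* R-square = 1 */
quit;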

For a numerical example, you can simulate data such that the explanatory variable is binary or is clustered close to two values. The following data shows an X variable that has 20 values near X=5 and 20 values near X=10. The response variable, Y, is approximately five times each X value. (This example is modified from an example in Williams, Grajales, and Kurkiewicz, 2013.) Neither variable is normally distributed, as shown by the output from PROC UNIVARIATE:

/* For n=1..20, X ~ N(5, 1). For n=21..40, X ~ N(10, 1).
   Y = 5*X + e, where e ~ N(0,1) */
data Have;
input X Y @@;
datalines;
 3.60 16.85  4.30 21.30  4.45 23.30  4.50 21.50  4.65 23.20 
 4.90 25.30  4.95 24.95  5.00 25.45  5.05 25.80  5.05 26.05 
 5.10 25.00  5.15 26.45  5.20 26.10  5.40 26.85  5.45 27.90 
 5.70 28.70  5.70 29.35  5.90 28.05  5.90 30.50  6.60 33.05 
 8.30 42.50  9.00 45.50  9.35 46.45  9.50 48.40  9.70 48.30 
 9.90 49.80 10.00 48.60 10.05 50.25 10.10 50.65 10.30 51.20 
10.35 49.80 10.50 53.30 10.55 52.15 10.85 56.10 11.05 55.15 
11.35 55.95 11.35 57.90 11.40 57.25 11.60 57.95 11.75 61.15 
;
 
proc univariate data=Have;
   var x y;
   histogram x y / normal;
run;

There is no need to "normalize" these data prior to performing an OLS regression, although it is always a good idea to create a scatter plot to check whether the variables appear to be linearly related. When you regress Y onto X, you can assess the fit by using the many diagnostic plots and statistics that are available in your statistical software. In SAS, PROC REG automatically produces a diagnostic panel of graphs and a table of fit statistics (such as R-squared):

/* by default, PROC REG creates a FitPlot, ResidualPlot, and a Diagnostics panel */
ods graphics on;
proc reg data=Have;
   model Y = X;
quit;

The R-squared value for the model is 0.9961, which is almost a perfect fit, as seen in the fit plot of Y versus X.

Using diagnostic plots to check the assumptions of linear regression

You can use the graphs in the diagnostics panel to investigate whether the data appear to satisfy the assumptions of least squares linear regression. The panel is shown below.

The first column in the panel shows graphs of the residuals for the model. For these data and for this model, the graphs show the following:

  • The top-left graph shows a plot of the residuals versus the predicted values. You can use this graph to check several assumptions: whether the model is specified correctly, whether the residual values appear to be independent, and whether the errors have constant variance (homoscedastic). The graph for this model does not show any misspecification, autocorrelation, or heteroscedasticity.
  • The middle-left and bottom-left graphs indicate whether the residuals are normally distributed. The middle plot is a normal quantile-quantile plot. The bottom plot is a histogram of the residuals overlaid with a normal curve. Both these graphs indicate that the residuals are normally distributed. This is evidence that you can trust the p-values for significance and the confidence intervals for the parameters.

In summary, I wrote this article to address two points:

  1. To dispel the myth that variables in a regression need to be normal. They do not. However, you should check whether the residuals of the model are approximately normal because normality is important for the accuracy of the inferential portions of linear regression such as confidence intervals and hypothesis tests for parameters. (A colleague mentioned to me that standard errors and hypothesis tests tend to be robust to this assumption, so a modest departure from normality is often acceptable.)
  2. To show that the SAS regression procedures automatically provide many graphical diagnostic plots that you can use to assess the fit of the model and check some assumptions for least squares regression. In particular, you can use the plots to check the independence of errors, the constant variance of errors, and the normality of errors.

References

There have been many excellent books and papers that describe the various assumptions of linear regression. I don't feel a need to rehash what has already been written. In addition to the Wikipedia article about ordinary linear regression, I recommend the following:

The post On the assumptions (and misconceptions) of linear regression appeared first on The DO Loop.

August 22, 2018
 

A SAS programmer recently asked how to interpret the "standardized regression coefficients" as computed by the STB option on the MODEL statement in PROC REG and other SAS regression procedures. The SAS documentation for the STB option states, "a standardized regression coefficient is computed by dividing a parameter estimate by the ratio of the sample standard deviation of the dependent variable to the sample standard deviation of the regressor." Although correct, this definition does not provide an intuitive feeling for how to interpret the standardized regression estimates. This article uses SAS to demonstrate how parameter estimates for the original variables are related to parameter estimates for standardized variables. It also derives how regression coefficients change after a linear transformation of the variables.

Proof by example

One of my college physics professors used to smile and say "I will now prove this by example" when he wanted to demonstrate a fact without proving it mathematically. This section uses PROC STDIZE and PROC REG to "prove by example" that the standardized regression estimates for data are equal to the estimates that you obtain by standardizing the data. The following example uses continuous response and explanatory variables, but there is a SAS Usage Note that describes how to standardize classification variables.

The following call to PROC REG uses the STB option to compute the standardized parameter estimates for a model that predicts the weights of 19 students from heights and ages:

proc reg data=Sashelp.Class plots=none;
   Orig: model Weight = Height Age / stb;
   ods select ParameterEstimates;
quit;

The last column is the result of the STB option on the MODEL statement. You can get the same numbers by first standardizing the data and then performing a regression on the standardized variables, as follows:

/* Put original and standardized variables into the output data set.
   Standardized variables have the names 'StdX' where X was the name of the original variable. 
   METHOD=STD standardizes variables according to StdX = (X - mean(X)) / std(X) */
proc stdize data=Sashelp.Class out=ClassStd method=std OPREFIX SPREFIX=Std;
run;
 
proc reg data=ClassStd plots=none;
   Std: model StdWeight = StdHeight StdAge;
   ods select ParameterEstimates;
quit;

The parameter estimates for the standardized data are equal to the STB estimates for the original data. Furthermore, the t values and p-values for the slope parameters are equivalent because these statistics are scale- and translation-invariant. Notice, however, that scale-dependent statistics such as standard errors and covariance of the betas will not be the same for the two analyses.

Linear transformations of random variables

Mathematically, you can use a little algebra to understand how linear transformations affect the relationship between two linearly dependent random variables. Suppose X is a random variable and Y = b0 + b1*X for some constants b0 and b1. What happens to this linear relationship if you apply linear transformations to X and Y?

Define new variables U = (Y - cY) / sY and V = (X - cX) / sX. If you solve for X and Y and plug the expressions into the equation for the linear relationship, you find that the new random variables are related by
U = (b0 + b1*cX - cY) / sY + b1*(sX/sY)*V.
If you define B0 = (b0 + b1*cX - cY) / sY and B1 = b1*(sX/sY), then U = B0 + B1*V, which shows that the transformed variables (U and V) are linearly related.
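If you want a numerical sanity check, the following sketch plugs arbitrary constants (chosen only for illustration) into these formulas and confirms that U equals B0 + B1*V. The code uses the names Beta0 and Beta1 because SAS variable names are not case-sensitive, so B0 and b0 would be the same variable.

data _NULL_;
b0 = 2;  b1 = 3;                     /* Y = b0 + b1*X */
cX = 5;  sX = 2;  cY = 4;  sY = 7;   /* arbitrary centering and scaling constants */
x = 1.5;  y = b0 + b1*x;
U = (y - cY) / sY;   V = (x - cX) / sX;
Beta0 = (b0 + b1*cX - cY) / sY;
Beta1 = b1*(sX/sY);
U2 = Beta0 + Beta1*V;
put U= U2=;                          /* the two values agree */
run;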

The physical significance of this result is that linear relationships persist no matter what units you choose to measure the variables. For example, if X is measured in inches and Y is measured in pounds, then the quantities remain linearly related if you measure X in centimeters and measure Y in "kilograms from 20 kg."

The effect of standardizing variables on regression estimates

The analysis in the previous section holds for any linear transformation of linearly related random variables. But suppose, in addition, that

  • U and V are standardized versions of Y and X, respectively. That is, cY and cX are the sample means and sY and sX are the sample standard deviations.
  • The parameters b0 and b1 are the regression estimates for a simple linear regression model.

For simple linear regression, the intercept estimate is b0 = cY - b1*cX, which implies that B0 = 0. Furthermore, the coefficient B1 = b1*(sX/sY) is the original parameter estimate "divided by the ratio of the sample standard deviation of the dependent variable to the sample standard deviation of the regressor," just as the PROC REG documentation states and just as we saw in the PROC REG output in the previous section. Thus the STB option on the MODEL statement does not need to standardize any data! It produces the standardized estimates by setting the intercept term to 0 and dividing the parameter estimates by the ratio of standard deviations, as noted in the documentation. (A similar proof handles multiple explanatory variables.)
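As a check, the following sketch reproduces the standardized estimates for the earlier Sashelp.Class model without standardizing any data: it captures the unstandardized estimates and multiplies each slope by sX/sY. (The data set names PE, SD, and STB are arbitrary.)

proc reg data=Sashelp.Class plots=none;
   model Weight = Height Age;
   ods output ParameterEstimates = PE;       /* capture the unstandardized estimates */
quit;
 
proc means data=Sashelp.Class noprint;
   var Weight Height Age;
   output out=SD std=sWeight sHeight sAge;   /* sample standard deviations */
run;
 
data STB;
   set PE;
   if _N_ = 1 then set SD;                   /* the standard deviations are retained for every row */
   if      Variable = 'Height' then StdEst = Estimate * sHeight / sWeight;
   else if Variable = 'Age'    then StdEst = Estimate * sAge    / sWeight;
   else                             StdEst = 0;    /* the standardized intercept is 0 */
   keep Variable Estimate StdEst;
run;
 
proc print data=STB noobs; run;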

Interpretation of the regression coefficients

For the original (unstandardized) data, the intercept estimate predicts the value of the response when the explanatory variables are all zero. The regression coefficients predict the change in the response for one unit change in an explanatory variable. The "change in response" depends on the units for the data, such as kilograms per centimeter.

The standardized coefficients predict the number of standard deviations that the response will change for one STANDARD DEVIATION of change in an explanatory variable. The "change in response" is a unitless quantity. The fact that the standardized intercept is 0 indicates that the predicted value of the (centered) response is 0 when the model is evaluated at the mean values of the explanatory variables.

In summary, standardized coefficients are the parameter estimates that you would obtain if you standardize the response and explanatory variables by centering and scaling the data. A standardized parameter estimate predicts the change in the response variable (in standard deviations) for one standard deviation of change in the explanatory variable.

The post Standardized regression coefficients appeared first on The DO Loop.

June 11, 2018
 

In SAS, the reserved keyword _NULL_ specifies a SAS data set that has no observations and no variables. When you specify _NULL_ as the name of an output data set, the output is not written. The _NULL_ data set is often used when you want to execute DATA step code that displays a result, defines a macro variable, writes a text file, or makes calls to the EXECUTE subroutine. In those cases, you are interested in the "side effect" of the DATA step and rarely want to write a data set to disk. This article presents six ways to use the _NULL_ data set. Because the _NULL_ keyword is used, no data set is created on disk.

#1. Use SAS as a giant calculator

You can compute a quantity in a DATA _NULL_ step and then use the PUT statement to output the answer to the SAS log. For example, the following DATA step evaluates the normal density function at x=0.5 when μ=1 and σ=2. The computation is performed twice: first using the built-in PDF function and again by using the formula for the normal density function. The SAS log shows that the answer is 0.193 in both cases.

data _NULL_;
mu = 1; sigma = 2; x = 0.5; 
pdf = pdf("Normal", x, mu, sigma);
y = exp(-(x-mu)**2 / (2*sigma**2)) / sqrt(2*constant('pi')*sigma**2);
put (pdf y) (=5.3);
run;
pdf=0.193 y=0.193

#2. Display characteristics of a data set

You can use a null DATA step to display characteristics of a data set. For example, the following DATA step uses the PUT statement to display the number of numeric and character variables in the Sashelp.Class data set. No data set is created.

data _NULL_;
set Sashelp.Class;
array char[*] $ _CHARACTER_;
array num[*] _NUMERIC_;
nCharVar  = dim(char);
nNumerVar = dim(num);
put "Sashelp.Class: " nCharVar= nNumerVar= ;
stop;   /* stop processing after first observation */
run;
Sashelp.Class: nCharVar=2 nNumerVar=3

You can also store these values in a macro variable, as shown in the next section.

#3. Create a macro variable from a value in a data set

You can use the SYMPUT or SYMPUTX subroutines to create a SAS macro variable from a value in a SAS data set. For example, suppose you run a SAS procedure that computes some statistic in a table. Sometimes the procedure supports an option to create an output data that contains the statistic. Other times you might need to use the ODS OUTPUT statement to write the table to a SAS data set. Regardless of how the statistic gets in a data set, you can use a DATA _NULL_ step to read the data set and store the value as a macro variable.

The following statements illustrate this technique. PROC MEANS creates a table called Summary, which contains the means of all numerical variables in the Sashelp.Class data. The ODS OUTPUT statement writes the Summary table to a SAS data set called Means. The DATA _NULL_ step finds the row for the Height variable and creates a macro variable called MeanHeight that contains the statistic. You can use that macro variable in subsequent steps of your analysis.

proc means data=Sashelp.Class mean stackods;
   ods output Summary = Means;
run;
 
data _NULL_;
set Means;
/* use PROC CONTENTS to determine that the columns are named Variable and Mean */
if Variable="Height" then             
   call symputx("MeanHeight", Mean);
run;
 
%put &=MeanHeight;
MEANHEIGHT=62.336842105

For a second example, see the article "What is a factoid in SAS," which shows how to perform the same technique with a factoid table.

#4. Create macro variable from a computational result

Sometimes there is no procedure that computes the quantity that you want, or you prefer to compute the quantity yourself. The following DATA _NULL_ step counts the number of complete cases for the numerical variables in the Sashelp.Heart data. It then displays the number of complete cases and the percent of complete cases in the data. You can obtain the same results if you use PROC MI and look at the MissPattern table.

data _NULL_;
set Sashelp.Heart end=eof nobs=nobs;
NumCompleteCases + (nmiss(of _NUMERIC_) = 0); /* increment if all variables are nonmissing */
if eof then do;                               /* when all observations have been read ... */
   PctComplete = NumCompleteCases / nobs;     /* ... find the percentage */
   put NumCompleteCases= PctComplete= PERCENT7.1;
end;
run;
NumCompleteCases=864 PctComplete=16.6%

#5. Edit a text file or ODS template "on the fly"

This is a favorite technique of Warren Kuhfeld, who is a master of writing a DATA _NULL_ step that modifies an ODS template. In fact, this technique is at the heart of the %MODSTYLE macro and the SAS macros that modify the Kaplan-Meier survival plot.

Although I am not as proficient as Warren, I wrote a blog post that introduces this template modification technique. The DATA _NULL_ step is used to modify an ODS template. It then uses CALL EXECUTE to run PROC TEMPLATE to compile the modified template.

#6. A debugging tool

All the previous tips use _NULL_ as the name of a data set that is not written to disk. It is a curious fact that you can use the _NULL_ data set in almost every SAS statement that expects a data set name!

For example, you can read from the _NULL_ data set. Although reading zero observations is not always useful, one application is to check the syntax of your SAS code. Another application is to check whether a procedure is installed on your system. For example, you can run the statements PROC ARIMA data=_NULL_; quit; to check whether you have access to the ARIMA procedure.
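For example, the following minimal sketch reads zero observations; if the procedure is not installed, the step fails with an error message about the procedure:

proc arima data=_NULL_;
quit;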

A third application is to use _NULL_ to suppress debugging output. During the development and debugging phase of your development, you might want to use PROC PRINT, PROC CONTENTS, and PROC MEANS to ensure that your program is working as intended. However, too much output can be a distraction, so sometimes I direct the debugging output to the _NULL_ data set where, of course, it magically vanishes! For example, the following DATA step subsets the Sashelp.Cars data. I might be unsure as to whether I created the subset correctly. If so, I can use PROC CONTENTS and PROC MEANS to display information about the subset, as follows:

data Cars;
set Sashelp.Cars(keep=Type _NUMERIC_);
if Type in ('Sedan', 'Sports', 'SUV', 'Truck'); /* subsetting IF statement */
run;
 
/* FOR DEBUGGING ONLY */
%let DebugName = Cars;  /* use _NULL_ to turn off debugging output */
proc contents data=&DebugName short;
run;
proc means data=&DebugName N Min Max;
run;

If I don't want to see this output (but I want the option to see it again later), I can modify the DebugName macro (%let DebugName = _NULL_;) so that the CONTENTS and MEANS procedures do not produce any output. If I do that and rerun the program, the program does not create any debugging output. However, I can easily restore the debugging output whenever I want.

Summary

In summary, the _NULL_ data set name is a valuable tool for SAS programmers. You can perform computations, create macro variables, and manipulate text files without creating a data set on disk. Although I didn't cover it in this article, you can use DATA _NULL_ in conjunction with ODS for creating customized tables and reports.

What is your favorite application of using the _NULL_ data set? Leave a comment.

The post 6 ways to use the _NULL_ data set in SAS appeared first on The DO Loop.

May 29, 2018
 

The SAS language provides syntax that enables you to quickly specify a list of variables. SAS statements that accept variable lists include the KEEP and DROP statements, the ARRAY statement, and the OF operator for comma-separated arguments to some functions. You can also use variable lists on the VAR statements and MODEL statements of analytic procedures.

This article describes six ways to specify a list of variables in SAS. There is a section in the SAS documentation that describes how to construct lists, but this blog post provides more context and a cut-and-paste example for every syntax. This article demonstrates the following:

  • Use the _NUMERIC_, _CHARACTER_, and _ALL_ keywords to specify variables of a certain type (numeric or character) or all types.
  • Use a single hyphen (-) to specify a range of variables that have a common prefix and a sequential set of numerical suffixes.
  • Use the colon operator (:) to specify a list of variables that begin with a common prefix.
  • Use a double-hyphen (--) to specify a consecutive set of variables, regardless of type. You can also use a variation of this syntax to specify a consecutive set of variables of a certain type (numeric or character).
  • Use the OF operator to specify variables in an array or in a function call.
  • Use macro variables to specify variables that satisfy certain characteristics.

Some companies might discourage the use of variable lists in production code because automated lists can be volatile. If the number and names of variables in your data sets occasionally change, it is safer to manually list the variables that you are analyzing. However, for developing code and constructing examples, lists can be a huge time saver.

Use the _NUMERIC_, _CHARACTER_, and _ALL_ keywords

You can specify all numeric variables in a data set by using the _NUMERIC_ keyword. You can specify all character variables by using the _CHARACTER_ keyword. Many SAS procedures use a VAR statement to specify the variables to be analyzed. When you want to analyze all variables of a certain type, you can use these keywords, as follows:

/* compute descriptive statistics of all numeric variables */
proc means data=Sashelp.Heart nolabels; 
   var _NUMERIC_;          /* _NUMERIC_ is the default */
run;
 
/* display the frequencies of all levels for all character variables */
proc freq data=Sashelp.Heart; 
   tables _CHARACTER_;    /* _ALL_ is the default */
run;
[Figure: Use a keyword to specify a list of variables in SAS]

One of my favorite SAS programming tricks is to use these keywords in a KEEP or DROP statement (or data set option). For example, the following statements create a new data set that contains all numeric variables and two character variables from the Sashelp.Heart data:

data HeartNumeric;
set Sashelp.Heart(keep=_NUMERIC_            /* all numeric variables */
                       Sex Smoking_Status); /* two character variables */
run;

An example of using the _ALL_ keyword is shown in the section that discusses the OF operator.

Use a hyphen to specify numerical suffixes

In many situations, variables are named with a common prefix and numerical suffix. For example, financial data might have variables that are named Sales2008, Sales2009, ..., Sales2017. In simulation studies, variables often have names such as X1, X2, ..., X50. The hyphen enables you to specify the first and last variable in a list. The first example can be specified as Sales2008-Sales2017. The second example is X1-X50.

The following DATA step creates 10 variables, including the variables x1-x6. Notice that the data set variables are not in alphanumeric order. That is okay. The syntax x1-x6 will select the six variables x1, x2, x3, x4, x5, and x6 regardless of their physical order in the data. The call to PROC REG uses the six variables in a linear regression:

data A;
   retain Y x1 x3 Z x6 x5 x2 W x4 R;  /* create 10 variables and one observation. Initialize to 0 */
run;
proc reg data=A plots=none;
   model Y = x1-x6;
run;

The parameter estimates from PROC REG are displayed in the order that you specify in the MODEL statement. However, if you use the SET statement in a DATA step, the variables appear in the original order unless you intentionally reorder the variables:

data B;
   set A(keep=x1-x6);
run;

Use the colon operator to specify a prefix

If you want to use variables that have a common prefix but have a variety of suffixes, you can use the colon operator (:), which is a wildcard character that matches any name that begins with a specified prefix. For example, the following DATA step creates a data set that contains 10 variables, including five variables that begin with the prefix 'Sales'. The subsequent DATA step drops the variables that begin with the prefix 'Sales':

data A;
retain Sales17 Y Sales16 Z SalesRegion Sales_new Sales1 R; /* 1 obs. Initialize to 0 */
run;
 
data B;
   set A(drop= Sales: ); /* drop all variables that begin with 'Sales' */
run;

Use a double-hyphen to specify consecutive variables

The previous sections used wildcard characters to match variables that had a specified type or prefix. In the previous sections, you will get the same set of variables regardless of how they might be ordered in the data set. You can use a double-hyphen (--) to specify a consecutive set of variables. The variables you get depend on the order of the variables in the data set.

data A;
   retain Y 0   x3 2   C1 'A'   C2 'BC'
          Z 3   W  4   C4 'D'   C5 'EF'; /* Initialize eight variables */
run;
data B;
   set A(keep=x3--C4);
run;

In this example, the data set B contains the variables x3, C1, C2, Z, W, and C4. If you use the double-hyphen to specify a list, be sure that you know the order of the variables and that this order is never going to change. If the order of the variables changes, your program will behave differently.

You can also specify all variables of a certain type within a range of variables. The syntax Y-numeric-Z specifies all numeric variables between Y and Z in the data set. The syntax Y-character-Z specifies all character variables between Y and Z. For example, the following call to PROC CONTENTS displays the variables (in order) in the Sashelp.Heart data. The call to PROC LOGISTIC specifies all the numeric variables between (and including) the AgeCHDdiag variable and the Smoking variable:

proc contents data=Sashelp.Heart order=varnum ;
run;
 
proc logistic data=Sashelp.Heart;
   model status = AgeCHDdiag-numeric-Smoking;
   ods select ParameterEstimates;
run;
[Figure: Use a double-hyphen to specify a contiguous list of variables in SAS]

Arrays and the OF operator

You can use variable lists to assign an array in a SAS DATA step. For example, the following program creates a numerical array named X and a character array named C. The program finds the maximum value in each row and puts that value into the variable named rowMaxNUm. The program also creates a variable named Str that contains the concatenation of the character values for each row:

data Arrays;
   set sashelp.Class;
   array X {*} _NUMERIC_;        /* X[1] is 1st var, X[2] is 2nd var, etc */
   array C {*} _CHARACTER_;      /* C[1] is 1st var, C[2] is 2nd var, etc */
   /* use the OF operator to pass values in array to functions */
   rowMaxNum = max(of x[*]);     /* find the max value in this array (row) */
   length Str $30;
   call catx(' ', Str, of C[*]); /* concatenate the strings in this array (row) */
   keep rowMaxNum Str;
run;
 
proc print data=Arrays(obs=4);
run;
[Figure: Use a keyword to specify a list of variables to certain SAS functions]

You can use the OF operator directly in functions without creating an array. For example, the following program uses the _ALL_ keyword to output the "complete cases" for the Sashelp.Heart data. The program drops any observation that has a missing value for any variable:

data CompleteCases;
  set Sashelp.Heart;
  if cmiss(of _ALL_)=0;  /* output only complete cases for all vars */
run;
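
The same idea works with the other keywords. For example, a small variation (my own sketch) keeps the observations that are complete for the numeric variables only:

data CompleteNumericCases;
  set Sashelp.Heart;
  if cmiss(of _NUMERIC_)=0;  /* complete cases for the numeric variables only */
run;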

Use macro variables to specify a list

The previous sections demonstrate how you can use syntax to specify a list of variables to SAS statements. In contrast, this section describes a technique rather than syntax. It is sometimes the case that the names of variables are in a column in a data set. There might be other columns in the data set that contain characteristics or statistics for the variables. For example, the following call to PROC MEANS creates an output data set (called MissingValues) that contains columns named Variable and NMiss.

proc means data=Sashelp.Heart nolabels NMISS stackodsoutput;
   var _NUMERIC_;
   ods output Summary = MissingValues;
run;
proc print; run;
Use a macro variable to specify a list of variables in SAS

Suppose you want to keep or drop those variables that have one or more missing values. The following PROC SQL call creates a macro variable (called MissingVarList) that contains a space-separated list of all variables that have at least one missing value. This technique has many applications and is very powerful.

/* Use PROC SQL to create a macro variable (MissingVarList) that contains
   the list of variables that have a property such as missing values */
proc sql noprint;                              
 select Variable into :MissingVarList separated by ' '
 from MissingValues
 where NMiss > 0;
quit;
%put &=MissingVarList;
MISSINGVARLIST=AgeCHDdiag Height Weight MRW Smoking AgeAtDeath Cholesterol

You can now use the macro variable in a KEEP, DROP, VAR, or MODEL statement, or in a KEEP= or DROP= data set option.
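
For example, the following sketch (my own) uses the macro variable in a DROP statement and in a VAR statement:

/* use the macro variable to drop or analyze the variables in the list */
data HeartNoMissingVars;
   set Sashelp.Heart;
   drop &MissingVarList;      /* drop the variables that have missing values */
run;
 
proc means data=Sashelp.Heart n nmiss;
   var &MissingVarList;       /* or analyze only those variables */
run;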

Summary

This article shows six ways to specify a list of variables to SAS statements and functions. The SAS syntax provides keywords (_NUMERIC_, _CHARACTER_, and _ALL_) and operators (hyphen, colon, and double-hyphen) to make it easy to specify a list of variables. You can use the syntax in conjunction with the OF operator to pass a variable list to some SAS functions. Lastly, if the names of variables are stored in a column in a data set, you can use the full power of PROC SQL to create a macro variable that contains variables that satisfy certain criteria.

Do you use shorthand syntax to specify lists of variables? Why or why not? Leave a comment.

The post 6 easy ways to specify a list of variables in SAS appeared first on The DO Loop.

May 21, 2018
 

In a recent blog post, Chris Hemedinger used a scatter plot to show the result of 100 coin tosses. Chris arranged the 100 results in a 10 x 10 grid, where the first 10 results were shown on the first row, the second 10 were shown on the second row, and so on. Placing items along each row before going to the next row is called row-major order.

An implicit formula for arranging items in rows

If you process items sequentially, it is easy to position the items in a grid by using an inductive scheme:

  1. Place the first item at (1, 1).
  2. Assume the nth item is placed at position (r, c). Place the (n+1)st item at position (r, c+1) if there is room on the current row; otherwise place it at (r+1, 1), which is the first element of the next row.

The inductive scheme is also called an implicit or recursive formula because the position of the (n+1)st item is given in terms of the position of the nth item.

For example, suppose that you have 70 items and you want to place 11 items in each row. The inductive algorithm looks like the following:

%let Nx = 11;           /* number of items in row */
data Loc;
label r = "Row" c = "Column";
retain r 1  c 1 item 1;
output;                 /* base case */
do item = 2 to 70;      /* inductive step */
   c + 1;
   if c > &Nx then do;
      r + 1; c = 1;
   end;
   output;
end;
run;
 
title "Position of Items in Grid";
proc sgplot data=Loc;
   text x=c y=r text=item / textattrs=(size=12) position=center strip;
   xaxis integer offsetmin=0.05 offsetmax=0.05;
   yaxis reverse offsetmin=0.05 offsetmax=0.05;
run;
Items arranged in a grid in row-major order with 11 items in each row

The inductive algorithm is easy to implement and to understand. However, it does not enable you to easily determine the row and column of the 1,234,567th item if there are 11 items in each row. Nor does it enable you to compute the positions when the index increments by a value greater than 1. To answer these questions, you need to use an explicit or direct formula.

An explicit formula for arranging items in rows

The explicit formula uses the MOD function to compute the column position and integer division to compute the row position. SAS does not have an explicit "integer division operator," but you can emulate it by using the FLOOR function. The following macro definitions encapsulate the formulas:

/* (row, col) for item n if there are Nx items in each row (count from 1),
   assuming row-major order */
%macro ColPos(n, Nx);
   1 + mod(&n.-1, &Nx.)
%mend;
%macro RowPos(n, Nx);
   1 + floor((&n.-1) / &Nx.)
%mend;

The formulas might look strange because they subtract 1, do a calculation, and then add 1. These formulas assume that you want to count the items, rows, and columns beginning with 1. If you prefer to count from 0, the formulas become MOD(n, Nx) and FLOOR(n / Nx).
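
For example, the following quick check (my own sketch) uses the macros to compute the position of the 1,234,567th item when there are 11 items in each row:

data _null_;
   n = 1234567;
   row = %RowPos(n, 11);      /* 1 + floor((n-1)/11) = 112234 */
   col = %ColPos(n, 11);      /* 1 + mod(n-1, 11)    = 4      */
   put row= col=;
run;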

You can use the formulas to directly compute the positions of the odd integers in the range 1–70 when there are 11 items in each row:

%let Nx = 11;
data grid;
do item = 1 to 70 by 2;       /* only odd integers */
   row = %RowPos(item, &Nx);
   col = %ColPos(item, &Nx);
   output;
end;
run;
 
title "Position of Odd Integers in Grid";
proc sgplot data=grid;
   text x=col y=row text=item / textattrs=(size=12) position=center strip;
   xaxis integer offsetmin=0.05 offsetmax=0.05 label="Column" max=&Nx;
   yaxis reverse offsetmin=0.05 offsetmax=0.05 integer label="Row";
run;
Positions of odd integers in a grid in row-major order

Of course, you can also use the direct formula to process items incrementally. The following DATA step computes the positions for 19 observations in the Sashelp.Class data set, where five names are placed in each row:

data gridName;
set sashelp.class;
y = %RowPos(_N_, 5);  /* 5 columns in each row */
x = %ColPos(_N_, 5);
run;
 
title "Position Five Names in Each Row";
proc sgplot data=gridName;
   text x=x y=y text=Name / textattrs=(size=12) position=center strip;
   xaxis integer offsetmin=0.08 offsetmax=0.08;
   yaxis reverse offsetmin=0.05 offsetmax=0.05 integer;
run;
Positions of items in a grid with 5 items in each row

The explicit formula is used in the SAS/IML NDX2SUB function, which tells you the row and column information for the nth item in a matrix.

In summary, you can use an implicit formula or an explicit formula to arrange items in rows, where each row contains Nx items. The implicit formula is useful when you are arranging the items sequentially. The explicit formula is ideal when you are randomly accessing the items and you need a direct computation that provides the row and column position.

Finally, if you want to arrange items in column-major order (down the first column, then down the second, ...), you can use similar formulas. The row position of the nth item is 1 + mod(n-1, Ny) and the column position is 1 + floor((n-1) / Ny), where Ny is the number of rows in the grid.
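
For reference, here is a minimal sketch of the column-major versions of the macros (the names are my own, not from the original program):

/* (row, col) for item n in column-major order, where Ny is the
   number of rows in the grid (count from 1) */
%macro RowPosCM(n, Ny);
   1 + mod(&n.-1, &Ny.)
%mend;
%macro ColPosCM(n, Ny);
   1 + floor((&n.-1) / &Ny.)
%mend;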

The post Position items in a grid appeared first on The DO Loop.

April 18, 2018
 

The sweep operator performs elementary row operations on a system of linear equations. The sweep operator enables you to build regression models by "sweeping in" or "sweeping out" particular rows of the X`X matrix. As you do so, the estimates for the regression coefficients, the error sum of squares, and the generalized inverse of the system are updated simultaneously. This is true not only for a single response variable but also for multiple responses.

You might remember elementary row operations from when you learned Gaussian elimination, which is a general technique for solving linear systems of equations. Sweeping a matrix is similar to Gaussian elimination, but the sweep operator is most often used to solve least squares problems for the normal equations X`X b = X`Y. Therefore, in practice, the sweep operator is usually applied to the rows of a symmetric positive definite system.

How to use the sweep operator to obtain least squares estimates

Before we discuss details of the sweep operator, let's show how it produces regression estimates. The following program uses the SWEEP function in SAS/IML on a six-observation data set from J. Goodnight (1979, p. 153). The first three columns of the matrix M contain the design matrix for the explanatory variables; the first column of M represents the intercept parameter. The last column of the augmented matrix is the response variable. The sweep operator is called on the uncorrected sum of squares and crossproducts matrix (the USSCP matrix), which is computed as M`*M. In block form, the USSCP matrix contains the submatrices X`X, X`Y, Y`X, and Y`Y. The following program sweeps the first three rows of the USSCP matrix:

proc iml;
/*Intercept X1  X2  Y */
M  = {  1   1   1   1,
        1   2   1   3,
        1   3   1   3,
        1   1  -1   2,
        1   2  -1   2,
        1   3  -1   1};
S = sweep(M`*M, 1:3);     /* sweep first three rows (Intercept, X1, X2) */
print S;
The sweep operator applied to all rows of a USSCP matrix to obtain regression coefficients, SSE, and generalized inverse

The resulting matrix (S) is divided into blocks by black lines. The leading 3x3 block is the (generalized) inverse of the X`X matrix. The first three elements of the last column (outlined in red) are the least squares estimates for the regression model. The lower right cell (outlined in blue) is the sum of squared errors (SSE) for the regression model.
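
You can also read these quantities directly from the blocks of S. The following sketch continues the PROC IML program above (the block sizes assume three parameters and one response variable):

p = 3;                     /* number of columns in the design matrix */
b    = S[1:p, p+1];        /* least squares estimates */
SSE  = S[p+1, p+1];        /* error sum of squares */
XpXi = S[1:p, 1:p];        /* generalized inverse of X`X */
print b SSE;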

You can compare the result with the results from PROC REG, as follows. The parameter estimates and error sum of squares are highlighted for easy comparison:

data Have;
input X1 X2 Y @@;
datalines;
1   1   1   2   1   3   3   1   3
1  -1   2   2  -1   2   3  -1   1
;
 
proc reg data=Have USSCP plots=none;
   model Y = X1 X2;
   ods select USSCP ANOVA ParameterEstimates;
run; quit;
Regression estimates and SSE for least squares analysis in SAS PROC REG

As claimed, the sweep operator produces the same parameter estimates and SSE statistic as PROC REG. The next sections discuss additional details of the sweep operator.

The basics of the sweep operator

Goodnight (1979) defines the sweep operator as the following sequence of row operations. Given a symmetric positive definite matrix A, SWEEP(A, k) modifies the matrix A by using the pivot element A[k,k] and the kth row, as follows:

  1. Let D = A[k,k] be the kth diagonal element.
  2. Divide the kth row by D.
  3. For every other row i ≠ k, let B = A[i,k] be the ith element of the kth column. Subtract B × (row k) from row i. Then set A[i,k] = –B/D.
  4. Set A[k,k] = 1/D.

You could program these steps in a matrix language such as SAS/IML, but as shown previously, the SAS/IML language supports the SWEEP function as a built-in function.
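
For completeness, here is a sketch of those four steps as a SAS/IML module. The module name and the final comparison are my own; in practice, you would simply call the built-in SWEEP function:

proc iml;
start MySweep(A, k);
   S = A;
   D = S[k,k];                     /* 1. the kth diagonal element */
   S[k, ] = S[k, ] / D;            /* 2. divide the kth row by D */
   do i = 1 to nrow(S);            /* 3. update every other row */
      if i ^= k then do;
         B = S[i,k];
         S[i, ] = S[i, ] - B*S[k, ];
         S[i,k] = -B / D;
      end;
   end;
   S[k,k] = 1/D;                   /* 4. replace the pivot by 1/D */
   return( S );
finish;
 
/* quick check against the built-in SWEEP function */
A = {4 2, 2 3};
diff = max(abs( MySweep(A, 1) - sweep(A, 1) ));
print diff;                        /* should be 0 */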

Sweeping in effects

The following program uses the same data as the first section. The USSCP matrix (S0) represents the normal equations (plus a little extra information). If you sweep the first row of the S0 matrix, you solve the regression problem for an intercept-only model:

proc iml;
/*Intercept X1  X2  Y */
M  = {  1   1   1   1,    1   2   1   3,    1   3   1   3,
        1   1  -1   2,    1   2  -1   2,    1   3  -1   1};
S0 = M`*M;          print S0;   /* USSCP matrix */
 
/* sweep in 1st row (Intercept) */
S1 = sweep(S0, 1);   print S1[F=BEST6.];
The sweep operator applied to the first row of a USSCP matrix to obtain an intercept-only model

The first element of the fourth column (S1[1,4], outlined in red) is the parameter estimate for the intercept-only model. The last element (S1[4,4], outlined in blue) is the sum of squared errors (SSE) for the intercept-only model.

If you "sweep in" the second row, you solve the least squares problem for a model that includes the intercept and X1. The corresponding elements of the last column contain the parameter estimates for the model. The lower-right element again contains the SSE for the revised model. If you proceed to "sweep in" the third row, the last column again contains the parameter estimates and the SSE, this time for the model that includes the intercept, X1, and X2:

/* sweep in 2nd row (X1) and 3rd row (X2) */
S2 = sweep(S1, 2);   print S2[F=BEST6.]; 
S3 = sweep(S2, 3);   print S3[F=BEST6.];
The sweep operator applied to subsequent rows of a USSCP matrix to 'sweep in' additional effects

Sweeping out effects

One of the useful features of the sweep operators is that you can remove an effect from a model as easily as you can add it. The sweep operator has a reversibility property: if you sweep a row a second time you "undo" the first sweep. In symbols, the operator has the property that SWEEP( SWEEP(A,k), k) = A for any row k. For example, if you want to take the X1 variable out of the model, merely sweep on the second row again:

S4 = sweep(S3, 2);   print S4[F=BEST6.]; /* model with Intercept + X2 */
The sweep operator applied twice 'sweeps out' an effect from a regression model

Of course, as shown in an earlier section, you do not need to sweep in one row at a time. You can sweep in multiple rows with one call by specifying a vector of indices. In fact, the sweep operator also commutes with itself, which means that you can specify the rows in any order. The call SWEEP(S0, 1:3) is equivalent to SWEEP(S0, {3 1 2}) or SWEEP(S0, {2 3 1}) or any other permutation of {1 2 3}.
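
A quick numerical check of both properties (a sketch of my own that continues the same PROC IML session):

maxDiffRev = max(abs( sweep(sweep(S0, 2), 2) - S0 ));         /* reversibility */
maxDiffCom = max(abs( sweep(S0, {2 3 1}) - sweep(S0, 1:3) )); /* commutativity */
print maxDiffRev maxDiffCom;       /* both should be 0 (up to rounding) */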

Summary

The SWEEP operator (which is implemented in the SWEEP function in SAS/IML) enables you to construct least squares model estimates from the USSCP matrix, which is a block matrix that contains the normal equations. You can "sweep in" rows to add effects to a model or "sweep out" rows to remove effects. After each operation, the parameter estimates appear in the portion of the adjusted USSCP matrix that corresponds to the block for X`Y. Furthermore, the residual sum of squares appears in the portion of the adjusted USSCP matrix that corresponds to Y`Y.

Further reading

The sweep operator is known by different names in different fields and has been rediscovered several times. In statistics, the operator was popularized in a TAS article by J. Goodnight (1979) who mentions Ralston (1960) and Beaton (1964) as early researchers. Outside of statistics, the operator is known as the Principal Pivot Transform (Tucker, 1960), gyration (Duffy, Hazony, and Morrison, 1966), or exchange (Stewart and Stewart, 1998). Tsatsomeros (2000) provides an excellent review of the literature and the history of the sweep operator.

The post The sweep operator: A fundamental operation in regression appeared first on The DO Loop.