Last week I showed how to create dummy variables in SAS by using the GLMMOD procedure. The procedure enables you to create design matrices that encode continuous variables, categorical variables, and their interactions. You can use dummy variables to replace categorical variables in procedures that do not support a CLASS statement. You can use other procedures to create design matrices for other parameterizations.

SAS/IML programmers can use two built-in functions to create dummy variables. The DESIGN function generates dummy variables for the GLM parameterization. The DESIGNF function generates dummy variables for the EFFECT encoding. You can use the HDIR function to create interaction effects from the main-effect dummy variables.

The following DATA step creates sample data. The PROC IML statements read the data into vectors and use the DESIGN and DESIGNF function to create dummy variables. Note the use of the ODS LAYOUT GRIDDED statement to print SAS/IML matrices across the page.

data Patients; keep Cholesterol Sex BP_Status; set sashelp.heart; if 18 <= _N_ <= 27; run; proc iml; use Patients; read all var {Cholesterol Sex BP_Status}; close Patients; Dummy_GLM = design( BP_Status ); /* dummy vars, GLM encoding */ Dummy_Effect = designf( BP_Status ); /* dummy vars, EFFECT encoding */ ods layout gridded columns=3 advance=table; /* create gridded layout in HTML */ print BP_Status, Dummy_GLM, Dummy_Effect; ods layout end; |

You can see that the DESIGN matrix creates *k* binary dummy variables for a categorical variable that contains *k* levels. The first column represents the first level (in alphabetical order), which for this data is "High." The first column has the value 1 for each row for which BP_Status="High." Similarly, the second column contains a 1 for each row for which BP_Status="Normal."
The third column contains a 1 for each row for which BP_Status="Optimal."

In contrast, the DESIGNF creates a design matrix that has *k*–1 columns. The matrix has the EFFECT encoding, with the last category ("Optimal") serving as the reference level.
The first column has the value 1 for rows for which BP_Status="High," the value –1 for rows for which BP_Status is the reference level, and 0 otherwise. The second column is similar, except that 1 indicates rows for which BP_Status="Normal."

### Linear regression with dummy variables

Dummy variables convert character variables (and other categorical variables) into numerical variables with a specified encoding. As such they enable you to use matrix computations to perform a statistical analysis such as linear regression.

For example, the following SAS/IML statements perform a regression analysis that models the Cholesterol variable as a linear function of the Sex and BP_Status variables. The statements use the DESIGNF statement to form the dummy variables for each categorical variable. These columns (and an intercept column) are concatenated horizontally to form the design matrix. Because the DESIGNF statement is a nonsingular parameterization, you can use the SOLVE function to solve the normal equations and obtain the least squares solution, as follows:

Y = Cholesterol; /* response variable */ Intercept = j(nrow(Y), 1, 1); X1 = designf( Sex ); X2 = designf( BP_Status ); X = Intercept || X1 || X2; /* design matrix with EFFECT parameterization */ /* Matrix formulation of one-way ANOVA (cell means model): Y = X*beta + epsilon See https://en.wikipedia.org/wiki/Design_matrix */ b = solve( X`*X, X`*Y ); /* solve normal equations */ print b[rowname={"Intercept" "Sex:Female" "BP_Status:High" "BP_Status:Normal"}]; |

The interpretation of the parameter estimates for this linear example is somewhat complicated; see Lewis (2007) if you are interested. However, for comparison, the following call to PROC GENMOD creates parameter estimates for the same linear model. The PARAM=EFFECT option is used so that the procedure uses the EFFECT parameterization.

proc genmod data=Patients; class Sex BP_Status / param=effect; model Cholesterol = Sex BP_Status / noscale; ods select ParameterEstimates; run; |

Strictly speaking, PROC GENMOD uses maximum likelihood estimation whereas the PROC IML code is a least squares estimate, but you can see that the estimates are identical to four decimal places.

### REFERENCE encoding and the GLM parameter estimates

Although SAS/IML does not provide a built-in function for generating a design matrix that uses the REFERENCE encoding, you can easily create such a function. The REFERENCE encoding is similar to the GLM encoding, but with the (redundant) last column dropped:

/* design matrix for reference encoding */ start designr(x); A = design(x); /* get design matrix with GLM encoding */ return( A[,1:ncol(A)-1] ); /* drop last column */ finish; |

If you use the REFERENCE encoding to create the X matrix as in the previous section, then the SOLVE function returns the same parameter estimates that are provided by the GLM procedure. (The GLM procedure sets the parameters for the last dummy columns to zero.)

### Interactions of dummy variables

You can use the HDIR function to create interaction effects. For example, the following statements create columns that indicate the interaction between the Sex and BP_Status variables. The printed output shows the results for the EFFECT parameterization, but the same SAS/IML statement will produce the interaction effects for other parameterizations:

X1X2 = hdir(X1, X2); /* dummy variables for interaction term */ print X1X2[c={"Female High" "Female Normal"}]; |

By using the tips in this article, you can create design matrices for ANOVA and regression models that contain categorical variables. In this way, you can use SAS/IML to reproduce the parameter estimates in many SAS linear regression procedures.

The post Dummy variables in SAS/IML appeared first on The DO Loop.