rick wicklin

4月 242017
 

There are several ways to visualize data in a two-way ANOVA model. Most visualizations show a statistical summary of the response variable for each category. However, for small data sets, it can be useful to overlay the raw data. This article shows a simple trick that you can use to combine two categorical variables and plot the raw data for the joint levels of the two categorical variables.

An ANOVA for two-way interactions

Recall that an ANOVA (ANalysis Of VAriance) model is used to understand differences among group means and the variation among and between groups. The documentation for the ROBUSTREG procedure in SAS/STAT contains an example that compares the traditional ANOVA using PROC GLM with a robust ANOVA that uses PROC ROBUSTREG. The response variable is the survival time (Time) for 16 mice who were randomly assigned to different combinations of two successive treatments (T1, T2). (Higher times are better.) The data are shown below:

data recover;
input  T1 $ T2 $ Time @@;
datalines;
0 0 20.2  0 0 23.9  0 0 21.9  0 0 42.4
1 0 27.2  1 0 34.0  1 0 27.4  1 0 28.5
0 1 25.9  0 1 34.5  0 1 25.1  0 1 34.2
1 1 35.0  1 1 33.9  1 1 38.3  1 1 39.9
;

The response variable depends on the joint levels of the binary variables T1 and T2. A first attempt to visualize the data in SAS might be to create a box plot of the four combinations of T1 and T2. You can do this by assigning T1 to be the "category" variable and T2 to be a "group" variable in a clustered box plot, as follows:

title "Response for Two Groups";
title2 "Use VBOX Statement with Categories and Groups";
proc sgplot data=recover;
   vbox Time / category=T1 group=T2;
run;
Box plots for a binary 'category' variable and a binary 'group' variable

The graph shows the distribution of response for the four joint combinations of T1 and T2. The graph is a little hard to interpret because the category levels are 0/1. The two box plots on the left are for T1=0, which means "Did not receive the T1 treatment." The two box plots on the right are for mice who received the T1 treatment. Within those clusters, the blue boxes indicate the distribution of responses for the mice who did not receive the T2 treatment, whereas the red boxes indicate the response distribution for mice that did receive T2. Both treatments seem to increase the mean survival time for mice, and receiving both treatments seems to give the highest survival times.

Interpreting the graph took a little thought. Also, the colors seem somewhat arbitrary. I think the graph could be improved if the category labels indicate the joint levels. In other words, I'd prefer to see a box plot of the levels of interaction variable T1*T2. If possible, I'd also like to optionally plot the raw response values.

Method 1: Use the EFFECTPLOT statement

The LOGISTIC and GENMOD procedures in SAS/STAT support the EFFECTPLOT statement. Many other SAS regression procedures support the STORE statement, which enables you to save a regression model and then use the PLM procedure (which supports the EFFECTPLOT statement). The EFFECTPLOT statement can create a variety of plots for visualizing regression models, including a box plot of the joint levels for two categorical variables, as shown by the following statements:

/* Use the EFFECTPLOT statement in PROC GENMOD, or use the STORE statement and PROC PLM */
proc genmod data=recover;
   class T1 T2;
   model Time = T1 T2 T1*T2;
   effectplot box / cluster;
   effectplot interaction /  obs(jitter);  /* or use interaction plot to see raw data */
run;
Box plots of joint levels created by the EFFECTPLOT statement in SAS

The resulting graph uses box plots to show the schematic distribution of each of the joint levels of the two categorical variables. (The second EFFECTPLOT statement creates an "interaction plot" that shows the raw values and mean responses.) The means of each group are connected, which makes it easier to compare adjacent means. The labels indicate the levels of the T1*T2 interaction variable. I think this graph is an improvement over the previous multi-colored box plot, and I find it easier to read and interpret.

Although the EFFECTPLOT statement makes it easy to create this plot, the EFFECTPLOT statement does not support overlaying raw values on the box plots. (You can, however, see the raw values on the "interaction plot".) The next section shows an alternative way to create the box plots.

Method 2: Concatenate values to form joint levels of categories

You can explicitly form the interaction variable (T1*T2) by using the CATX function to concatenate the T1 and T2 variables, as shown in the following DATA step view. Because the levels are binary-encoded, the resulting levels are '0 0', '0 1', '1 0', and '1 1'. You can define a SAS format to make the joint levels more readable. You can then display the box plots for the interaction variable and, optionally, overlay the raw values:

data recover2 / view=recover2;
length Treatment $3;          /* specify length of concatenated variable */
set recover;
Treatment = catx(' ',T1,T2);  /* combine into one group */
run;
 
proc format;                  /* make the joint levels more readable */
  value $ TreatFmt '0 0' = 'Control'
                   '1 0' = 'T1 Only'
                   '0 1' = 'T2 Only'
                   '1 1' = 'T1 and T2';
run;
 
proc sgplot data=recover2 noautolegend;
   format Treatment $TreatFmt.;
   vbox Time / category=Treatment;
   scatter x=Treatment y=Time / jitter markerattrs=(symbol=CircleFilled size=10);
   xaxis discreteorder=data;
run;
Distribution of response variable in two-way ANOVA: box plots and raw data overlaid

By manually concatenating the two categorical variables to form a new interaction variable, you have complete control over the plot. You can also overlay the raw data, as shown. The raw data indicates that the "Control" group seems to contain an outlier: a mouse who lived longer than would be expected for his treatment. Using PROC ROBUSTREG to compute a robust ANOVA is one way to deal with extreme outliers in the ANOVA setting.

In summary, the EFFECTPLOT statement enables you to quickly create box plots that show the response distribution for joint levels of two categorical variables. However, sometimes you might want more control, such as the ability to format the labels or overlay the raw data. This article shows how to use the CATX function to manually create a new variable that contains the joint categories.

The post Visualize an ANOVA with two-way interactions appeared first on The DO Loop.

4月 242017
 

There are several ways to visualize data in a two-way ANOVA model. Most visualizations show a statistical summary of the response variable for each category. However, for small data sets, it can be useful to overlay the raw data. This article shows a simple trick that you can use to combine two categorical variables and plot the raw data for the joint levels of the two categorical variables.

An ANOVA for two-way interactions

Recall that an ANOVA (ANalysis Of VAriance) model is used to understand differences among group means and the variation among and between groups. The documentation for the ROBUSTREG procedure in SAS/STAT contains an example that compares the traditional ANOVA using PROC GLM with a robust ANOVA that uses PROC ROBUSTREG. The response variable is the survival time (Time) for 16 mice who were randomly assigned to different combinations of two successive treatments (T1, T2). (Higher times are better.) The data are shown below:

data recover;
input  T1 $ T2 $ Time @@;
datalines;
0 0 20.2  0 0 23.9  0 0 21.9  0 0 42.4
1 0 27.2  1 0 34.0  1 0 27.4  1 0 28.5
0 1 25.9  0 1 34.5  0 1 25.1  0 1 34.2
1 1 35.0  1 1 33.9  1 1 38.3  1 1 39.9
;

The response variable depends on the joint levels of the binary variables T1 and T2. A first attempt to visualize the data in SAS might be to create a box plot of the four combinations of T1 and T2. You can do this by assigning T1 to be the "category" variable and T2 to be a "group" variable in a clustered box plot, as follows:

title "Response for Two Groups";
title2 "Use VBOX Statement with Categories and Groups";
proc sgplot data=recover;
   vbox Time / category=T1 group=T2;
run;
Box plots for a binary 'category' variable and a binary 'group' variable

The graph shows the distribution of response for the four joint combinations of T1 and T2. The graph is a little hard to interpret because the category levels are 0/1. The two box plots on the left are for T1=0, which means "Did not receive the T1 treatment." The two box plots on the right are for mice who received the T1 treatment. Within those clusters, the blue boxes indicate the distribution of responses for the mice who did not receive the T2 treatment, whereas the red boxes indicate the response distribution for mice that did receive T2. Both treatments seem to increase the mean survival time for mice, and receiving both treatments seems to give the highest survival times.

Interpreting the graph took a little thought. Also, the colors seem somewhat arbitrary. I think the graph could be improved if the category labels indicate the joint levels. In other words, I'd prefer to see a box plot of the levels of interaction variable T1*T2. If possible, I'd also like to optionally plot the raw response values.

Method 1: Use the EFFECTPLOT statement

The LOGISTIC and GENMOD procedures in SAS/STAT support the EFFECTPLOT statement. Many other SAS regression procedures support the STORE statement, which enables you to save a regression model and then use the PLM procedure (which supports the EFFECTPLOT statement). The EFFECTPLOT statement can create a variety of plots for visualizing regression models, including a box plot of the joint levels for two categorical variables, as shown by the following statements:

/* Use the EFFECTPLOT statement in PROC GENMOD, or use the STORE statement and PROC PLM */
proc genmod data=recover;
   class T1 T2;
   model Time = T1 T2 T1*T2;
   effectplot box / cluster;
   effectplot interaction /  obs(jitter);  /* or use interaction plot to see raw data */
run;
Box plots of joint levels created by the EFFECTPLOT statement in SAS

The resulting graph uses box plots to show the schematic distribution of each of the joint levels of the two categorical variables. (The second EFFECTPLOT statement creates an "interaction plot" that shows the raw values and mean responses.) The means of each group are connected, which makes it easier to compare adjacent means. The labels indicate the levels of the T1*T2 interaction variable. I think this graph is an improvement over the previous multi-colored box plot, and I find it easier to read and interpret.

Although the EFFECTPLOT statement makes it easy to create this plot, the EFFECTPLOT statement does not support overlaying raw values on the box plots. (You can, however, see the raw values on the "interaction plot".) The next section shows an alternative way to create the box plots.

Method 2: Concatenate values to form joint levels of categories

You can explicitly form the interaction variable (T1*T2) by using the CATX function to concatenate the T1 and T2 variables, as shown in the following DATA step view. Because the levels are binary-encoded, the resulting levels are '0 0', '0 1', '1 0', and '1 1'. You can define a SAS format to make the joint levels more readable. You can then display the box plots for the interaction variable and, optionally, overlay the raw values:

data recover2 / view=recover2;
length Treatment $3;          /* specify length of concatenated variable */
set recover;
Treatment = catx(' ',T1,T2);  /* combine into one group */
run;
 
proc format;                  /* make the joint levels more readable */
  value $ TreatFmt '0 0' = 'Control'
                   '1 0' = 'T1 Only'
                   '0 1' = 'T2 Only'
                   '1 1' = 'T1 and T2';
run;
 
proc sgplot data=recover2 noautolegend;
   format Treatment $TreatFmt.;
   vbox Time / category=Treatment;
   scatter x=Treatment y=Time / jitter markerattrs=(symbol=CircleFilled size=10);
   xaxis discreteorder=data;
run;
Distribution of response variable in two-way ANOVA: box plots and raw data overlaid

By manually concatenating the two categorical variables to form a new interaction variable, you have complete control over the plot. You can also overlay the raw data, as shown. The raw data indicates that the "Control" group seems to contain an outlier: a mouse who lived longer than would be expected for his treatment. Using PROC ROBUSTREG to compute a robust ANOVA is one way to deal with extreme outliers in the ANOVA setting.

In summary, the EFFECTPLOT statement enables you to quickly create box plots that show the response distribution for joint levels of two categorical variables. However, sometimes you might want more control, such as the ability to format the labels or overlay the raw data. This article shows how to use the CATX function to manually create a new variable that contains the joint categories.

The post Visualize an ANOVA with two-way interactions appeared first on The DO Loop.

4月 192017
 

Restricted cubic splines are a powerful technique for modeling nonlinear relationships by using linear regression models. I have attended multiple SAS Global Forum presentations that show how to use restricted cubic splines in SAS regression procedures. However, the presenters have all used the %RCSPLINE macro (Frank Harrell, 1988) to generate a SAS data set that contains new variables for the spline basis functions. They then use those basis variables in a SAS regression procedure.

Since SAS 9.3, many SAS regression procedures provide a native implementation of restricted cubic splines by using the EFFECT statement in SAS. This article provides examples of using splines in regression models. Because some older procedures (such as PROC REG) do not support the EFFECT statement, the article also shows how to use SAS procedures to generate the spline basis, just like the %RCSPLINE macro does.

If you are not familiar with splines and knots, read the overview article "Understanding splines in the EFFECT statement." You can also read the documentation for the EFFECT statement, which includes an overview of spline effects as well as a brief description of restricted cubic splines, which are also called "natural cubic splines." I think the fact that the SAS documentation refers to the restricted cubic splines as "natural cubic splines" has prevented some practitioners from realizing that SAS supports restricted cubic splines.

Regression with restricted cubic splines in SAS

This section provides an example of using splines in PROC GLMSELECT to fit a GLM regression model. Because the functionality is contained in the EFFECT statement, the syntax is the same for other procedures. For example, if you have a binary response you can use the EFFECT statement in PROC LOGISTIC. If you a fitting a generalized linear model or a mixed model, you can use PROC GLMMIX.

This example fits the MPG_CITY variable as a function of the WEIGHT variable for vehicles in the Sashelp.Cars data set. The data and a scatter plot smoother are shown in the adjacent graph. (To produce the graph, the following statements sort the data, but that is not required.) The smoother is based on a restricted spline basis with five knots. Notice that the SELECTION=NONE option in the MODEL statement turns off the variable selection features of PROC GLMSELECT, which makes the procedure fit one model just like PROC GLM.

/* create sample data; sort by independent variable for graphing */
proc sort data=sashelp.cars   
          out=cars(keep=mpg_city weight model);
by weight;
run;
 
/* Fit data by using restricted cubic splines.
   The EFFECT statement is supported by many procedures:
   GLIMMIX, GLMSELECT, LOGISTIC, PHREG, PLS, QUANTREG, ROBUSTREG, ... */
ods select ANOVA ParameterEstimates SplineKnots;
proc glmselect data=cars;
  effect spl = spline(weight / details naturalcubic basis=tpf(noint)                 
                               knotmethod=percentiles(5) );
   model mpg_city = spl / selection=none;         /* fit model by using spline effects */
   output out=SplineOut predicted=Fit;            /* output predicted values for graphing */
quit;
 
title "Restricted Cubic Spline Regression";
proc sgplot data=SplineOut noautolegend;
   scatter x=weight y=mpg_city;
   series x=weight y=Fit / lineattrs=(thickness=3 color=red);
run;

The EFFECT statement with the SPLINE option is used to generate spline effects from the WEIGHT variable. The effects are assigned the collective name 'spl'. Putting 'spl' on the MODEL statement says "use the spline effects that I created." You can specify multiple EFFECT statements. Spline effects depend on three quantities: the kind of spline, the number of knots, and the placement of the knots.

  • You specify restricted cubic splines by using the NATURALCUBIC BASIS=TPF(NOINT) options. Technically you do not need the NOINT suboption, but I recommend it because it makes the parameter estimates table simpler.
  • The KNOTMETHOD= option enables you to specify the number and placement of knots. In this example, PERCENTILES(5) places knots at five evenly spaced percentiles of the explanatory variable, which are the 16.6%, 33.3%, 50%, 66.6%, and 83.3% percentiles. Equivalently, the knots are placed at the 1/6, 2/6, 3/6, 4/6, and 5/6 quantiles.

The number and placement of knots for splines

Most researchers use a small number of knots, often three to five. The exact location of knots is not usually critical: if you change the positions slightly the predicted values of the regression change only slightly. A common scheme is to place the knots at fixed percentiles of the data, as shown above. Alternatively, Harrell (Regression Modeling Strategies, 2010 and 2015) suggests heuristic percentiles as shown below, and this scheme seems to be popular among biostatisticians.

Harrell's table of knot placement for restricted cubic splines

The KNOTMETHOD= option on the EFFECT statement provides several options for specifying the locations of knots. Try each of the following options and look at the "SplineKnots" table to see the positions of the knots:

  • KNOTMETHOD=PERCENTILES(5): Places knots at the percentiles 1/6, 2/6, ..., 5/6. An example is shown above.
  • KNOTMETHOD=EQUAL(5): Places knots evenly within the range of the data. If δ = (max-min)/6, then the knots are located at min + i δ for i=1, 2, ..., 5.
  • KNOTMETHOD=RANGEFRACTIONS(0.05 0.275 0.50 0.725 0.95): If you want knots placed unevenly with the range of the data, use this option. For example, the value 0.05 means "place a knot 5% along the data range" and 0.272 means "place a knot 27.5% along the data range." You can separate list values by using spaces or commas.
  • KNOTMETHOD=LIST(2513, 3174, 3474.5, 3903, 5000): This option enables you to list the locations (in data coordinates) of the knots. You can use this option to specify Harrell's schemes. For example, Harrel suggests the 5th, 27.5th, 50th, 72.5th, and 95th percentiles when you use five knots. You can use PROC UNIVARIATE to find these percentiles for the WEIGHT variable and then type the results into the KNOTMETHOD=LIST option.
Comparison of several knot-placement schemes for restricted cubic spline regression in SAS

The adjacent graph shows the predicted values for the four different knot placement methods. (Click to enlarge.) You can see that the general shape of the regression curves is similar regardless of the placement of knots. The differences can be understood by looking at the placement of the first and last knots. The slope of the natural cubic spline fit is linear past the extreme knots, so the placement of the extreme knots dictate where the predictions become linear. For the EQUAL and RANGEFRACTION methods, the largest knots are placed at WEIGHT=6300 and WEIGHT=6923, respectively. Consequently, the predictions for large values of WEIGHT look more cubic than linear. In contrast, the largest knot for the PERCENTILES method is placed at 4175 and the largest LIST knot is 5000. The predictions past those values are linear.

Automate Harrell's knot placement suggestions

In the previous section, I manually typed in the values that correspond to the 5th, 27.5th, 50th, 72.5th, and 95th percentiles of the WEIGHT variable, as suggested by Harrell's scheme. However, it is not difficult to automate that step. You can use PROC UNIVARIATE to write the percentile values to a SAS data set and then use a short DATA _NULL_ step to store those values into a macro variable. The macro variable can then be used as the argument to the KNOTMETHOD=LIST option, as follows:

proc univariate data=cars noprint;
   var weight;
   output out=percentiles pctlpts=5 27.5 50 72.5 95 pctlpre=p_; / * specify the percentiles */
run;
 
data _null_;
set percentiles;
call symput('pctls',catx(',', of _numeric_));   /* put all values into a comma-separated list */
run;
 
%put &=pctls;                 /* optional: display the list of values */
...
   effect spl = spline(weight / ... knotmethod=list(&pctls) ); /* use the macro variable here */
...
PCTLS=2513,3174,3474.5,3903,5000

Write the spline basis functions to a SAS data set

As mentioned eaerlier, not every SAS procedure supports the EFFECT statement. However, the GLMSELECT, LOGISTIC, and GLIMMIX procedures all provide an OUTDESIGN= option, which enables you to output the design matrix that contains the spline basis functions. Furthermore, PROC LOGISTIC supports the OUTDESIGNONLY option and PROC GLIMMIX supports the NOFIT option so that the procedures do not fit the model but only output the design matrix.

You can use the variables in the design matrix to perform regression or other analyses. For example, the following example writes the restricted cubic basis functions to a data set, then uses PROC REG to fit the model:

/* create SplineBasis = data set that contains spline basis functions */
proc glmselect data=cars outdesign(addinputvars fullmodel)=SplineBasis;
   effect spl = spline(weight / naturalcubic basis=tpf(noint) knotmethod=percentiles(5));
   model mpg_city = spl / selection=none;  
quit;
 
/* use design variables in other procedures */
proc reg data=SplineBasis;
   model mpg_city = spl_:;
   output out=out p=p;
run;

One last comment: the basis functions that are generated by the EFFECT statement are not equal to the basis functions created by the %RCSPLINE macro, but they are equivalent. The EFFECT statement uses the definition from The Elements of Statistical Learning (Hastie, Tibshirani, and Friedman, 2009, 2nd Ed, pp. 144-146). The %RCSPLINE macro implements scaled versions of the basis function. Thus parameter estimates will be different but the predicted values will be the same.

Summary

You can use the NATURALCUBIC BASIS=TPF(NOINT) option in the EFFECT statement in SAS to perform regression with restricted cubic splines, which are also called natural cubic splines. You can use the KNOTMETHOD= option to specify the number and placement of the knots. The EFFECT statement is supported by the GLMSELECT, LOGISTIC, and GLIMMIX procedures, among others. These procedures enable you to output a design matrix that contains the spline basis functions, which you can use in procedures that do not support the EFFECT statement.

This article was inspired by several talks that I heard at SAS Global Forum. For more information on this topic, see the following:

The post Regression with restricted cubic splines in SAS appeared first on The DO Loop.

4月 172017
 
Illustration of the 68-95-99.7 rule

A reader commented on last week's article about constructing symmetric intervals. He wanted to know if I created it in SAS.

Yes, the graph, which illustrates the so-called 68-95-99.7 rule for the normal distribution, was created by using several statements in the SGPLOT procedure in Base SAS

  • The SERIES statement creates the bell-shaped curve.
  • The BAND statement creates the shaded region under the curve.
  • The DROPLINE statement creates the vertical lines from the curve to the X axis.
  • The HIGHLOW statement creates the horizontal lines that indicate interval widths.
  • The TEXT statement creates the text labels "68%", "95%", and "99.7%".

If you follow some simple rules, it is easy to use PROC SGPLOT the overlay multiple curves and lines on a graph. The key is to organize the underlying data into a block form, as shown conceptually in the plot to the right. I use different variable names for each component of the plot, and I set the values of the other variables to missing when they are no longer relevant. Each statement uses only the variables that are relevant for that overlay.

The following DATA step creates the data for the 68-95-99.7 graph. Can you match up each section with the SGPLOT statements that overlay various components to create the final image?

data NormalPDF;
mu = 50; sigma = 8;     /* parameters for the normal distribution N(mu, sigma) */
/* 1. Data for the SERIES and BAND statements */
do m = -4 to 4 by 0.05;                /* x in [mu-4*sigma, mu+4*sigma] */
   x = mu + m*sigma;
   f = pdf("Normal", x, mu, sigma);    /* height of normal curve */
   output;
end;
x=.; f=.;
 
/* 2. Data for vertical lines at mu + m *sigma, m=-3, -2, -1, 1, 2, 3 */
do m =-3 to 3;
   if m=0 then continue;               /* skip m=0 */
   Lx = mu + m*sigma;                  /* horiz location of segment */
   Lf = pdf("Normal", Lx, mu, sigma);  /* vertical height of segment */
   output;
end;
LX = .; Lf = .;
 
/* 3. Data for horizontal lines. Heights are 1.1, 1.2, and  1.3 times the max height of the curve */
Tx = mu;                               /* text centered at mu */
fMax = pdf("Normal", mu, mu, sigma);   /* highest point of curve */
Text = "68%  ";                        /* 68% interval */
TL = mu - sigma;  TR = mu + sigma;     /* Left/Right endpoints of interval */
Ty = 1.1 * fMax;                       /* height of label and segment */
output;
 
Text = "95%  ";                        /* 95% interval */
TL = mu - 2*sigma;  TR = mu + 2*sigma; /* Left/Right endpoints of interval */
Ty = 1.2 * fMax;                       /* height of label and segment */
output;
 
Text = "99.7%";                        /* 99.7% interval */
TL = mu - 3*sigma;  TR = mu + 3*sigma; /* Left/Right endpoints of interval */
Ty = 1.3 * fMax;                       /* height of label and segment */
output;
 
keep x f Lx Lf Tx Ty TL TR Text;
run;
 
proc sgplot data=NormalPDF noautolegend;
   band     x=x upper=f lower=0;
   series   x=x y=f             / lineattrs=(color=black);
   dropline x=Lx y=Lf           / dropto=x lineattrs=(color=black);
   highlow  y=Ty low=TL high=TR / lowcap=serif highcap=serif lineattrs=(thickness=2);
   text     x=Tx y=Ty text=Text / backfill fillattrs=(color=white) textattrs=(size=14); 
   yaxis offsetmin=0 min=0 label="Density";
   xaxis values=(20 to 80 by 10) display=(nolabel);
run;

The post Visualize the 68-95-97.5 rule in SAS appeared first on The DO Loop.

4月 122017
 

At SAS Global Forum last week, I saw a poster that used SAS/IML to optimized a quadratic objective function that arises in financial portfolio management (Xia, Eberhardt, and Kastin, 2017). The authors used the Newton-Raphson optimizer (NLPNRA routine) in SAS/IML to optimize a hypothetical portfolio of assets.

The Newton-Raphson algorithm is one of my favorite optimizers. However, for a quadratic objective function there is a simpler choice. You can use the NLPQUA function in SAS/IML to optimize a quadratic objective function.

Optimal value of quadratic function of two variables

Optimize an unconstrainted quadratic objective function in SAS/IML

Let's see how this works by specifying a quadratic polynomial in two variables. The function is
f(x,y) = (9*x##2 + x#y + 4*y##2) - 12*x - 4*y + 6
       = 0.5*x` H x + L x + 6
where H = {18  1, 1  8} is a 2x2 symmetric matrix, L = {-12  -4} is a row vector, and x = {x, y} is a column vector.

To use the NLPQUA subroutine, you need to write down the Hessian matrix, H, which is the symmetric matrix of second derivatives. Of course, for a quadratic function the second derivatives are constants. In the following SAS/IML program, the matrix H contains the second derivatives and the vector lin contains the coefficients for the linear and constant terms of f:

/* objective function is
   f(x,y) = (3*x-2)##2 + (2*y-1)##2 + x#y + 1
          = (9*x##2  + x#y + 4*y##2)  + -12*x  - 4*y + 6
   grad(f) = g(x,y) = [ 18*x + y - 12,  x + 8*y - 4 ]
   hess(f)  = [ dg1/dx  dg2/dx ]  =  [ 18  1 ]
              [ dg1/dy  dg2/dy ]     [  1  8 ]
*/
proc iml;
H = {18 1,             /* matrix of second derivatives */
      1 8};
lin = { -12 -4 6 };    /* vector of linear terms and the constant term */

The NLPQUA subroutine enables you to specify the Hessian matrix and the linear coefficients. From those values, the routine uses an efficient solver to find the global minimum of the quadratic polynomial, which occurs at x=92/143, y=60/143:

x0 = j(1, 2, 1);       /* initial guess */
opt = {0 1};           /* minimize, print final parameters */
call nlpqua(rc, xOpt, H, x0, opt) lin=lin;
print xOpt; 
print xOpt[format=Fract.];
Optimal values of a quadratic function

Solve a constrained quadratic program in SAS/IML

You can also use the NLPQUA subroutine to solve a constrained problem. Suppose that you are only interested in values of (x,y) in the unit square and you require that the solution satisfies the linear constraint x + y = 1. In that case you can define a matrix that specifies the linear constraints as follows:

blc = {0 0 . .,   /* lower bound: 0 <= x_i        */
       1 1 . .,   /* upper bound:      x_i <= 1   */
       1 1 0 1};  /* linear constraint x +  y = 1 */
call nlpqua(rc, conOpt, H, x0, opt, blc) lin=lin;
print conOpt[format=Fract.];

Notice how the linear constraints are specified for this (and every other) NLP function. Let p be the number of parameters in the problem. The first p elements of the first row specify the lower bounds for the parameters; use a missing value for variables that are not bounded below. The last two columns of the first row are not used, so you should always set those elements to missing values. Similarly, the first p elements of the second row specify the upper bound for the variables. The last two columns of the second row are not used, so set those elements to missing values.

Additional rows specify coefficients for the linear constraint. An equality constraint of the form a1*x1 + a2*x2 + ... + an *xn = c is encoded as a row of coefficients {a1 a2 ... an 0 c}, where the 0 in the penultimate location indicates equality. A "less than" inequality of the form a1*x1 + a2*x2 + ... + an *xn ≤ c is encoded as the row {a1 a2 ... an  -1  c}, where the -1 indicates the "less than or equal" symbol. Similarly, a value of +1 in the penultimate location indicates the "greater than or equal" symbol (≥). See the documentation for the NLP routines for additional details about specifying constraints.

Solve quadratic programs in PROC OPTMODEL

If you have access to SAS/OR software, PROC OPTMODEL provides a simple and natural language for solving simple and complex optimization problems. For this problem, PROC OPTMODEL detects that the objective function is quadratic and

/* variable names x and y */
proc optmodel;
   /* unconstrained quadratic optimization */
   var x, y;
   min f = 9*x^2 + x*y + 4*y^2 - 12*x - 4*y + 6;
   solve;                    /* call QP solver */
   print x Fract. y Fract.;
 
   /* constrained quadratic optimization */
   x.lb = 0; x.ub = 1;        /* bounds on variables */
   y.lb = 0; y.ub = 1;
   con SumToOne:              /* linear constraint */
      x + y = 1;
   solve;                     /* call QP solver */
   print x Fract. y Fract.;
quit;

In summary, if you want to optimize a quadratic objective function (with or without constraints), use the NLPQUA subroutine in SAS/IML, which is a fast and efficient way to maximize or minimize a quadratic function of many variables. If you have access to SAS/OR software, PROC OPTMODEL provides an even simpler syntax.

The post Quadratic optimization in SAS appeared first on The DO Loop.

4月 102017
 

Many intervals in statistics have the form p ± δ, where p is a point estimate and δ is the radius (or half-width) of the interval. (For example, many two-sided confidence intervals have this form, where δ is proportional to the standard error.) Many years ago I wrote an article that mentioned that you can construct these intervals in the SAS/IML language by using a concatenation operator (|| or //). The concatenation creates a two-element vector, like this:

proc iml;
mu = 50; 
delta = 1.5;
CI = mu - delta  || mu + delta;   /* horizontal concatenation ==> 1x2 vector */

Last week it occurred to me that there is a simple trick that is even easier: use the fact that SAS/IML is a matrix-vector language to encode the "±" sign as a vector {-1, 1}. When SAS/IML sees a scalar multiplied by a vector, the result will be a vector:

CI = mu + {-1  1}*delta;         /* vector operation ==> 1x2 vector */
print CI;

You can extend this example to compute many intervals by using a single statement. For example, in elementary statistics we learn the "68-95-99.7 rule" for the normal distribution. The rule says that in a random sample drawn from a normal population, about 68% of the observations will be within 1 standard deviation of the mean, about 95% will be within 2 standard deviations, and about 99.7 % will be within 3 standard deviations of the mean. You can construct those intervals by using a "multiplier matrix" whose first row is {-1, +1}, whose second row is {-2, +2}, and whose third row is {-3, +3}. The following SAS/IML statements construct the three intervals for the 69-95-99.7 rule for a normal population with mean 50 and standard deviation 8:

mu = 50;  sigma = 8;
m = {-1 1,
     -2 2,
     -3 3};
 
Intervals = mu + m*sigma;
ApproxPct = {"68%", "95%", "99.7"};
print Intervals[rowname=ApproxPct];

Just for fun, let's simulate a large sample from the normal population and empirically confirm the 68-95-99.7 rule. You can use the RANDFUN function to generate a random sample and use the BIN function to detect which observations are in each interval:

call randseed(12345);
n = 10000;                                /* sample size */
x = randfun(n, "Normal", mu, sigma);      /* simulate normal sample */
ObservedPct = j(3,1,0);
do i = 1 to 3;
   b = bin(x, Intervals[i,]);             /* b[i]=1 if x[i] in interval */
   ObservedPct[i] = sum(b) / n;           /* percentage of x in interval */
end;
 
results = Intervals || {0.68, 0.95, 0.997} || ObservedPct;
print results[colname={"Lower" "Upper" "Predicted" "Observed"}
              label="Probability of Normal Variate in Intervals: X ~ N(50, 8)"];

The simulation confirms the 68-95-99.7 rule. Remember that the rule is a mnemonic device. You can compute the exact probabilities by using the CDF function. In SAS/IML, the exact computation is p = cdf("Normal", m[,2]) - cdf("Normal", m[,1]);

In summary, the SAS/IML language provides an easy syntax to construct intervals that are symmetric about a central value. You can use a vector such as {-1, 1} to construct an interval of the form p ± δ, or you can use a k x 2 matrix to construct k symmetric intervals.

The post A simple trick to construct symmetric intervals appeared first on The DO Loop.

4月 052017
 

Most regression models try to model a response variable by using a smooth function of the explanatory variables. However, if the data are generated from some nonsmooth process, then it makes sense to use a regression function that is not smooth. A simple way to model a discontinuous process in SAS is to use spline effects and specify repeated value for the knots.

Discontinuous processes: More common than you might think

The classical ANOVA is one way to analyze data that are collected before and after a known event. For example, you might record gas mileage for a car before and after a tune-up. You might collect patient data before and after they undergo a medical or surgical treatment. You might have data about real estate prices before and after some natural disaster. In all these cases, you might suspect that the response changes abruptly because of the event.

To give a simple example, suppose that a driver records the fuel economy (in mile per gallon) for a car for 12 weeks. Because the car engine is coughing and knocking, the owner brings the car to a mechanic for maintenance. After the maintenance, the car seems to run better and he records the fuel economy for another six weeks. The hypothetical data are below:

data MPG;
input mpg @@;
week = _N_;
period = ifc(week > 12, "After ", "Before");
label mpg="Miles per Gallon";
datalines;
30.5 28.1 27.1 31.2 25.2 31.1 27.7 28.2 29.6 30.6 28.9 25.9 
30.6 33.0 31.2 29.7 32.7 31.1 
;

Notice that the data contains a binary indicator variable (period) that records whether the data are from before or after the tune-up. You can use PROC GLM to perform a simple ANOVA analysis to determine whether there was a significant change in the mean fuel economy after the maintenance. The following call to PROC GLM runs an ANOVA and indicates that the mean fuel economy is about 2.7 mpg better after the tune-up.

proc glm data=MPG plots=fitplot;
   class period / ref=first;
   model mpg = period /solution;
   output out=out predicted=Pred;
quit;

Graphically, this regression analysis is usually visualized by using two box plots. (PROC GLM creates the box plots automatically when ODS graphics are enabled.) However, because the independent variable is time, you could also use a series plot to show the observed data and the mean response before and after the maintenance. By using the GROUP= option on the SERIES statement, you can get two lines for the "before" and "after" time periods.

title "Piecewise Constant Regression with Jump Discontinuity";
proc sgplot data=Out;
   block x=week block=period / transparency=0.8;
   scatter x=week y=mpg / markerattrs=(symbol=CircleFilled color=black);
   series x=week y=pred / group=period lineattrs=(thickness=3) ;
run;
Piecewise constant regression function with jump discontinuity

The graph shows that the model has a jump discontinuity at the time point at which the maintenance intervention occurred. If you include the WEEK variable in the analysis, you can model the response as a linear function of time, with a jump at the time of the tune-up.

All this is probably very familiar. However, did you know that you can use splines to model the data as a continuous function that has a kink or "corner" at the time of the maintenance event? You can use this feature when the model is continuous, but the slope changes at a known time.

Splines for nonsmooth models

Several SAS procedures support the EFFECT statement, which enables you to build spline effects. The paper "Rediscovering SAS/IML Software" (Wicklin 2010, p. 4) has an example where splines are used to construct a highly nonlinear curve for a scatter plot.

A spline effect is determined by the placement of certain points called "knots." Often knots are evenly spaced within the range of the explanatory variable, but the EFFECT statement supports many other ways to position the knots. In fact, the documentation for the EFFECT statement says: "If you remove the restriction that the knots of a spline must be distinct and allow repeated knots, then you can obtain functions with less smoothness and even discontinuities at the repeated knot location. For a spline of degree d and a repeated knot with multiplicity md, the piecewise polynomials that join such a knot are required to have only d – m matching derivatives."

The degree of a linear regression is d=1, so if you specify a knot position once you obtain a piecewise linear function that contains a "kink" at the knot. The following call to PROC GLIMMIX demonstrates this technique. (I use GLIMMIX because neither PROC GLM nor PROC GENMOD support the EFFECT statement.) You can manually specify the position of knots by using the KNOTMETHOD=LIST(list) option on the EFFECT statement.

proc glimmix data=MPG;
   effect spl = spline(week / degree=1 knotmethod=list(1 13 18));     /* p-w linear */
   *effect spl = spline(week / degree=2 knotmethod=list(1 13 13 1); /*p-w quadratic */
   model mpg = spl / solution;  
   output out=out predicted=Pred;
quit;
 
title "Piecewise Linear Regression with Kink";
proc sgplot data=Out noautolegend;
   block x=week block=period / transparency=0.8;
   scatter x=week y=mpg / markerattrs=(symbol=CircleFilled color=black);
   series x=week y=pred / lineattrs=(thickness=3) ;
run;
Piecewise Linear Regression with Kink

The graph shows that the model is piecewise linear, but that the slope of the model changes at week=13. In contrast, the second EFFECT statement in the PROC GLIMMIX code (which is commented out), specifies piecewise quadratic polynomials (d=2) and repeats the knot at week=13. That results in two quadratic models that give the same predicted value at week=13 but the model is not smooth at that location. Try it out!

If you are using a SAS procedure that does not support the EFFECT statement, you can use the GLMMIX procedure to output the dummy variables that are associated with the spline effects. A nice paper by David Pasta (2003) describes how to use dummy variables in a variety of models. The paper was written before the EFFECT statement; many of the ideas in the paper are easier to implement by using the EFFECT statement.

Lastly, the TRANSREG procedure in SAS supports spline effects but has its own syntax. See the TRANSREG documentation, which includes an example of repeating knots to build a regression model for discontinuous data.

Have you ever needed to construct a nonsmooth regression model? Tell your story by leaving a comment.

The post Nonsmooth models and spline effects appeared first on The DO Loop.

4月 032017
 

One of the advantages of the new mixed-type tables in SAS/IML 14.2 is the greatly enhanced printing functionality. You can control which rows and columns are printed, specify formats for individual columns, and even use templates to completely customize how tables are printed. Printing a table is accomplished by using the TableCreateFromDataSet function. Finally, the TablePrint subroutine prints a customize portion of the table:

data class;
set sashelp.class;
Birthday = '03APR2017'd - age*365 - floor(365*uniform(1)); /* create birthday */
format Birthday DATE9.;
run;
 
proc iml;
tClass = TableCreateFromDataSet("Class");    /* read data into table */
run TablePrint(tClass) firstobs=3 numobs=5 
                       var={"Age" "Birthday"} 
                       ID="Name"
                       label="Subset of Class";
Basic printing of SAS/IML Tables

Notice that the table contains both numeric and character columns. Furthermore, the numeric columns have different formats. The TablePrint subroutine has some distinct advantages over the traditional PRINT statement in SAS/IML:

  • The TablePrint subroutine supports an easy way to display a range of observations. When you use the PRINT statement for multiple vectors, you have to use row subscripts in each vector, such as PRINT (X[3:8,]) (Y[3:8,]);
  • The TablePrint subroutine supports printing any columns in any order. When you use the PRINT statement on a matrix, you have to use column subscripts to change the order of the matrix columns: PRINT (X[, {2 3 1}]);
  • The PRINT statement supports the ROWNAME= option (for specifying row headers), the COLNAME= option (for specifying column headers), and the LABEL= option. Those options are easy to work with when you print a single matrix. However, you can't store mixed-type data in a matrix and those options are less convenient when you print a set of vectors.

Advanced printing of SAS/IML tables

Trafic lighting: Color cells in SAS/IML tables by cell contents

The SAS/IML documentation has several sections of documentation devoted to

For statistical programmers, the ability to use ODS templates means that output from PROC IML can look the same as output from some other SAS procedue. For example, suppose that you have a table that contains parameter estimates for a linear regression. The following example prints that table by using the same ODS template that PROC REG uses, which is the Stat.Reg.ParameterEstimates template:

proc iml;
vars = {"Intercept", X, Z};
stats = {32.19   5.08   21.42   42.97,
          0.138  0.0348  0.0644  0.2117, 
          1.227  0.5302  0.1027  2.3506 }; 
tbl = TableCreate("Variable", vars);
call TableAddVar(tbl, {"Estimate" "StdErr" "LowerCL" "UpperCL"},  stats);
 
Confidence=95;
call TablePrint(tbl) template="Stat.Reg.ParameterEstimates"
                     dynamic={Confidence};
Print SAS/IML tables by using existing ODS templates

This example works because the column names in the SAS/IML table match the names that are expected by the Stat.REG.ParameterEstimates template. The DYNAMIC= option specifies a dynamic variable (Confidence) that the template requires. See the documentation for further details.

Summary

In summary, the TablePrint subroutine in SAS/IML gives programmers control over many options for printing tables of data and statistics. For complex layouts, you can use an existing ODS template or create your own template to customize virtually every aspect of your tabular displays.

The post Print tables in SAS/IML appeared first on The DO Loop.

3月 292017
 

Lists are collections of objects. SAS/IML 14.2 supports lists as a way to store matrices, data tables, and other lists in a single object that you can pass to functions. SAS/IML lists automatically grow if you add new items to them and shrink if you remove items. You can also modify existing items. You can use indices to refer to items of a list, or you can assign names to the items to create a named list.

Create a list

You can create a list by using the ListSetItem subroutine to assign a value to each item:

proc iml;
L = ListCreate(2);                  /* L is two-item list */
call ListSetItem(L, 1, 1:3);        /* 1st item is 1x3 vector */
call ListSetItem(L, 2, {4 3, 2 1}); /* 2nd item is 2x2 matrix */

The arguments for the ListSetItem subroutine are list, index, and value, which is the same order that you use when you assign a value to a matrix element, such as A[2] = 5.

If you do not know how many items the list will contain, or if you later want to add additional items, you can use

X = {3 1 4, 2 5 3};                 /* create a 2x3 matrix */
call ListAddItem(L, X);             /* add 3rd item; list grows */

Notice that the syntax for the ListAddItem subroutine only requires the list and the value because the new item is always added to the end of the list. At this point, the list contains three items, as shown below:

Items in a SAS/IML list can be different shapes and sizes

Insert, modify, and delete list items

A convenient feature of SAS/IML lists is that they automatically resize. You can insert new items into any position in the list. You can also replace an existing item or delete an item. The following statements demonstrate modifying an existing list.

call ListInsertItem(L, 2, -1:1);    /* insert new item before item 2; list grows */
call ListSetItem(L, 1, {A B, C D}); /* replace 1st item with 2x2 character matrix */
call ListDeleteItem(L, 3);          /* delete 3rd item; list shrinks */

The arguments for k, existing higher-indexed items are also renumbered. (Conceptually, items "move to the left.") After the previous sequence of operations, the list contains the following items:

You can insert, modify, and delete items in a SAS/IML list

Create named lists

In the previous section, the list acted like a dynamic array: it stored items that you could access by using indices such as 1, 2, and 3. For some applications, it makes more sense to name the list items and refer to the items by their names rather than their positions. This is similar to "structs" in some languages, where you can refer to members of a struct by name.

For example, suppose a teacher wants to store information about students in her classes. For each student, she might want to store the student's name, class, and test scores. The following statements create a SAS/IML list that has three items named "Name", "Class", and "Scores". The items are then assigned values:

Student = ListCreate({"Name" "Class" "Scores"});  /* create named list with 3 items */
call ListSetItem(Student, "Name", "Ron");         /* set "name" value for student */
call ListSetItem(Student, "Class", "Statistics"); /* set "class" value  for student */
Tests = {100 97 94 100 100};                      /* test scores */
call ListSetItem(Student, "Scores", Tests);       /* set "scores" value for student */

Although you can still use positional indices to access list items ("Ron" is the first item, "Statistics" is the second item, ...), you can also use the name of items. For example, to extract the test scores into a SAS/IML matrix or vector, you can use

s = ListGetItem(Student, "Scores");               /* get test scores from list */

Lists are for storing items, not for computing. You cannot add, subtract, or multiply lists. However, the example shows that you can extract an item into a matrix and subsequently perform algebraic operations on the matrix.

Lists of lists

Lists can contain sublists. In this way you can represent nested or hierarchical data. For example, the teacher might want to store information about several students. If information about each student is stored in a list, then she can use a list of lists to store information about multiple students.

The following statements create a list called All. The first student ("Ron") is added to the list, which means that the item is a copy of the Student list. Then information about a second student ("Karl") is stored in the Student list and copied into the All list as a second item. This process could continue until all students are stored in the list.

All = ListCreate();                  /* empty list */
call ListAddItem(All, Student);      /* add "Ron" to list */
 
call ListSetItem(Student, "Name", "Karl"); 
call ListSetItem(Student, "Class", "Calculus");
call ListSetItem(Student, "Scores", {90 92 84 70 80});  
call ListAddItem(All, Student);      /* add "Karl" to list */

At this point, the All list contains two items. Each item is a list that contains three items.

Summary

SAS/IML 14.2 supports lists, which are containers that store other objects. Lists dynamically grow or shrink as necessary. You can index items by their position in the list (using indices) or you can create a named list and reference items by their names. You can use built-in functions to extract items a list or add new items to a list. You can insert, delete, and modify list items. You can represent hierarchical data by using lists of lists. For more information about SAS/IML lists, see the chapter in the SAS/IML documentation. The documentation shows how lists can be used to emulate other data structures, such as stacks, queues, and trees.

The post Lists: Nonmatrix data structures in SAS/IML appeared first on The DO Loop.

3月 272017
 

Did you know that you can check a SAS macro variable to see if ODS graphics is enabled? The other day I wanted to write a SAS program that creates a graph only if ODS graphics is enabled. The solution is to check the SYSODSGRAPHICS macro variable, which is automatically updated by the SAS system whenever ODS graphics is enabled or disabled. (Thanks to Warren Kuhfeld for this tip!) The SYSODSGRAPHICS variable has the value 0 for "off" and 1 for "on."

For example, the following SAS/IML program checks to see if ODS graphics is on. If so, it creates a bar chart of a categorical variable. If not, it computes the count for each category and prints a frequency table:

ods graphics on;         /* -OR-  ods graphics off */
proc iml;
use sashelp.cars;  read all var "Origin";  close;
 
if &SYSODSGRAPHICS then  /* is ODS graphics enabled? */
   call bar(Origin);     /* optionally display a graphical summary */
/* always display a tabular summary */
call tabulate(Categories, Counts, Origin);
print Counts[colname=Categories];

The SYSODSGRAPHICS automatic macro variable is a recent addition to the automatic macro variables in SAS. You can read Rick Langston's 2015 paper to discover other (relatively) new macro features in SAS 9.3 and SAS 9.4.

SAS has many other automatic macro variables. In general, you can use automatic macro variables in SAS to check the status of the SAS system at run time. Some of my other favorite automatic macro variables (which some people call system macro variables) are the following:

  • SYSERR, SYSERRORTEXT, and SYSWARNINGTEXT: Provide information about errors and warnings from SAS procedures. In my book Statistical Programming with SAS/IML I show how to use these macro variables in conjunction with the SUBMIT/ENDSUBMIT statements in SAS/IML.
  • SYSVER and SYSVLONG: Provide information about your version of SAS. You can use these macro variables to execute newer, more efficient, code if the user has a recent version of SAS.
  • SYSSCP and SYSSCPL: Provide information about the operating system. For example, is SAS running on Windows or Linux?

Do you have a favorite automatic variable? Leave a comment and tell me how you use automatic variables in your SAS programs.

The post Is ODS graphics enabled? Use automatic macro variables to determine the state of SAS appeared first on The DO Loop.