data analysis

7月 052018
 

SAS enables you to evaluate a regression model at any location within the range of the data. However, sometimes you might be interested in how the predicted response is increasing or decreasing at specified locations. You can use finite differences to compute the slope (first derivative) of a regression model. This numerical approximation technique is most useful for nonparametric regression models that cannot be simply written in terms of an analytical formula.

A nonparametric model of drug absorption

The following data are the hypothetical concentrations of a drug in a patient's bloodstream at times (measured in hours) during a 72-hour period after the drug is administered. For a real drug, a pharmacokinetic researcher might construct a parametric model and fit the model by using PROC NLMIXED. The following example uses the EFFECT statement to fit a regression model that uses cubic splines. Other nonparametric models in SAS include loess curves, generalized additive models, adaptive regression, and thin-plate splines.

data Drug;
input Time Concentration @@;
datalines;
1  3    3  7   6 19  12 73  18 81
24 71  36 38  42 28  48 20  72 12
;
 
proc glmselect data=Drug;
   effect spl = spline(Time/ naturalcubic basis=tpf(noint) knotmethod=percentiles(5));
   model Concentration = spl / selection=none; /* fit model by using spline effects */
   store out=SplineModel;                      /* store model for future scoring */
quit;

Because the data are not evenly distributed in time, a graph of the spline fit evaluated at the data points does not adequately show the response curve. Notice that the call to PROC GLMSELECT used a STORE statement to store the model to an item store. You can use PROC PLM to score the model on a uniform grid of values to visualize the regression model:

/* use uniform grid to visualize curve */
data ScoreData;
do Time = 0 to 72;
   output;
end;
run;
 
/* score the model on the uniform grid */
proc plm restore=SplineModel noprint;
   score data=ScoreData out=ScoreResults;
run;
 
/* merge fitted curve with original data and plot the fitted curve */
data All;
set Drug ScoreResults;
run;
 
title "Observed and Predicted Blood Content of Drug";
proc sgplot data=All noautolegend; 
   scatter x=Time y=Concentration;
   series x=Time y=Predicted / name="fit" legendlabel="Predicted Response";
   keylegend "fit" / position=NE location=inside opaque;
   xaxis grid values=(0 to 72 by 12) label="Hours";
   yaxis grid;
run;
Nonparametric model: How can you evaluate the derivative?

Finite difference approximations for derivatives of nonparametric models

A researcher might be interested in knowing the slope of the regression curve at certain time points. The slope indicates the rate of change of the response variable (the blood-level concentration). Because nonparametric regression curves do not have explicit formulas, you cannot use calculus to compute a derivative. However, you can use finite difference formulas to compute a numerical derivative at any point.

There are several finite difference formulas for the first derivative. The forward and backward formulas are less accurate than the central difference formula. Let h be a small value. Then the approximate derivative of a function f at a point t is given by
f′(t) ≈ [ f(t + h) – f(th) ] / 2h

The formula says that you can approximate the slope at t by evaluating the model at the points t ± h. You can use the DIF function to compute the difference between the response function at adjacent time points and divide that difference by 2h. The code that scores the model is similar to the more-familiar case of scoring on a uniform grid. The following statements evaluate the derivative of the model at six-hour intervals.
/* compute derivatives at specified points */
data ScoreData;
h = 0.5e-5;
do t = 6 to 48 by 6;        /* siz-hour intervals */
   Time = t - h;  output;
   Time = t + h;  output;
end;
keep t Time h;
run;
 
/* score the model at each time point */
proc plm restore=SplineModel noprint;
   score data=ScoreData out=ScoreOut;  /* Predicted column contains f(x+h) and f(x-h) */
run;
 
/* compute first derivative by using central difference formula */
data Deriv;
set ScoreOut;
Slope = dif(Predicted) / (2*h);  /* [f(x+h) - f(x-h)] / 2h */
Time = t;                        /* estimate slope at this time */
if mod(_N_,2)=0;                 /* process observations in pairs; drop even obs */
drop h t;
run;
 
proc print data=Deriv;
run;

The output shows that after six hours the drug is entering the bloodstream at 6.8 units per hour. By 18 hours, the rate of absorption has slowed to 0.6 units per hour. After 24 hours, the rate of absorption is negative, which means that the blood-level concentration is decreasing. At approximately 30 hours, the drug is leaving the bloodstream at the rate of -3.5 units per hour.

This technique generalizes to other nonparametric models. If you can score the model, you can use the central difference formula to approximate the first derivative in the interior of the data range.

For more about numerical derivatives, including a finite-difference approximation of the second derivative, see Warren Kuhfeld's article on derivatives for penalized B-splines. Warren's article is focused on how to obtain the predicted values that are generated by the built-in regression models in PROC SGPLOT (LOESS and PBSPLINE), but it contains derivative formulas that apply to any regression curve.

The post Compute derivatives for nonparametric regression models appeared first on The DO Loop.

7月 022018
 
Who was the greatest US president? Presidental greatness scores through 2018

Which president of the United States is ranked the greatest by presidential historians? This article visualizes the results of the 2018 Presidential Greatness Survey, which was created and administered by B. Rottinghaus and J. Vaughn. They analyzed 166 responses from experts in political science who ranked the 44 US presidents on a 0–100 "greatness scale" where 0 represents abject failure, 50 is average, and 100 is great. The survey results are presented in their 2018 report.

Table 1 of the report (p. 2-3) provides the average greatness scores of the presidents among all respondents. Just as "beauty is in the eye of the beholder," so too "greatness" is a subjective notion that depends on the values and beliefs of the assessor. Table 2A (p. 6-7) provides the average greatness scores for respondents by the political party and political ideology of the respondents. Of the respondents, 57.2% were Democrats, 30.1% were Independent or Other, and 12.7% were Republicans. Similarly, 58.4% self-identified as liberal or somewhat liberal, 24.1% self-identified as moderate, and 17.5% reported being conservative or somewhat conservative. The relatively large proportion of liberals should be considered when interpreting the survey results.

I imported the survey results into SAS and combined them to produce a visualization of the data. The graph at the right (click to enlarge) shows the presidents ranked by their overall mean greatness score, which is shown by an X. The mean scores for the conservative, moderate, and liberal respondents are shown as filled circles. The party of the president is indicated by a colored bar that shows the range of scores.

A few results are apparent from the graph:

  • Lincoln, Washington, FDR, Teddy Roosevelt, and Jefferson are the five greatest presidents according to these experts.
  • Obama and Reagan are two recent presidents who are ranked highly.
  • Statistically speaking, there are large groups of presidents for which the relative ranking is uncertain. For example, the presidents ranked 12–20 (Madison through Monroe) have mean scores that are similar.
  • The current president, Donald Trump, is ranked last. In fairness, Trump had not yet completed his first year in office at the time of the survey (December 2017). The authors of the report dedicate several pages (p. 10–13) analyzing a survey question about who was the Most Polarizing president, a category that Trump won handily.
  • Other presidents who scored low include James Buchanan (the "Do-Nothing President"), William Henry Harrison (who died 31 days into his term), and Franklin Pierce (a weak leader in a contentious pre-Civil War era).

You can also graph the data by using the political parties of the experts rather than their ideologies. The graph is very similar.

Bias and disagreement in ranking presidents

One of the most interesting aspects of these data is the relative widths of the ranges of most 20th and 21st century presidents. These wide intervals are caused when experts of one political ideology judge a president much differently than experts from another ideology. Typically, the conservative-liberal disparity amounts to 10–15 points on the "greatness" scale, but for several presidents (Coolidge and Obama) the average conservative score is more than 20 points different from the average liberal score.

This disparity is evidence of the partisan state of politics in the US. Even these experts, who presumably use historical facts rather than opinions to guide their ranking, are not immune from seeing current events through the lens of their own personal values and biases.

To better visualize the range of disagreement among the experts, I graphed the difference between the conservative and liberal scores for each president. Again, the color of the line indicates the political party of the president. The 20th century began with McKinley's presidency. For almost every president since McKinley, the gap for Democratic presidents is negative, indicating that the conservative experts judged the presidency less favorably than the liberal experts. For Republican presidents, the direction of the gap is reversed: the conservative experts ranked the presidency more favorably than their liberal counterparts. The exception is Teddy Roosevelt, whose progressive agenda (the "Square Deal" and conservation measures such as national parks and forests) appeals to liberal experts.

Summary

Rottinghaus and Vaughn's 2018 Presidential Greatness Survey analyzes the opinions of 166 experts as to which US presidents are considered great, average, or a failure. The first graph in this blog post shows that although there is general agreement among experts for many 18th and 19th century presidents (Lincoln and Washington were great; Pierce, W. H. Harrison, and Buchanan were not.), there is less agreement for modern presidents. As seen in the second graph, the variation tends to correlate along ideological grounds, with conservatives and liberals having very different views about which presidents were great.

You can download the SAS program that creates the graphs in this article. For a different visualization of these data, see Robert Lawrence's article "All the U.S. Presidents, Ranked by 'Greatness'."

The post Ranking US presidents appeared first on The DO Loop.

6月 272018
 

This article describes how to obtain an initial guess for nonlinear regression models, especially nonlinear mixed models. The technique is to first fit a simpler fixed-effects model by replacing the random effects with their expected values. The parameter estimates for the fixed-effects model are often good initial guesses for the parameters in the full model.

Background: initial guesses for regression parameters

For linear regression models, you don't usually need to provide an initial guess for the parameters. Least squares methods directly solve for the parameter estimates by using the data. For iterative methods, such as logistic regression model, software usually starts with a pre-determined initial guess of the slope parameters and iteratively refine the guess by using an optimization technique. For example, PROC LOGISTIC in SAS begins by guessing that all slopes are zero. The "slopes all zero" model, which is better known as an intercept-only model, is a reduced model that is related to the full logistic model.

Several other regression procedures also fit a related (simpler) model and then use the estimates for the simpler model as an initial guess for the full model. For example, the "M method" in robust regression starts with a least squares solution and then iteratively refines the coefficients until a robust fit is obtained. Similarly, one way to fit a generalized linear mixed models is to first solve a linear mixed model and use those estimates as starting values for the generalized mixed model.

Fit a fixed-effect model to obtain estimates for a mixed model

The previous paragraphs named procedures that do not require the user to provide an initial guess for parameters. However, procedures such as PROC NLIN and PROC NLMIXED in SAS enable you to fit arbitrary nonlinear regression models. The cost for this generality is that the user must provide a good initial guess for the parameters.

This can be a challenge. To help, the NLIN and NLMIXED procedures enable you to specify a grid of initial values, which can be a valuable technique. For some maximum-likelihood estimates (MLE), you can use the method of moments to choose initial parameters. Although it is difficult to choose good initial parameters for a nonlinear mixed model, the following two-step process can be effective:

  • Substitute the expected value for every random effect in the model. You obtain a fixed-effect model that is related to the original mixed model, but has fewer parameters. Estimate the parameters for the fixed-effect model. (Most random effects are assumed to be normally distributed with zero mean, so in practice you "drop" the random effects.)
  • Use the estimates from the fixed-effect model as an initial guess for the full model. Hopefully, the initial guesses are close enough and the full model will converge quickly.

The following example shows how this two-step process works. The PROC NLMIXED documentation contains an example of data from an experiment that fed two different diets to pregnant rats and observed the survival of pups in the litters. The following SAS statements define the data and show the full nonlinear mixed model in the example:

data rats;
input trt $ m x @@;
x1 = (trt='c');
x2 = (trt='t');
litter = _n_;
datalines;
c 13 13   c 12 12   c  9  9   c  9  9   c  8  8   c  8  8   c 13 12   c 12 11
c 10  9   c 10  9   c  9  8   c 13 11   c  5  4   c  7  5   c 10  7   c 10  7
t 12 12   t 11 11   t 10 10   t  9  9   t 11 10   t 10  9   t 10  9   t  9  8
t  9  8   t  5  4   t  9  7   t  7  4   t 10  5   t  6  3   t 10  3   t  7  0
;
 
title "Full nonlinear mixed model";
proc nlmixed data=rats;
   parms  t1=2 t2=2 s1=1 s2=1;     /* <== documentation values. How to guess these??? */
   bounds s1 s2 > 0;
   eta = x1*t1 + x2*t2 + alpha;
   p   = probnorm(eta);
   model x ~ binomial(m,p);
   random alpha ~ normal(0,x1*s1*s1+x2*s2*s2) subject=litter;
run;
Estimate nonlinear mixed model in SAS by using PROC NLMIXED

The model converges in about 10 iterations for the initial guesses on the PARMS statement. But if you deviate from those values, you might encounter problems where the model does not converge.

Fit a fixed-effect model to obtain initial estimates for the mixed model

The full mixed model has one random effect that involves two variance parameters, s1 and s2. If you replace the random effect with its expected value (0), you obtain the following two-parameter fixed-effect model:

title "Reduced model: Fixed effects only";
proc nlmixed data=rats;
   parms  t1=2 t2=2;
   eta = x1*t1 + x2*t2;
   p   = probnorm(eta);
   model x ~ binomial(m,p);
   ods output ParameterEstimates = FixedEstimates;
run;
Estimate fixed-effect parameters in a reduced model in SAS by using PROC NLMIXED

Notice that the estimates for t1 and t2 in the two-parameter model are close to the values in the full model. However, the two-parameter model is much easier to work with, and you can use grids and other visualization techniques to guide you in estimating the parameters. Notice that the estimates for t1 and t2 are written to a SAS data set called FixedEstimates. You can read those estimates on the PARMS statement by using a second call to PROC NLMIXED. This second call fits a four-parameter model, but the t1 and t2 parameters are hopefully good guesses so you can focus on finding good guesses for s1 and s2 in the second call.

/* full random-effects model. The starting values for the fixed effects
   are the estimates from the fixed-effect-only model */
title "Full model: Estimates from a reduced model";
proc nlmixed data=rats;
   parms s1=0.1 s2=1 / data=FixedEstimates;   /* read in initial values for t1, t2 */
   bounds s1 s2 > 0;
   eta = x1*t1 + x2*t2 + alpha;
   p   = probnorm(eta);       /* standard normal quantile for eta */
   model x ~ binomial(m,p);
   random alpha ~ normal(0,x1*s1*s1+x2*s2*s2) subject=litter;
run;
Use estimates in the reduced (fixed-effect-only) model to estimate a full nonlinear mixed model in SAS by using PROC NLMIXED

Combining a grid search and a reduced (fixed-effect) model

You can combine the previous technique with a grid search for the random-effect parameters. You can use PROC TRANSPOSE to convert the FixedEstimates data set from long form (with the variables 'Parameter' and 'Estimate') to wide form (with variables 't1' and 't2'). You can then use a DATA step to add a grid of values for the s1 and s2 parameters that are used to estimate the random effect, as follows:

/* transpose fixed-effect estimates from long form to wide form */
proc transpose data=FixedEstimates(keep=Parameter Estimate) out=PE(drop=_NAME_);
   id Parameter;
   var Estimate;
run;
 
/* add grid of guesses for the parameters in the random effect */
data AllEstimates;
set PE;
do s1 = 0.5 to 1.5 by 0.5;     /* s1 in [0.5, 1.5] */
   do s2 = 0.5 to 2 by 0.5;    /* s2 in [0.5, 2.0] */
      output;
   end;
end;
run;
 
title "Full model: Estimates from a reduced model and from a grid search";
proc nlmixed data=rats;
   parms / data=AllEstimates;   /* use values from reduced model for t1, t2; grid for s1, s2  */
   bounds s1 s2 > 0;
   eta = x1*t1 + x2*t2 + alpha;
   p   = probnorm(eta);
   model x ~ binomial(m,p);
   random alpha ~ normal(0,x1*s1*s1+x2*s2*s2) subject=litter;
run;

The output is the same as for the initial model and is not shown.

In summary, you can often use a two-step process to help choose initial values for the parameters in a many-parameter mixed model. The first step is to estimate the parameters in a smaller and simpler fixed-effect model by replacing the random effects with their expected values. You can then use those fixed-effect estimates in the full model. You can combine this reduced-model technique with a grid search.

I do not have a reference for this technique. It appears to be "folklore" that "everybody knows," but I was unable to locate an example nor a proof that indicates the conditions under which this technique to work. If you know a good reference for this technique, please leave a comment.

The post Reduced models: A way to choose initial parameters for a mixed model appeared first on The DO Loop.

6月 252018
 

When you fit nonlinear fixed-effect or mixed models, it is difficult to guess the model parameters that fit the data. Yet, most nonlinear regression procedures (such as PROC NLIN and PROC NLMIXED in SAS) require that you provide a good guess! If your guess is not good, the fitting algorithm, which is often a nonlinear optimization algorithm, might not converge. This seems to be a Catch-22 paradox: You can't estimate the parameters until you fit the model, but you can't fit the model until you provide an initial guess!

Fortunately, SAS provides a way to circumvent this paradox: you can specify multiple initial parameter values for the nonlinear regression models in PROC NLIN and PROC NLMIXED. The procedure will evaluate the model on a grid that results from all combinations of the specified values and initialize the optimization by using the value in the grid that provides the best fit. This article shows how to specify a grid of initial values for a nonlinear regression model.

An example that specifies a grid of initial values

The following data and model is from the documentation of the NLMIXED procedure. The data describe the number of failures and operating hours for 10 pumps. The pumps are divided into two groups: those that are operated continuously (group=1) and those operated intermittently (group=2). The number of failures is modeled by a Poisson distribution whose parameter (lambda) is fit by a linear regression with separate slopes and intercepts for each group. The example in the documentation specifies a single initial value (0 or 1) for each of the five parameters. In contrast, the following call to PROC NLMIXED specifies multiple initial values for each parameter:

data pump;
  input y t group;q
  pump = _n_;
  logtstd = log(t) - 2.4564900;
  datalines;
 5  94.320 1
 1  15.720 2
 5  62.880 1
14 125.760 1
 3   5.240 2
19  31.440 1
 1   1.048 2
 1   1.048 2
 4   2.096 2
22  10.480 2
;
 
proc nlmixed data=pump;
   parms logsig 0 1                 /* multiple initial guesses for each parameter */
         beta1 -1 to 1 by 0.5 
         beta2 -1 to 1 by 0.5
         alpha1 1 5 10
         alpha2 1 5;                /* use BEST=10 option to see top 10 choices */
   if (group = 1) then eta = alpha1 + beta1*logtstd + e;
                  else eta = alpha2 + beta2*logtstd + e;
   lambda = exp(eta);
   model y ~ poisson(lambda);
   random e ~ normal(0,exp(2*logsig)) subject=pump;
run;

On the PARMS statement, several guesses are provided for each parameter. Two or three guesses are listed for the logsig, alpha1, and alpha2 parameters. An arithmetic sequence of values is provided for the beta1 and beta2 parameters. The syntax for an arithmetic sequence is start TO stop BY increment, although you can omit the BY clause if the increment is 1. In this example, the arithmetic sequence is equivalent to the list -1, -0.5, 0, 0.5, 1.

The procedure will form the Cartesian product of the guesses for each parameter. Thus the previous syntax specifies a uniform grid of points in parameter space that contains 2 x 5 x 5 x 3 x 2 = 300 initial guesses. As this example indicates, the grid size grows geometrically with the number of values for each parameter. Therefore, resist the urge to use a fine grid for each parameter. That will result in many unnecessary evaluations of the log-likelihood function.

The following table shows a few of the 300 initial parameter values, along with the negative log-likelihood function evaluated at each set of parameters. From among the 300 initial guesses, the procedure will find the combination that results in the smallest value of the negative log-likelihood function and use that value to initialize the optimization algorithm.

If you want to visualize the range of values for the log-likelihood function, you can graph the negative log-likelihood versus the row number. In the following graph, a star indicates the parameter value in the grid that has the smallest value of the negative log-likelihood. I used the IMAGEMAP option on the ODS GRAPHICS statement to turn on tool tips for the graph, so that the parameter values for each point appears when you hover the mouse pointer over the point.

The graph shows that about 10 parameter values have roughly equivalent log-likelihood values. If you want to see those values in a table, you can add the BEST=10 option on the PARMS statement. In this example, there is little reason to choose the smallest parameter value over some of the other values that have a similar log-likelihood.

Create a file for the initial parameters

The ability to specify lists and arithmetic sequences is powerful, but sometimes you might want to use parameter values from a previous experiment or study or from a simplified model for the data. Alternatively, you might want to generate initial parameter values randomly from a distribution such as the uniform, normal, or exponential distributions. The PARMS statement in PROC NLMIXED supports a DATA= option that enables you to specify a data set for which each row is a set of parameter values. (In PROC NLIN and other SAS procedures, use the PDATA= option.) For example, the following DATA step specifies two or three values for the logsig, alpha1, and alpha2 parameters. It specifies random values in the interval (-1, 1) for the beta1 and beta2 parameters. The ParamData data set has 12 rows. The NLMIXED procedure evaluates the model on these 12 guesses, prints the top 5, and then uses the best value to initialize the optimization algorithm:

data ParamData;
call streaminit(54321);
do logsig = 0, 1;
   do alpha1 = 1, 5, 10;
      do alpha2 = 1, 5; 
         beta1 = -1 + 2*rand("uniform");  /* beta1 ~ U(-1, 1) */ 
         beta2 = -1 + 2*rand("uniform");  /* beta2 ~ U(-1, 1) */ 
         output;
      end;
   end;
end;
run;
 
proc nlmixed data=pump;
   parms / DATA=ParamData BEST=5;        /* read guesses for parameters from data set */
   if (group = 1) then 
        eta = alpha1 + beta1*logtstd + e;
   else eta = alpha2 + beta2*logtstd + e;
   lambda = exp(eta);
   model y ~ poisson(lambda);
   random e ~ normal(0,exp(2*logsig)) subject=pump;
run;

As mentioned, the values for the parameters can come from anywhere. Sometimes random values are more effective than values on a regular grid. If you use a BOUNDS statement to specify valid values for the parameters, any infeasible parameters are discarded and only feasible parameter values are used.

Summary

In summary, this article provides an example of a syntax to specify a grid of initial parameters. SAS procedures that support a grid search include NLIN, NLMIXED, MIXED and GLIMMIX (for covariance parameters), SPP, and MODEL. You can also put multiple guesses into a "wide form" data set: the names of the parameters are the columns and each row contains a different combination of the parameters.

It is worth emphasizing that if your model does not converge, it might not be the parameter search that is at fault. It is more likely that the model does not fit your data. Although the technique in this article is useful for finding parameters so that the optimization algorithm converges, no technique cannot compensate for an incorrect model.

The post Use a grid search to find initial parameter values for regression models in SAS appeared first on The DO Loop.

6月 252018
 

When you fit nonlinear fixed-effect or mixed models, it is difficult to guess the model parameters that fit the data. Yet, most nonlinear regression procedures (such as PROC NLIN and PROC NLMIXED in SAS) require that you provide a good guess! If your guess is not good, the fitting algorithm, which is often a nonlinear optimization algorithm, might not converge. This seems to be a Catch-22 paradox: You can't estimate the parameters until you fit the model, but you can't fit the model until you provide an initial guess!

Fortunately, SAS provides a way to circumvent this paradox: you can specify multiple initial parameter values for the nonlinear regression models in PROC NLIN and PROC NLMIXED. The procedure will evaluate the model on a grid that results from all combinations of the specified values and initialize the optimization by using the value in the grid that provides the best fit. This article shows how to specify a grid of initial values for a nonlinear regression model.

An example that specifies a grid of initial values

The following data and model is from the documentation of the NLMIXED procedure. The data describe the number of failures and operating hours for 10 pumps. The pumps are divided into two groups: those that are operated continuously (group=1) and those operated intermittently (group=2). The number of failures is modeled by a Poisson distribution whose parameter (lambda) is fit by a linear regression with separate slopes and intercepts for each group. The example in the documentation specifies a single initial value (0 or 1) for each of the five parameters. In contrast, the following call to PROC NLMIXED specifies multiple initial values for each parameter:

data pump;
  input y t group;q
  pump = _n_;
  logtstd = log(t) - 2.4564900;
  datalines;
 5  94.320 1
 1  15.720 2
 5  62.880 1
14 125.760 1
 3   5.240 2
19  31.440 1
 1   1.048 2
 1   1.048 2
 4   2.096 2
22  10.480 2
;
 
proc nlmixed data=pump;
   parms logsig 0 1                 /* multiple initial guesses for each parameter */
         beta1 -1 to 1 by 0.5 
         beta2 -1 to 1 by 0.5
         alpha1 1 5 10
         alpha2 1 5;                /* use BEST=10 option to see top 10 choices */
   if (group = 1) then eta = alpha1 + beta1*logtstd + e;
                  else eta = alpha2 + beta2*logtstd + e;
   lambda = exp(eta);
   model y ~ poisson(lambda);
   random e ~ normal(0,exp(2*logsig)) subject=pump;
run;

On the PARMS statement, several guesses are provided for each parameter. Two or three guesses are listed for the logsig, alpha1, and alpha2 parameters. An arithmetic sequence of values is provided for the beta1 and beta2 parameters. The syntax for an arithmetic sequence is start TO stop BY increment, although you can omit the BY clause if the increment is 1. In this example, the arithmetic sequence is equivalent to the list -1, -0.5, 0, 0.5, 1.

The procedure will form the Cartesian product of the guesses for each parameter. Thus the previous syntax specifies a uniform grid of points in parameter space that contains 2 x 5 x 5 x 3 x 2 = 300 initial guesses. As this example indicates, the grid size grows geometrically with the number of values for each parameter. Therefore, resist the urge to use a fine grid for each parameter. That will result in many unnecessary evaluations of the log-likelihood function.

The following table shows a few of the 300 initial parameter values, along with the negative log-likelihood function evaluated at each set of parameters. From among the 300 initial guesses, the procedure will find the combination that results in the smallest value of the negative log-likelihood function and use that value to initialize the optimization algorithm.

If you want to visualize the range of values for the log-likelihood function, you can graph the negative log-likelihood versus the row number. In the following graph, a star indicates the parameter value in the grid that has the smallest value of the negative log-likelihood. I used the IMAGEMAP option on the ODS GRAPHICS statement to turn on tool tips for the graph, so that the parameter values for each point appears when you hover the mouse pointer over the point.

The graph shows that about 10 parameter values have roughly equivalent log-likelihood values. If you want to see those values in a table, you can add the BEST=10 option on the PARMS statement. In this example, there is little reason to choose the smallest parameter value over some of the other values that have a similar log-likelihood.

Create a file for the initial parameters

The ability to specify lists and arithmetic sequences is powerful, but sometimes you might want to use parameter values from a previous experiment or study or from a simplified model for the data. Alternatively, you might want to generate initial parameter values randomly from a distribution such as the uniform, normal, or exponential distributions. The PARMS statement in PROC NLMIXED supports a DATA= option that enables you to specify a data set for which each row is a set of parameter values. (In PROC NLIN and other SAS procedures, use the PDATA= option.) For example, the following DATA step specifies two or three values for the logsig, alpha1, and alpha2 parameters. It specifies random values in the interval (-1, 1) for the beta1 and beta2 parameters. The ParamData data set has 12 rows. The NLMIXED procedure evaluates the model on these 12 guesses, prints the top 5, and then uses the best value to initialize the optimization algorithm:

data ParamData;
call streaminit(54321);
do logsig = 0, 1;
   do alpha1 = 1, 5, 10;
      do alpha2 = 1, 5; 
         beta1 = -1 + 2*rand("uniform");  /* beta1 ~ U(-1, 1) */ 
         beta2 = -1 + 2*rand("uniform");  /* beta2 ~ U(-1, 1) */ 
         output;
      end;
   end;
end;
run;
 
proc nlmixed data=pump;
   parms / DATA=ParamData BEST=5;        /* read guesses for parameters from data set */
   if (group = 1) then 
        eta = alpha1 + beta1*logtstd + e;
   else eta = alpha2 + beta2*logtstd + e;
   lambda = exp(eta);
   model y ~ poisson(lambda);
   random e ~ normal(0,exp(2*logsig)) subject=pump;
run;

As mentioned, the values for the parameters can come from anywhere. Sometimes random values are more effective than values on a regular grid. If you use a BOUNDS statement to specify valid values for the parameters, any infeasible parameters are discarded and only feasible parameter values are used.

Summary

In summary, this article provides an example of a syntax to specify a grid of initial parameters. SAS procedures that support a grid search include NLIN, NLMIXED, MIXED and GLIMMIX (for covariance parameters), SPP, and MODEL. You can also put multiple guesses into a "wide form" data set: the names of the parameters are the columns and each row contains a different combination of the parameters.

It is worth emphasizing that if your model does not converge, it might not be the parameter search that is at fault. It is more likely that the model does not fit your data. Although the technique in this article is useful for finding parameters so that the optimization algorithm converges, no technique cannot compensate for an incorrect model.

The post Use a grid search to find initial parameter values for regression models in SAS appeared first on The DO Loop.

6月 042018
 

Years ago, I wrote an article about how to create a Top 10 table and bar chart. The program can be trivially modified to create a "Top N" table and plot, such as Top 5, Top 20, or even Top 100. Not long after the article was written, the developer of PROC FREQ (who had read my blog post) implemented the MAXLEVELS= option on the TABLES statement in PROC FREQ in SAS 9.4. The MAXLEVELS= option automatically creates these tables and plots for you! The option applies only to one-way tables, not to cross tabulations.

A Top 10 plot and bar chart

Suppose you want to see the Top 10 manufacturers of vehicles in the Sashelp.Cars data set. The following call to PROC FREQ uses the MAXLEVELS=10 option to create a Top 10 table and a bar chart of the 10 manufacturers who appear most often in the data:

%let TopN = 10;
proc freq data=sashelp.cars ORDER=FREQ;
  tables make / maxlevels=&TopN Plots=FreqPlot;
run;
Top 10 table in SAS, produced by PROC FREQ
Top 10 bar chart in SAS, produced by PROC FREQ

A few comments about the MAXLEVELS= option:

  • The option affects only the display of the table and chart. The statistics (cumulative counts, percentages, chi-square tests,...) are all based on the full data and all categories. Similarly, if you use the OUT= option to write the counts to an output data set, the data set will contain the counts for all categories, not just the top categories.
  • The "OneWayFreqs" table contains a note at the bottom to remind you that you are looking at a partial table. You can also see that the "Cumulative Percent" column does not end at 100.
  • ThePLOTS=FREQPLOT option contains suboptions that you can use to control aspects of the plot. For example, if you want a horizontal bar chart that shows percentages instead of counts, use the option plots=FreqPlot(orient=horizontal scale=percent).

Sometimes a small change can make a big difference. Although it is not difficult to follow the steps in the original article to create a Top 10 table and chart, the MAXLEVELS= option is even easier.

A Top 10 plot with an "Others" category

For completeness, the following statements are reproduced from an earlier article that shows how to merge the low-frequency categories into a new category with the value "Others." The resulting bar chart shows the most frequent categories and lumps the others into an "Others" category.

%let TopN = 10;
%let VarName = Make;
proc freq data=sashelp.cars ORDER=FREQ noprint;   /* no print. Create output data set of all counts */
  tables &VarName / out=TopOut;
run;
 
data Other;      /* keep the values for the Top categories. Use "Others" for the smaller categories */
set TopOut;
label topCat = "Top Categories or 'Other'";
topCat = &VarName;       /* name of original categorical var */
if _n_ > &TopN then
    topCat = "Other";    /* merge smaller categories */
run;
 
proc freq data=Other ORDER=data;   /* order by data and use WEIGHT statement for counts */
  tables TopCat / plots=FreqPlot(scale=percent);
  weight Count;                    
run;
Bar chart of Top 10 categories and an 'Others' category that shows the count of the smaller categories

In the second PROC FREQ analysis, the smaller categories are merged into a new category called "Others." The "Others" category is suitable for tabular and graphical reporting, but you should think carefully before running statistical tests. As this example shows, the distribution of the new categorical variable might be quite different from the distribution of the original variable.

The post An easy way to make a "Top 10" table and bar chart in SAS appeared first on The DO Loop.

5月 312018
 

A previous article showed how to use a calibration plot to visualize the goodness-of-fit for a logistic regression model. It is common to overlay a scatter plot of the binary response on a predicted probability plot (below, left) and on a calibration plot (below, right):

The SAS program that creates these plots is shown in the previous article. Notice that the markers at Y=0 and Y=1 are displayed by using a scatter plot. Although the sample size is only 500 observations, the scatter plot for the binary response suffers from overplotting. For larger data sets, the scatter plot might appear as two solid lines of markers, which does not provide any insight into the distribution of the horizontal variable. You can plot partially transparent markers, but that does not help the situation much. A better visualization is to eliminate the scatter plot and instead use a binary fringe plot (also called a butterfly fringe plot) to indicate the horizontal positions for each observed response.

A predicted probability plot with binary fringe

A predicted probability plot with a binary fringe plot for logistic regression

A predicted probability plot with a binary fringe plot is shown to the right. Rather than use the same graph area to display the predicted probabilities and the observed responses, a small "butterfly fringe plot" is shown in a panel below the predicted probabilities. The lower panel indicates the counts of the responses by using lines of various lengths. Lines that point down indicate the number of counts for Y=0 whereas lines that point up indicate the counts for Y=1. For these data, the X values less than 1 have many downward-pointing lines whereas the X values greater than 1 have many upward-pointing lines.

To create this plot in SAS, you can do the following:

  1. Use PROC LOGISTIC to output the predicted probabilities and confidence limits for a logistic regression of Y on a continuous explanatory variable X.
  2. Compute the min and max of the continuous explanatory variable.
  3. Use PROC UNIVARIATE to count the number of X values in each of 100 bins in the range [min, max] for Y=0 and Y=1.
  4. Merge the counts with the predicted probabilities.
  5. Define a GTL template to define a panel plot. The main panel shows the predicted probabilities and the lower panel shows the binary fringe plot.
  6. Use PROC SGRENDER to display the panel.

You can download the SAS program that defines the GTL template and creates the predicted probability plot.

A calibration plot with binary fringe

A calibration plot with a binary fringe plot for logistic regression

By using similar steps, you can create a calibration plot with a binary fringe plot as shown to the right. The main panel is used for the calibration plot and a small binary fringe plot is shown in a panel below it. The lower panel shows the counts of the responses at various positions. Note that the horizontal variable is the predicted probability from the model whereas the vertical variable is the empirical probability as estimated by the LOESS procedure. For these simulated data, the fringe plot shows that most of the predicted probabilities are less than 0.2 and these small values mostly correspond to Y=0. The observations for Y=1 mostly have predicted probabilities that are greater than 0.5. The fringe plot reveals that about 77% of the observed responses are Y=0, a fact that was not apparent in the original plots that used a scatter plot to visualize the response variable.

To create this plot in SAS, you can do the following:

  1. Use PROC LOGISTIC to output the predicted probabilities for any logistic regression.
  2. Use PROC LOESS to regress Y onto the predicted probability. This estimates the empirical probability for each value of the predicted probability.
  3. Use PROC UNIVARIATE to count the number of predicted probabilities for each of 100 bins in the range [0, 1] for Y=0 and Y=1.
  4. Merge the counts with the predicted probabilities.
  5. Define a GTL template to define a panel plot. The main panel shows the calibration plot and the lower panel shows the binary fringe plot.
  6. Use PROC SGRENDER to display the panel.

You can download the SAS program that defines the GTL template and creates the calibration plot.

For both plots, the frequencies of the responses are shown by using "needles," but you can make a small change to the GTL to make the fringe plot use thin bars so that it looks more like a butterfly plot of two histograms. See the program for details.

What do you think of this plot? Do you like the way that the binary fringe plot visualizes the response variable, or do you prefer the classic plot that uses a scatter plot to show the positions of Y=0 and Y=1? Leave a comment.

The post Use a fringe plot to visualize binary data in logistic models appeared first on The DO Loop.

5月 162018
 

In my article about how to construct calibration plots for logistic regression models in SAS, I mentioned that there are several popular variations of the calibration plot. The previous article showed how to construct a loess-based calibration curve. Austin and Steyerberg (2013) recommend the loess-based curve on the basis of an extensive simulation study. However, some practitioners prefer to use a decile calibration plot. This article shows how to construct a decile-based calibration curve in SAS.

The decile calibration plot

The decile calibration plot is a graphical analog of the Hosmer-Lemeshow goodness-of-fit test for logistic regression models. The subjects are divided into 10 groups by using the deciles of the predicted probability of the fitted logistic model. Within each group, you compute the mean predicted probability and the mean of the empirical binary response. A calibration plot is a scatter plot of these 10 ordered pairs, although most calibration plots also include the 95% confidence interval for the proportion of the binary responses within each group. Many calibration plots connect the 10 ordered pairs with piecewise line segments, others use a loess curve or a least squares line to smooth the points.

Create the decile calibration plot in SAS

The previous article simulated 500 observations from a logistic regression model logit(p) = b0 + b1*x + b2*x2 where x ~ U(-3, 3). The following call to PROC LOGISTIC fits a linear model to these simulated data. That is, the model is intentionally misspecified. A call to PROC RANK creates a new variable (Decile) that identifies the deciles of the predicted probabilities for the model. This variable is used to compute the means of the predicted probabilities and the empirical proportions (and 95% confidence intervals) for each decile:

/* Use PROC LOGISTIC and output the predicted probabilities.
   Intentionally MISSPECIFY the model as linear. */
proc logistic data=LogiSim noprint;
   model Y(event='1') = x;
   output out=LogiOut predicted=PredProb; /* save predicted probabilities in data set */
run;
 
/* To construct the decile calibration plot, identify deciles of the predicted prob. */
proc rank data=LogiOut out=LogiDecile groups=10;
   var PredProb;
   ranks Decile;
run;
 
/* Then compute the mean predicted prob and the empirical proportions (and CI) for each decile */
proc means data=LogiDecile noprint;
   class Decile;
   types Decile;
   var y PredProb;
   output out=LogiDecileOut mean=yMean PredProbMean
          lclm=yLower uclm=yUpper;
run;
 
title "Calibration Plot for Misspecified Model";
title2 "True Model Is Quadratic; Fit Is Linear";
proc sgplot data=LogiDecileOut noautolegend aspect=1;
   lineparm x=0 y=0 slope=1 / lineattrs=(color=grey pattern=dash);
   *loess x=PredProbMean y=yMean;  /* if you want a smoother based on deciles */
   series x=PredProbMean y=yMean;  /* if you to connect the deciles */
   scatter x=PredProbMean y=yMean / yerrorlower=yLower yerrorupper=yUpper;
   yaxis label="Observed Probability of Outcome";
   xaxis label="Predicted Probability of Outcome";
run;
Decile calibration curve for a misspecified logistic regression model

The diagonal line is the line of perfect calibration. In a well-calibrated model, the 10 markers should lie close to the diagonal line. For this example, the graph indicates that the linear model does not fit the data well. For the first decile of the predicted probability (the lowest predicted-risk group), the observed probability of the event is much higher than the mean predicted probability. For the fourth, sixth, and seventh deciles, the observed probability is much lower than the mean predicted probability. For the tenth decile (the highest predicted-risk group), the observed probability is higher than predicted. By the way, this kind of calibration is sometimes called internal calibration because the same observations are used to fit and assess the model.

The decile calibration plot for a correctly specified model

You can fit a quadratic model to the data to see how the calibration plot changes for a correctly specified model. The results are shown below. In this graph, all markers are close to the diagonal line, which indicates a very close agreement between the predicted and observed probabilities of the event.

Decile calibration curve for a correctly specified logistic regression model

Should you use the decile calibration curve?

The decile-based calibration plot is popular, perhaps because it is so simple that it can be constructed by hand. Nevertheless, Austin and Steyerberg (2013) suggest using the loess-based calibration plot instead of the decile-based plot. Reasons include the following:

  • The use of deciles results in estimates that "display greater variability than is evident in the loess-based method" (p. 524).
  • Several researchers have argued that the use of 10 deciles is arbitrary. Why not use five? Or 15? In fact, the results of the Hosmer-Lemeshow test "can depend markedly on the number of groups, and there's no theory to guide the choice of that number." (P. Allison, 2013. "Why I Don’t Trust the Hosmer-Lemeshow Test for Logistic Regression")

Many leading researchers in logistic regression do not recommend the Hosmer-Lemeshow test for these reasons. The decile-based calibration curve shares the same drawbacks. Since SAS can easily create the loess-based calibration curve (see the previous article), there seems to be little reason to prefer the decile-based version.

The post Decile calibration plots in SAS appeared first on The DO Loop.

5月 142018
 

A logistic regression model is a way to predict the probability of a binary response based on values of explanatory variables. It is important to be able to assess the accuracy of a predictive model. This article shows how to construct a calibration plot in SAS. A calibration plot is a goodness-of-fit diagnostic graph. It enables you to qualitatively compare a model's predicted probability of an event to the empirical probability.

Calibration curves

There are two standard ways to assess the accuracy of a predictive model for a binary response: discrimination and calibration. To assess discrimination, you can use the ROC curve. Discrimination involves counting the number of true positives, false positive, true negatives, and false negatives at various threshold values. In contrast, calibration curves compare the predicted probability of the response to the empirical probability. Calibration can help diagnose lack of fit. It also helps to rank subjects according to risk.

This article discusses internal calibration, which is the agreement on the sample used to fit the model. External calibration, which involves comparing the predicted and observed probabilities on a sample that was not used to fit the model, is not discussed.

In the literature, there are two types of calibration plots. One uses a smooth curve to compare the predicted and empirical probabilities. The curve is usually a loess curve, but sometimes a linear regression curve is used. The other (to be discussed in a future article) splits the data into deciles. An extensive simulation study by Austin and Steyerberg (2013) concluded that loess-based calibration curves have several advantages. This article discusses how to use a loess fit to construct a calibration curve.

A calibration curve for a misspecified model

Calibration curves help to diagnose lack of fit. This example simulates data according to a quadratic model but create a linear fit to that data. The calibration curve should indicate that the model is misspecified.

Step 1: Simulate data for a quadratic logistic regression model

Austin and Steyerberg (2013) use the following simulated data, which is based on the earlier work of Hosmer et al. (1997). Simulate X ~ U(-3, 3) and define η = b0 + b1*X + b2*X2) to be a linear predictor, where b0 = -2.337, b1 = 0.8569, b2 = 0.3011. The logistic model is logit(p) = η, where p = Pr(Y=1 | x) is the probability of the binary event Y=1. The following SAS DATA step simulates N=500 observations from this logistic regression model:

/* simulate sample of size N=500 that follows a quadratic model from Hosmer et al. (1997) */
%let N = 500;                              /* sample size */ 
data LogiSim(keep=Y x);
call streaminit(1234);                     /* set the random number seed */
do i = 1 to &N;                            /* for each observation in the sample */
   /* Hosmer et al. chose x ~ U(-3,3). Use rand("Uniform",-3,3) for modern versions of SAS */
   x = -3 + 6*rand("Uniform");
   eta = -2.337 + 0.8569*x + 0.3011*x**2;  /* quadratic model used by Hosmer et al. */
   mu = logistic(eta);
   Y = rand("Bernoulli", mu);
   output;
end;
run;

For this simulated data, the "true" model is known to be quadratic. Of course, in practice, the true underlying model is unknown.

Step 2: Fit a logistic model

The next step is to fit a logistic regression model and save the predicted probabilities. The following call to PROC LOGISTIC intentionally fits a linear model. The calibration plot will indicate that the model is misspecified.

/* Use PROC LOGISTIC and output the predicted probabilities.
   Intentionally MISSPECIFY the model as linear. */
proc logistic data=LogiSim plots=effect;
   model Y(event='1') = x;
   output out=LogiOut predicted=PredProb; /* save predicted probabilities in data set */
run;
Plot of predicted probabilities for a logistic regression model in SAS

As shown above, PROC LOGISTIC can automatically create a fit plot for a regression that involves one continuous variable. The fit plot shows the observed responses, which are plotted at Y=0 (failure) or Y=1 (success). The predicted probabilities are shown as a sigmoidal curve. The predicted probabilities are in the range (0, 0.8). The next section creates a calibration plot, which is a graph of the predicted probability versus the observed response.

Create the calibration plot

The output from PROC LOGISTIC includes the predicted probabilities and the observed responses. Copas (1983), Harrell et al. (1984), and others have suggested assessing calibration by plotting a scatter plot smoother and overlaying a diagonal line that represents the line of perfect calibration. If the smoother lies close to the diagonal, the model is well calibrated. If there is a systematic deviation from the diagonal line, that indicates that the model might be misspecified.

The following statements sort the data by the predicted probabilities and create the calibration plot. (The sort is optional for the loess smoother, but is useful for other analyses.)

proc sort data=LogiOut;  by PredProb;  run;
 
/* let the data choose the smoothing parameter */
title "Calibration Plot for Misspecified Model";
title2 "True Model Is Quadratic; Fit Is Linear";
proc sgplot data=LogiOut noautolegend aspect=1;
   loess x=PredProb y=y / interpolation=cubic clm;   /* smoothing value by AICC (0.657) */
   lineparm x=0 y=0 slope=1 / lineattrs=(color=grey pattern=dash);
run;
Calibration plot for a misspecified logistic model

The loess smoother does not lie close to the diagonal line, which indicates that the linear model is not a good fit for the quadratic data. The loess curve is systematically above the diagonal line for very small and very large values of the predicted probability. This indicates that the empirical probability (the proportion of events) is higher than predicted on these intervals. In contrast, when the predicted probability is in the range [0.1, 0.45], the empirical probability is lower than predicted.

Several variations of the calibration plot are possible:

  • Add the NOMARKERS option to the LOESS statement to suppress the observed responses.
  • Omit the CLM option if you do not want to see a confidence band.
  • Use the SMOOTH=0.75 option if you want to specify the value for the loess smoothing parameter. The value 0.75 is the value that Austin and Steyerberg used in their simulation study, in part because that is the default parameter value in the R programming language.
  • Use a REG statement instead of the LOESS statement if you want a polynomial smoother.

A calibration plot for a correctly specified model

In the previous section, the model was intentionally misspecified. The calibration curve for the example did not lie close to the diagonal "perfect calibration" line, which indicates a lack of fit. Let's see how the calibration curve looks for the same data if you include a quadratic term in the model. First, generate the predicted probabilities for the quadratic model, then create the calibration curve for the model:

proc logistic data=LogiSim noprint;
   model Y(event='1') = x x*x;  /* fit a quadratic model */
   output out=LogiOut2 predicted=PredProb;
run;
 
proc sort data=LogiOut2;  by PredProb;  run;
 
title "Calibration Plot for a Correct Model";
proc sgplot data=LogiOut2 noautolegend aspect=1;
   loess x=PredProb y=y / interpolation=cubic clm nomarkers;
   lineparm x=0 y=0 slope=1 / lineattrs=(color=grey pattern=dash);
   yaxis grid; xaxis grid;
run;
Calibration plot for a correctly specified logistic model

This calibration plot indicates that the quadratic model fits the data well. The calibration curve is close to the diagonal reference line, which is the line of perfect calibration. The curve indicates that the predicted and empirical probabilities are close to each other for low-risk, moderate-risk, and high-risk subjects.

It is acceptable for the calibration curve to have small dips, bends, and wiggles. The extent that the curve wiggles is determined by the (automatically chosen) smoothing parameter for the loess algorithm, The value of the smoothing parameter is not displayed by PROC SGPLOT, but you can use PROC LOESS to display the smoothing parameter and other statistics for the loess smoother.

According to Austin and Steyerberg (2013), the calibration curve can detect misspecified models when the magnitude of the omitted term (or terms) is moderate to strong. The calibration curve also behaves better for large samples and when the incidence of outcomes is close to 50% (as opposed to having rare outcomes). Lastly, there is a connection between discrimination and calibration: The calibration curve is most effective in models for which the discrimination (as measured by the C-statistic or area under the ROC curve) is good.

In summary, this article shows how to construct a loess-based calibration curve for logistic regression models in SAS. To create a calibration curve, use PROC LOGISTIC to output the predicted probabilities for the model and plot a loess curve that regresses the observed responses onto the predicted probabilities.

My next article shows how to construct a popular alternative to the loess-based calibration curve. In the alternative version, the subjects are ranked into deciles according to the predicted probability or risk.

Further reading

The post Calibration plots in SAS appeared first on The DO Loop.

5月 022018
 

Order matters. When you create a graph that has a categorical axis (such as a bar chart), it is important to consider the order in which the categories appear. Most software defaults to alphabetical order, which typically gives no insight into how the categories relate to each other.

Alphabetical order rarely adds any insight to the data visualization. For a one-dimensional graph (such as a bar chart) software often provides alternative orderings. For example, in SAS you can use the CATEGORYORDER= option to sort categories of a bar chart by frequency. PROC SGPLOT also supports the DISCRETEORDER= option on the XAXIS and YAXIS statements, which you can use to specify the order of the categories.

It is much harder to choose the order of variables for a two-dimensional display such as a heat map (for example, of a correlation matrix) or a matrix of scatter plots. Both of these displays are often improved by strategically choosing an order for the variables so that adjacent levels have similar properties. Catherine Hurley ("Clustering Visualizations of Multidimensional Data", JCGS, 2004) discusses several ways to choose a particular order for p variables from among the p! possible permutations. This article implements the single-link clustering technique for ordering variables. The same technique is applicable for ordering levels of a nominal categorical variable.

Order variables in two-dimensional displays

I first implemented the single-link clustering method in 2008, which was before I started writing this blog. Last week I remembered my decade-old SAS/IML program and decided to blog about it. This article discusses how to use the program; see Hurley's paper for details about how the program works.

To illustrate ordering a set of variables, the following program creates a heat map of the correlation matrix for variables from the Sashelp.Cars data set. The variables are characteristics of motor vehicles. The rows and columns of the matrix display the variables in alphabetical order. The (i,j)th cell of the heat map visualizes the correlation between the i_th and j_th variable:

proc iml;
/* create correlation matric for variables in alphabetical order */
varNames = {'Cylinders' 'EngineSize' 'Horsepower' 'Invoice' 'Length' 
            'MPG_City' 'MPG_Highway' 'MSRP' 'Weight' 'Wheelbase'};
use Sashelp.Cars;  read all var varNames into Y;  close;
corr = corr(Y);
 
call HeatmapCont(corr) xvalues=varNames yvalues=varNames 
            colorramp=colors range={-1.01 1.01} 
            title="Correlation Matrix: Variables in Alphabetical Order";
Heat map of correlation matrix; variables in alphabetical order

The heat map shows that the fuel economy variables (MPG_City and MPG_Highway) are negatively correlated with most other variables. The size of the vehicle and the power of the engine are positively correlated with each other. The correlations with the price variables (Invoice and MSRP) are small.

Although this 10 x 10 heat map visualizes all pairwise correlations, it is possible to permute the variables so that highly correlated variables are adjacent to each other. The single-link cluster method rearranges the variables by using a symmetric "merit matrix." The merit matrix can be a correlation matrix, a distance matrix, or some other matrix that indicates how the i_th category is related to the j_th category.

The following statements assume that the SingleLinkFunction is stored in a module library. The statements load and call the SingleLinkCluster function. The "merit matrix" is the correlation matrix so the algorithm tries to put strongly positively correlated variables next to each other. The function returns a permutation of the indices. The permutation induces a new order on the variables. The correlation matrix for the new order is then displayed:

/* load the module for single-link clustering */
load module=(SingleLinkCluster);
permutation = SingleLinkCluster( corr );  /* permutation of 1:p */
V = varNames[permutation];                /* new order for variables */
R = corr[permutation, permutation];       /* new order for matrix */
call HeatmapCont(R) xvalues=V yvalues=V 
            colorramp=colors range={-1.01 1.01} 
            title="Correlation: Variables in Clustered Order";
Heat map of correlation matrix; variables in clustered order

In this permuted matrix, the variables are ordered according to their pairwise correlation. For this example, the single-link cluster algorithm groups together the three "size" variables (Length, Wheelbase, and Weight), the three "power" variables (EngineSize, Cylinders, and Horsepower), the two "price" variables (MSRP and Invoice), and the two "fuel economy" variables (MPG_Highway and MPG_City). In this representation, strong positive correlations tend to be along the diagonal, as evidenced by the dark green colors near the matrix diagonal.

The same order is useful if you want to construct a huge scatter plot matrix for these variables. Pairs of variables that are "interesting" tend to appear near the diagonal. In this case, the definition of "interesting" is "highly correlated," but you could choose some other statistic to build the "merit matrix" that is used to cluster the variables. The scatter plot matrix is shown below. Click to enlarge.

Scatter plot matrix of 10 variables. The highly correlated variable tend to appear together so that diagonal scatter plot exhibit high correlation.

Other merit matrices

I remembered Hurley's article while I was working on an article about homophily in married couples. Recall that homophily is the tendency for similar individuals to associate, or in this case to marry partners that have similar academic interests. My heat map used the data order for the disciplines, but I wondered whether the order could be improved by using single-link clustering. The data in the heat map are not symmetric, presumably because women and men using different criteria to choose their partners. However, you can construct the "average homophily matrix" in a natural way. If A is any square matrix then (A` + A)/2 is a symmetric matrix that averages the lower- and upper-triangular elements of A. The following image shows the "average homophily" matrix with the college majors ordered by clusters:

The heat map shows the "averaged data," but in a real study you would probably want to display the real data. You can see that green cells and red cells tend to occur in blocks and that most of the dark green cells are close to the diagonal. In particular, the order of adjacent majors on an axis gives information about affinity. For example, here are a few of the adjacent categories:

  • Biology and Medical/Health
  • Engineering tech, computers, and engineering
  • Interdisciplinary studies, philosophy, and theology/religion
  • Math/statistics and physical sciences
  • Communications and business

Summary

In conclusion, when you create a graph that has a nominal categorical axis, you should consider whether the graph can be improved by ordering the categories. The default order, which is often alphabetical, makes it easy to locate a particular category, but there are other orders that use the data to reveal additional structure. This article uses the single-link cluster method, but Hurley (2004) mentions several other methods.

The post Order variables in a heat map or scatter plot matrix appeared first on The DO Loop.