Statistical Programming

August 1, 2018
 

When you use a regression procedure in SAS that supports variable selection (GLMSELECT or QUANTSELECT), did you know that the procedures automatically produce a macro variable that contains the names of the selected variables? This article provides examples and details. A previous article provides an overview of the 'SELECT' procedures in SAS for building statistical models.

The final model in PROC GLMSELECT

PROC GLMSELECT uses variable selection techniques such as LAR and LASSO to fit a parsimonious linear model from a large number of potential regressors. The following call to PROC GLMSELECT is adapted from the "Getting Started" example from the documentation, which models the log-transformed salaries of baseball players by using on-the-field statistics, player characteristics, and team attributes. When PROC GLMSELECT runs, it creates the _GLSIND macro variable, which contains the names of the effects in the final model.

/* GLMSELECT creates &_GLSInd for selected model */
proc glmselect data=Sashelp.Baseball;
   class league division;
   model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                  yrMajor crAtBat crHits crHome crRuns crRbi
                  crBB league division nOuts nAssts nError / selection=lasso;
quit;
 
%put &_GLSInd;      /* display names of effects for selected model */
 
/* Use _GLSInd to call PROC GLM on final selected model */
proc glm data=Sashelp.Baseball;
   class league division;
   model logSalary = &_GLSInd / solution;
quit;
--- SAS Log ---
%put &_GLSInd;
   nHits nRBI nBB YrMajor CrHits CrRuns CrRbi

The MODEL statement in PROC GLMSELECT includes 18 independent variables, but the final LASSO model contains only seven variables. The _GLSInd macro contains the names of the selected variables. As shown in the example, the macro can be used in subsequent analyses. The example uses the macro on the MODEL statement of PROC GLM. The _GLSIND variable contains the original variable names, which means that if you use the CLASS statement (or EFFECT statement) in the GLMSELECT model, you should specify the same statement in subsequent procedures.

If you want to use the selected variables in a procedure that does not support the CLASS statement (or EFFECT statement), you should use the OUTDESIGN= option in PROC GLMSELECT to generate a design matrix and use the _GLSMOD macro variable to specify the names of the dummy variables in the final model. An example is given at the end of this article.

The final model in PROC QUANTSELECT

Similarly, PROC QUANTSELECT creates a macro variable (named _QRSInd) that names the independent variables in the final selected model. For example, the following example models the conditional 90th percentile of the LogSalary variable for the same data set:

/* QUANTSELECT creates _QRSInd for selected model */
proc quantselect data=sashelp.baseball;
   class league division;
   model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                     yrMajor crAtBat crHits crHome crRuns crRbi
                     crBB league division nOuts nAssts nError /
                     selection=lasso quantile=0.9;
run;
 
%put &_QRSInd;
--- SAS Log ---
%put &_QRSInd;
   nRBI CrHome nBB nRuns

When modeling the top 10% of salaries, the selected model includes two factors (career home runs and the number of runs scored in the previous year) that did not appear when predicting the conditional mean of the salaries. You can use the _QRSInd macro to build a predictive model in PROC QUANTREG.
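For example, a minimal sketch (not from the original article) of such a follow-up analysis might look like the following. It assumes that the &_QRSInd macro from the preceding PROC QUANTSELECT call is still defined and refits the selected effects at the same quantile:

/* Hedged sketch (not in the original post): refit the effects that
   QUANTSELECT chose for the 0.9 quantile by using PROC QUANTREG.
   Assumes &_QRSInd is still defined from the call above. */
proc quantreg data=Sashelp.Baseball;
   class league division;      /* same CLASS statement as in the selection step */
   model logSalary = &_QRSInd / quantile=0.9;
run;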

PROC HPQUANTSELECT also creates a macro variable that contains the final selected independent variables. The macro is named _HPQRSIND.

Macros created by other selection procedures in SAS

The HPGENSELECT and LOGISTIC procedures, which can perform variable selection for generalized linear models, do not create a macro variable that contains the selected variables. However, the STEPDISC procedure creates a macro variable named _STDVar that contains the names of the quantitative variables that best discriminate among the classes in a discriminant analysis.
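For example, the following sketch (not from the original article; the Sashelp.Iris data and variable list are assumptions for illustration only) shows how the _STDVar macro might be passed to a later procedure such as PROC DISCRIM:

/* Hedged sketch: stepwise selection of discriminating variables,
   then reuse of the _STDVar macro. Sashelp.Iris is used only for illustration. */
proc stepdisc data=Sashelp.Iris;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

%put &_STDVar;                 /* names of the selected quantitative variables */

proc discrim data=Sashelp.Iris;
   class Species;
   var &_STDVar;               /* use the selected variables in the discriminant analysis */
run;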

What happens if you use splines or split classification variables?

You might wonder what happens if you use the EFFECT statement to generate spline effects and/or use the SPLIT option on the CLASS statement to enable individual levels of a classification variable to enter/leave the model independently of other levels. (For the LASSO, LAR, and ELASTICNET methods, all spline effects and classification effects are split.) In this case, the final model is likely to contain only certain levels of a categorical variable or only certain basis functions for a spline effect. Consequently, you will probably need to use the design matrix and the _GLSMOD macro variable in subsequent procedures.

For example, the following call creates a spline effect with five internal knots and splits the levels of two categorical variables as part of selecting a LASSO model. The final model contains effects such as the "Type='Sedan'" level, the "Origin='Asia'" level, and the first and fourth spline basis functions for the Weight variable:

proc glmselect data=Sashelp.Cars(where=(Type^="Hybrid"))
               outdesign=GLSSplitDesign;      /* create design matrix */
  class origin type / split;                  /* LASSO will force the SPLIT option */
  effect splWt  = spline(weight / details naturalcubic basis=tpf(noint) knotmethod=percentiles(5));
  model mpg_city = origin | type | splWt @ 2  /  selection=Lasso;
quit;
 
%put &_GLSInd;    /* model refers to effects that are not part of the input data */
%put &_GLSMod;    /* model refers to dummy variables that are in the OUTDESIGN= data set */
 
/* for split variables, use the _GLSMOD macro and the OUTDESIGN= data set */
proc glm data=GLSSplitDesign;
   model mpg_city = &_GLSMod / solution;
quit;
--- SAS Log ---
%put &_GLSInd;
   Type_Sedan Type_Sports Origin_Asia*Type_Sedan splWt:1 splWt:4 splWt:4*Origin_USA
%put &_GLSMod;
   Type_Sedan Type_Sports Origin_Asia_Type_Sedan splWt_1 splWt_4 splWt_4_Origin_USA

Notice that the _GLSIND macro contains names such as "splWt:1" that are not in the input data set. However, the names in the _GLSMOD macro are all valid columns in the GLSSplitDesign data set, which was created by using the OUTDESIGN= option. So if you are using the LASSO, LAR, or ELASTICNET methods of variable selection or if you manually specify the SPLIT option, you will want to use the _GLSMOD macro and the design data set in subsequent analyses.

The post Which variables are in the final selected model? appeared first on The DO Loop.

June 27, 2018
 

This article describes how to obtain an initial guess for nonlinear regression models, especially nonlinear mixed models. The technique is to first fit a simpler fixed-effects model by replacing the random effects with their expected values. The parameter estimates for the fixed-effects model are often good initial guesses for the parameters in the full model.

Background: initial guesses for regression parameters

For linear regression models, you don't usually need to provide an initial guess for the parameters. Least squares methods directly solve for the parameter estimates by using the data. For iterative methods, such as fitting a logistic regression model, software usually starts with a predetermined initial guess for the slope parameters and iteratively refines the guess by using an optimization technique. For example, PROC LOGISTIC in SAS begins by guessing that all slopes are zero. The "slopes all zero" model, which is better known as an intercept-only model, is a reduced model that is related to the full logistic model.

Several other regression procedures also fit a related (simpler) model and then use the estimates for the simpler model as an initial guess for the full model. For example, the "M method" in robust regression starts with a least squares solution and then iteratively refines the coefficients until a robust fit is obtained. Similarly, one way to fit a generalized linear mixed model is to first solve a linear mixed model and use those estimates as starting values for the generalized mixed model.

Fit a fixed-effect model to obtain estimates for a mixed model

The previous paragraphs named procedures that do not require the user to provide an initial guess for parameters. However, procedures such as PROC NLIN and PROC NLMIXED in SAS enable you to fit arbitrary nonlinear regression models. The cost for this generality is that the user must provide a good initial guess for the parameters.

This can be a challenge. To help, the NLIN and NLMIXED procedures enable you to specify a grid of initial values, which can be a valuable technique. For some maximum likelihood estimation (MLE) problems, you can use the method of moments to choose initial parameters. Although it is difficult to choose good initial parameters for a nonlinear mixed model, the following two-step process can be effective:

  • Substitute the expected value for every random effect in the model. You obtain a fixed-effect model that is related to the original mixed model, but has fewer parameters. Estimate the parameters for the fixed-effect model. (Most random effects are assumed to be normally distributed with zero mean, so in practice you "drop" the random effects.)
  • Use the estimates from the fixed-effect model as an initial guess for the full model. Hopefully, the initial guesses are close enough and the full model will converge quickly.

The following example shows how this two-step process works. The PROC NLMIXED documentation contains an example of data from an experiment that fed two different diets to pregnant rats and observed the survival of pups in the litters. The following SAS statements define the data and show the full nonlinear mixed model in the example:

data rats;
input trt $ m x @@;
x1 = (trt='c');
x2 = (trt='t');
litter = _n_;
datalines;
c 13 13   c 12 12   c  9  9   c  9  9   c  8  8   c  8  8   c 13 12   c 12 11
c 10  9   c 10  9   c  9  8   c 13 11   c  5  4   c  7  5   c 10  7   c 10  7
t 12 12   t 11 11   t 10 10   t  9  9   t 11 10   t 10  9   t 10  9   t  9  8
t  9  8   t  5  4   t  9  7   t  7  4   t 10  5   t  6  3   t 10  3   t  7  0
;
 
title "Full nonlinear mixed model";
proc nlmixed data=rats;
   parms  t1=2 t2=2 s1=1 s2=1;     /* <== documentation values. How to guess these??? */
   bounds s1 s2 > 0;
   eta = x1*t1 + x2*t2 + alpha;
   p   = probnorm(eta);
   model x ~ binomial(m,p);
   random alpha ~ normal(0,x1*s1*s1+x2*s2*s2) subject=litter;
run;
Estimate nonlinear mixed model in SAS by using PROC NLMIXED

The model converges in about 10 iterations for the initial guesses on the PARMS statement. But if you deviate from those values, you might encounter problems where the model does not converge.

Fit a fixed-effect model to obtain initial estimates for the mixed model

The full mixed model has one random effect that involves two variance parameters, s1 and s2. If you replace the random effect with its expected value (0), you obtain the following two-parameter fixed-effect model:

title "Reduced model: Fixed effects only";
proc nlmixed data=rats;
   parms  t1=2 t2=2;
   eta = x1*t1 + x2*t2;
   p   = probnorm(eta);
   model x ~ binomial(m,p);
   ods output ParameterEstimates = FixedEstimates;
run;
Estimate fixed-effect parameters in a reduced model in SAS by using PROC NLMIXED

Notice that the estimates for t1 and t2 in the two-parameter model are close to the values in the full model. However, the two-parameter model is much easier to work with, and you can use grids and other visualization techniques to guide you in estimating the parameters. The estimates for t1 and t2 are written to a SAS data set called FixedEstimates. You can read those estimates on the PARMS statement by using a second call to PROC NLMIXED. This second call fits a four-parameter model, but the t1 and t2 values are (hopefully) good guesses, so you can focus on finding good guesses for s1 and s2 in the second call.

/* full random-effects model. The starting values for the fixed effects
   are the estimates from the fixed-effect-only model */
title "Full model: Estimates from a reduced model";
proc nlmixed data=rats;
   parms s1=0.1 s2=1 / data=FixedEstimates;   /* read in initial values for t1, t2 */
   bounds s1 s2 > 0;
   eta = x1*t1 + x2*t2 + alpha;
   p   = probnorm(eta);       /* standard normal quantile for eta */
   model x ~ binomial(m,p);
   random alpha ~ normal(0,x1*s1*s1+x2*s2*s2) subject=litter;
run;
Use estimates in the reduced (fixed-effect-only) model to estimate a full nonlinear mixed model in SAS by using PROC NLMIXED

Combining a grid search and a reduced (fixed-effect) model

You can combine the previous technique with a grid search for the random-effect parameters. You can use PROC TRANSPOSE to convert the FixedEstimates data set from long form (with the variables 'Parameter' and 'Estimate') to wide form (with variables 't1' and 't2'). You can then use a DATA step to add a grid of values for the s1 and s2 parameters that are used to estimate the random effect, as follows:

/* transpose fixed-effect estimates from long form to wide form */
proc transpose data=FixedEstimates(keep=Parameter Estimate) out=PE(drop=_NAME_);
   id Parameter;
   var Estimate;
run;
 
/* add grid of guesses for the parameters in the random effect */
data AllEstimates;
set PE;
do s1 = 0.5 to 1.5 by 0.5;     /* s1 in [0.5, 1.5] */
   do s2 = 0.5 to 2 by 0.5;    /* s2 in [0.5, 2.0] */
      output;
   end;
end;
run;
 
title "Full model: Estimates from a reduced model and from a grid search";
proc nlmixed data=rats;
   parms / data=AllEstimates;   /* use values from reduced model for t1, t2; grid for s1, s2  */
   bounds s1 s2 > 0;
   eta = x1*t1 + x2*t2 + alpha;
   p   = probnorm(eta);
   model x ~ binomial(m,p);
   random alpha ~ normal(0,x1*s1*s1+x2*s2*s2) subject=litter;
run;

The output is the same as for the initial model and is not shown.

In summary, you can often use a two-step process to help choose initial values for the parameters in a many-parameter mixed model. The first step is to estimate the parameters in a smaller and simpler fixed-effect model by replacing the random effects with their expected values. You can then use those fixed-effect estimates in the full model. You can combine this reduced-model technique with a grid search.

I do not have a reference for this technique. It appears to be "folklore" that "everybody knows," but I was unable to locate an example or a proof that indicates the conditions under which this technique works. If you know a good reference for this technique, please leave a comment.

The post Reduced models: A way to choose initial parameters for a mixed model appeared first on The DO Loop.

June 25, 2018
 

When you fit nonlinear fixed-effect or mixed models, it is difficult to guess the model parameters that fit the data. Yet, most nonlinear regression procedures (such as PROC NLIN and PROC NLMIXED in SAS) require that you provide a good guess! If your guess is not good, the fitting algorithm, which is often a nonlinear optimization algorithm, might not converge. This seems to be a Catch-22 paradox: You can't estimate the parameters until you fit the model, but you can't fit the model until you provide an initial guess!

Fortunately, SAS provides a way to circumvent this paradox: you can specify multiple initial parameter values for the nonlinear regression models in PROC NLIN and PROC NLMIXED. The procedure will evaluate the model on a grid that results from all combinations of the specified values and initialize the optimization by using the value in the grid that provides the best fit. This article shows how to specify a grid of initial values for a nonlinear regression model.

An example that specifies a grid of initial values

The following data and model is from the documentation of the NLMIXED procedure. The data describe the number of failures and operating hours for 10 pumps. The pumps are divided into two groups: those that are operated continuously (group=1) and those operated intermittently (group=2). The number of failures is modeled by a Poisson distribution whose parameter (lambda) is fit by a linear regression with separate slopes and intercepts for each group. The example in the documentation specifies a single initial value (0 or 1) for each of the five parameters. In contrast, the following call to PROC NLMIXED specifies multiple initial values for each parameter:

data pump;
  input y t group;
  pump = _n_;
  logtstd = log(t) - 2.4564900;
  datalines;
 5  94.320 1
 1  15.720 2
 5  62.880 1
14 125.760 1
 3   5.240 2
19  31.440 1
 1   1.048 2
 1   1.048 2
 4   2.096 2
22  10.480 2
;
 
proc nlmixed data=pump;
   parms logsig 0 1                 /* multiple initial guesses for each parameter */
         beta1 -1 to 1 by 0.5 
         beta2 -1 to 1 by 0.5
         alpha1 1 5 10
         alpha2 1 5;                /* use BEST=10 option to see top 10 choices */
   if (group = 1) then eta = alpha1 + beta1*logtstd + e;
                  else eta = alpha2 + beta2*logtstd + e;
   lambda = exp(eta);
   model y ~ poisson(lambda);
   random e ~ normal(0,exp(2*logsig)) subject=pump;
run;

On the PARMS statement, several guesses are provided for each parameter. Two or three guesses are listed for the logsig, alpha1, and alpha2 parameters. An arithmetic sequence of values is provided for the beta1 and beta2 parameters. The syntax for an arithmetic sequence is start TO stop BY increment, although you can omit the BY clause if the increment is 1. In this example, the arithmetic sequence is equivalent to the list -1, -0.5, 0, 0.5, 1.

The procedure will form the Cartesian product of the guesses for each parameter. Thus the previous syntax specifies a uniform grid of points in parameter space that contains 2 x 5 x 5 x 3 x 2 = 300 initial guesses. As this example indicates, the grid size grows geometrically with the number of values for each parameter. Therefore, resist the urge to use a fine grid for each parameter. That will result in many unnecessary evaluations of the log-likelihood function.

The following table shows a few of the 300 initial parameter values, along with the negative log-likelihood function evaluated at each set of parameters. From among the 300 initial guesses, the procedure will find the combination that results in the smallest value of the negative log-likelihood function and use that value to initialize the optimization algorithm.

If you want to visualize the range of values for the log-likelihood function, you can graph the negative log-likelihood versus the row number. In the following graph, a star indicates the parameter value in the grid that has the smallest value of the negative log-likelihood. I used the IMAGEMAP option on the ODS GRAPHICS statement to turn on tool tips for the graph, so that the parameter values for each point appear when you hover the mouse pointer over the point.

The graph shows that about 10 parameter values have roughly equivalent log-likelihood values. If you want to see those values in a table, you can add the BEST=10 option on the PARMS statement. In this example, there is little reason to choose the smallest parameter value over some of the other values that have a similar log-likelihood.
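For reference, here is a minimal sketch of that variation. It reuses the pump model from the earlier code and changes only the PARMS statement:

/* Sketch: same grid of starting values, but display only the 10 best
   grid points (smallest value of the objective function) */
proc nlmixed data=pump;
   parms logsig 0 1
         beta1 -1 to 1 by 0.5
         beta2 -1 to 1 by 0.5
         alpha1 1 5 10
         alpha2 1 5 / best=10;
   if (group = 1) then eta = alpha1 + beta1*logtstd + e;
                  else eta = alpha2 + beta2*logtstd + e;
   lambda = exp(eta);
   model y ~ poisson(lambda);
   random e ~ normal(0,exp(2*logsig)) subject=pump;
run;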

Create a file for the initial parameters

The ability to specify lists and arithmetic sequences is powerful, but sometimes you might want to use parameter values from a previous experiment or study or from a simplified model for the data. Alternatively, you might want to generate initial parameter values randomly from a distribution such as the uniform, normal, or exponential distributions. The PARMS statement in PROC NLMIXED supports a DATA= option that enables you to specify a data set for which each row is a set of parameter values. (In PROC NLIN and other SAS procedures, use the PDATA= option.) For example, the following DATA step specifies two or three values for the logsig, alpha1, and alpha2 parameters. It specifies random values in the interval (-1, 1) for the beta1 and beta2 parameters. The ParamData data set has 12 rows. The NLMIXED procedure evaluates the model on these 12 guesses, prints the top 5, and then uses the best value to initialize the optimization algorithm:

data ParamData;
call streaminit(54321);
do logsig = 0, 1;
   do alpha1 = 1, 5, 10;
      do alpha2 = 1, 5; 
         beta1 = -1 + 2*rand("uniform");  /* beta1 ~ U(-1, 1) */ 
         beta2 = -1 + 2*rand("uniform");  /* beta2 ~ U(-1, 1) */ 
         output;
      end;
   end;
end;
run;
 
proc nlmixed data=pump;
   parms / DATA=ParamData BEST=5;        /* read guesses for parameters from data set */
   if (group = 1) then 
        eta = alpha1 + beta1*logtstd + e;
   else eta = alpha2 + beta2*logtstd + e;
   lambda = exp(eta);
   model y ~ poisson(lambda);
   random e ~ normal(0,exp(2*logsig)) subject=pump;
run;

As mentioned, the values for the parameters can come from anywhere. Sometimes random values are more effective than values on a regular grid. If you use a BOUNDS statement to specify valid values for the parameters, any infeasible parameters are discarded and only feasible parameter values are used.

Summary

In summary, this article provides an example of a syntax to specify a grid of initial parameters. SAS procedures that support a grid search include NLIN, NLMIXED, MIXED and GLIMMIX (for covariance parameters), SPP, and MODEL. You can also put multiple guesses into a "wide form" data set: the names of the parameters are the columns and each row contains a different combination of the parameters.

It is worth emphasizing that if your model does not converge, it might not be the parameter search that is at fault. It is more likely that the model does not fit your data. Although the technique in this article is useful for finding parameters so that the optimization algorithm converges, no technique can compensate for an incorrect model.

The post Use a grid search to find initial parameter values for regression models in SAS appeared first on The DO Loop.

June 20, 2018
 

A previous article provides an example of using the BOOTSTRAP statement in PROC TTEST to compute bootstrap estimates of statistics in a two-sample t test. The BOOTSTRAP statement is new in SAS/STAT 14.3 (SAS 9.4M5). However, you can perform the same bootstrap analysis in earlier releases of SAS by using procedures in Base SAS and SAS/STAT. This article gives an example of how to bootstrap in SAS.

The main steps of the bootstrap method in SAS

A previous article describes how to construct a bootstrap confidence interval in SAS. The major steps of a bootstrap analysis follow:

  1. Compute the statistic of interest for the original data
  2. Resample B times (with replacement) from the data to form B bootstrap samples. The resampling process should respect the structure of the analysis and the null hypothesis. In SAS it is most efficient to use the DATA step or PROC SURVEYSELECT to put all B random bootstrap samples into a single data set.
  3. Use BY-group processing to compute the statistic of interest on each bootstrap sample. The BY-group approach is much faster than using macro loops. The union of the statistics forms the bootstrap distribution, which approximates the sampling distribution of the statistic under the null hypothesis.
  4. Use the bootstrap distribution to obtain bootstrap estimates of bias, standard errors, and confidence intervals.

Compute the statistic of interest

This article uses the same bootstrap example as the previous article. The following SAS DATA step subsets the Sashelp.Cars data to create a data set that contains two groups: "SUV" and "Sedan". There are 60 SUVs and 262 sedans. The statistic of interest is the difference of means between the two groups. A call to PROC TTEST computes the difference between group means for the data:

data Sample;    /* create the sample data. The two groups are "SUV" and "Sedan" */
set Sashelp.Cars(keep=Type MPG_City);
if Type in ('Sedan' 'SUV');
run;
 
/* 1. Compute statistic (difference of means) for data */
proc ttest data=Sample;
   class Type;
   var MPG_City;
   ods output Statistics=SampleStats;   /* save statistic in SAS data set */
run;
 
/* 1b. OPTIONAL: Store sample statistic in a macro variable for later use */
proc sql noprint;
select Mean into :Statistic
       from SampleStats where Method="Satterthwaite";
quit;
%put &=Statistic;
STATISTIC= -4.9840

The point estimate for the difference of means between groups is -4.98. The TTEST procedure produces a graph (not shown) that indicates that the MPG_City variable is moderately skewed for the "Sedan" group. Therefore you might question the usefulness of the classical parametric estimates for the standard error and confidence interval for the difference of means. The following bootstrap analysis provides a nonparametric estimate about the accuracy of the difference of means.

Resample from the data

For many resampling schemes, PROC SURVEYSELECT is the simplest way to generate bootstrap samples. The documentation for PROC TTEST states, "In a bootstrap for a two-sample design, random draws of size n1 and n2 are taken with replacement from the first and second groups, respectively, and combined to produce a single bootstrap sample." One way to carry out this sampling scheme is to use the STRATA statement in PROC SURVEYSELECT to sample (with replacement) from the "SUV" and "Sedan" groups. To perform stratified sampling, sort the data by the STRATA variable. The following statements sort the data and generate 10,000 bootstrap samples by drawing random samples (with replacement) from each group:

/* 2. Sample with replacement from each stratum. First sort by the STRATA variable. */
proc sort data=Sample; by Type; run;
 
/* Then perform stratified sampling with replacement */
proc surveyselect data=Sample out=BootSamples noprint seed=123 
                  method=urs              /* with replacement */
                  /* OUTHITS */           /* use OUTHITS option when you do not want a frequency variable */
                  samprate=1 reps=10000;  /* 10,000 resamples */
   strata Type;   /* sample N1 from first group and N2 from second */
run;

The BootSamples data set contains 10,000 random resamples. Each sample contains 60 SUVs and 262 sedans, just like the original data. The BootSamples data contains a variable named NumberHits that contains the frequency with which each original observation appears in the resample. If you prefer to use duplicated observations, specify the OUTHITS option in the PROC SURVEYSELECT statement. The different samples are identified by the values of the Replicate variable.

BY-group analysis of bootstrap samples

Recall that a BY-group analysis is an efficient way to process 10,000 bootstrap samples. Recall also that it is efficient to suppress output when you perform a large BY-group analysis. The following macros encapsulate the commands that suppress ODS objects prior to a simulation or bootstrap analysis and then permit the objects to appear after the analysis is complete:

/* Define useful macros */
%macro ODSOff(); /* Call prior to BY-group processing */
ods graphics off;  ods exclude all;  ods noresults;
%mend;
 
%macro ODSOn(); /* Call after BY-group processing */
ods graphics on;  ods exclude none;  ods results;
%mend;

With these definitions, the following call to PROC TTEST computes the Satterthwaite test statistic for each bootstrap sample. Notice that you need to sort the data by the Replicate variable because the BootSamples data are ordered by the values of the Type variable. Note also that the NumberHits variable is used as a FREQ variable.

/* 3. Compute statistics */
proc sort data = BootSamples; by Replicate Type; run;
 
%ODSOff                          /* suppress output */
proc ttest data=BootSamples;
   by Replicate;
   class Type;
   var MPG_City;
   freq NumberHits;              /* Use FREQ variable in analysis (or use OUTHITS option) */
   ods output ConfLimits=BootDist(where=(method="Satterthwaite")
              keep=Replicate Variable Class Method Mean rename=(Mean=DiffMeans)); 
run; 
%ODSOn                           /* enable output   */

Obtain estimates from the bootstrap distribution

At this point in the bootstrap example, the data set BootDist contains the bootstrap distribution in the variable DiffMeans. You can use this variable to compute various bootstrap statistics. For example, the bootstrap estimate of the standard error is the standard deviation of the DiffMeans variable. The estimate of bias is the difference between the mean of the bootstrap estimates and the original statistic. The percentiles of the DiffMeans variable can be used to construct a confidence interval. (Or you can use a different interval estimate, such as the bias-adjusted and corrected interval.) You might also want to graph the bootstrap distribution. The following statements use PROC UNIVARIATE to compute these estimates:

/* 4. Plot sampling distribution of difference of sample means. Write stats to BootStats data set */ 
proc univariate data=BootDist; /* use NOPRINT option to suppress output and graphs */
   var DiffMeans;
   histogram DiffMeans;      /* OPTIONAL */
   output out=BootStats pctlpts =2.5  97.5  pctlname=P025 P975
                  pctlpre =Mean_ mean=BootMean std=BootStdErr;
run;
 
/* use original sample statistic to compute bias */
data BootStats;
set BootStats;
Bias = BootMean - &Statistic;
label Mean_P025="Lower 95% CL"  Mean_P975="Upper 95% CL";
run;
 
proc print data=BootStats noobs; 
   var BootMean BootStdErr Bias Mean_P025 Mean_P975;
run;

The results are shown. The bootstrap distribution appears to be normally distributed. This indicates that the bootstrap estimates will probably be similar to the classical parametric estimates. For this problem, the classical estimate of the standard error is 0.448 and a 95% confidence interval for the difference of means is [-5.87, -4.10]. In comparison, the bootstrap estimates are 0.444 and [-5.87, -4.13]. In spite of the skewness of the MPG_City variable for the "Sedan" group, the two-sample Satterthwaite t test provides similar estimates regarding the accuracy of the point estimate for the difference of means. The bootstrap statistics also are similar to the statistics that you can obtain by using the BOOTSTRAP statement in PROC TTEST in SAS/STAT 14.3.
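For comparison, in SAS/STAT 14.3 or later you can request the bootstrap analysis directly. The following is a hedged sketch based on the documented BOOTSTRAP statement; the NSAMPLES= and SEED= option names are assumptions and should be verified against the documentation for your release:

/* Hedged sketch of the built-in bootstrap (SAS/STAT 14.3 or later).
   Option names are assumptions; check the PROC TTEST documentation. */
proc ttest data=Sample;
   class Type;
   var MPG_City;
   bootstrap / nsamples=10000 seed=123;
run;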

In summary, you can use Base SAS and SAS/STAT procedures to compute a bootstrap analysis of a two-sample t test. Although the "manual" bootstrap requires more programming effort than using the BOOTSTRAP statement in PROC TTEST, the example in this article generalizes to other statistics for which a built-in bootstrap option is not supported. This article also shows how to use PROC SURVEYSELECT to perform stratified sampling as part of a bootstrap analysis that involves sampling from multiple groups.

The post The bootstrap method in SAS: A t test example appeared first on The DO Loop.

June 8, 2018
 

I recently recorded a short video about the new syntax for specifying and manipulating lists in SAS/IML 14.3. This is a video of my Super Demo at SAS Global Forum 2018. The new syntax supports dynamic arrays, associative arrays ("named lists"), and hierarchical data structures such as lists of lists.


If your browser does not support embedded video, you can go directly to the video on YouTube.

If you prefer to read about lists, the following references provide more information and additional examples of ways to use lists:

The post Video: A new syntax for lists in SAS/IML appeared first on The DO Loop.

April 16, 2018
 

A colleague and I recently discussed how to generate random permutations without encountering duplicates. Given a set of n items, there are n! permutations. My colleague wants to generate k unique permutations at random from among the total of n!. Said differently, he wants to sample without replacement from the set of all possible permutations.

In SAS, you can use the RANPERM function to generate random permutations. For example, the statement p = ranperm(4) generates a random permutation of the elements in the set {1, 2, 3, 4}. However, the function does not ensure that the permutations it generates will be unique.
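For example, the following sketch (not part of the original post) asks for several permutations of a small set; because there are only 4! = 24 possible permutations, repeated rows are likely:

/* Sketch: RANPERM can return duplicate permutations */
proc iml;
call randseed(4321);
p = ranperm(4, 10);   /* 10 random permutations of {1,2,3,4}; rows are not guaranteed unique */
print p;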

I can think of two ways to generate random permutations without replacement. The first is to generate all permutations and then choose a random subset of size k without replacement from the set of n! possibilities. This technique is efficient when n is small or when k is a substantial proportion of n!. The second technique is to randomly generate permutations until you get a total of k that are unique. This technique is efficient when n is large and k is small relative to n!.

Generate all permutations, then sample without replacement

For small sets (say, n ≤ 8), an efficient way to generate random permutations is to generate all permutations and then extract a random sample without replacement. In the SAS/IML language you can use the ALLPERM function to generate all permutations of a set of n items, where n ≤ 18. The ALLPERM function returns a matrix that has n! rows and n columns. Each row is a permutation. You can then use the SAMPLE function to sample without replacement from among the rows, as follows:

proc iml;
call randseed(123);
n = 4;                       /* permutations of n items */
k = 3;                       /* want k unique permutations */
 
p = allperm(n);               /* matrix has n! rows; each row is a permutation */
rows = sample(1:nrow(p), k, "WOR"); /* k random rows w/o replacement */
ranPermWOR = p[rows, ];       /* extract rows */
print ranPermWOR;

Generate random permutations, then check for uniqueness

The matrix of all permutations has n! rows, which gets big fast. If you want only a small number of permutations from among a huge set of possible permutations, it is more efficient to use the RANPERM function to generate permutations, then discard duplicates. Last week I showed how to eliminate duplicate rows from a numeric matrix so that the remaining rows are unique. The following SAS/IML statements use the UniqueRows function, which is defined in the previous post. You must define or load the module before you can use it.

/* <define or load the DupRows and UniqueRows modules HERE> */
 
n = 10;                      /* permutations of n items */
k = 5;                       /* want k unique permutations; k << n! */
p = ranperm(n, 2*k);         /* 2x more permutations than necessary */
U = UniqueRows(p);           /* get unique rows of 2k x n matrix */
if nrow(U) >= k then U = U[1:k, ];  /* extract the first k rows */
else do;
   U = UniqueRows( U // ranperm(n, 10*k) ); /* generate more... repeat as necessary */
   if nrow(U) >= k then U = U[1:k, ];  /* extract the first k rows */
end;
print U;

Notice in the previous statements that the call to RANPERM requests 2*k random permutations, even though we only want k. You should ask for more permutations than you need because some of them might be duplicates. The factor of 2 is ad hoc; it is not based on any statistical consideration.

If k is much smaller than n!, then you might think that the probability of generating a duplicate permutation is small. However, the Birthday Problem teaches us that duplicates arise much more frequently than we might expect, so it is best to expect some duplicates and generate more permutations than necessary. When k is moderately large relative to n!, you might want to use a factor of 5, 10, or even 100. I have not tried to compute the number of permutations that would generate k unique permutations with 95% confidence, but that would be a fun exercise. In my program, if the initial attempt generates fewer than k unique rows, it generates an additional 10*k permutations. You could repeat this process, if necessary.

In summary, if you want to generate random permutations without any duplicates, you have at least two choices. You can generate all permutations and then extract a random sample of k rows. This method works well for small values of n, such as n ≤ 8. For larger values of n, you might want to generate random permutations (more than k) and use the first k unique rows of the matrix of permutations.

The post Random permutations without duplicates appeared first on The DO Loop.

April 4, 2018
 

Correlation is a statistic that measures how closely two variables are related to each other. The most popular definition of correlation is the Pearson product-moment correlation, which is a measurement of the linear relationship between two variables. Many textbooks stress the linear nature of the Pearson correlation and emphasize that a zero value for the Pearson correlation does not imply that the variables are independent. A classic example is to define X to be a random variable that is distributed symmetrically on [-1, 1] and define Y = X². X and Y are clearly not independent, yet you can show that the Pearson correlation between X and Y is 0.

In 2007, G. Szekely, M. Rizzo, and N. Bakirov published a paper in The Annals of Statistics called "Measuring and Testing Dependence by Correlation of Distances." This paper defines a distance-based correlation that can detect nonlinear relationships between variables. This blog post describes the distance correlation and implements it in SAS.

An overview of distance correlation

It is impossible to adequately summarize a 27-page paper from the Annals in a few paragraphs, but I'll try. The Szekely-Rizzo-Bakirov paper defines the distance covariance between two random variables. It shows how to estimate the distance correlation from data samples. A practical implication is that you can estimate the distance correlation by computing two matrices: the matrix of pairwise distances between observations in a sample from X and the analogous distance matrix for observations from Y. If the elements in these matrices co-vary together, we say that X and Y have a large distance correlation. If they do not, they have a small distance correlation.

For motivation, recall that the Pearson covariance between X and Y (which is usually defined as the inner product of two centered vectors) can be written in terms of the raw observations:
Cov(X,Y) = (1/(2n²)) Σi Σj (xi – xj)(yi – yj)

The terms (xi – xj) and (yi – yj) can be thought of as the one-dimensional signed distances between the i_th and j_th observations. Szekely et al. replace those terms with centered Euclidean distances D(xi, xj) and define the distance covariance as follows:

dCov²(X,Y) = (1/n²) Σi Σj D(xi, xj) D(yi, yj)

The distance covariance between random vectors X and Y has the following properties:

  • X and Y are independent if and only if dCov(X,Y) = 0.
  • You can define the distance variance dVar(X) = dCov(X,X) and the distance correlation as dCor(X,Y) = dCov(X,Y) / sqrt( dVar(X) dVar(Y) ) when both variances are positive.
  • 0 ≤ dCor(X,Y) ≤ 1 for all X and Y. Note this is different from the Pearson correlation, for which negative correlation is possible.
  • dCov(X,Y) is defined for random variables in arbitrary dimensions! Because you can compute the distance between observations in any dimensions, you can compute the distance covariance regardless of dimensions. For example, X can be a sample from a 3-dimensional distribution and Y can be a sample from a 5-dimensional distribution.
  • You can use the distance correlation to define a statistical test for independence. I don't have space in this article to discuss this fact further.

Distance correlation in SAS

The following SAS/IML program defines two functions. The first "double centers" a distance matrix by subtracting the row and column marginals. The second is the DistCorr function, which computes the Szekely-Rizzo-Bakirov distance covariance, variances, and correlation for two samples that each have n rows. (Recall that X and Y can have more than one column.) The function returns a list of statistics. This list syntax is new to SAS/IML 14.3, so if you are running an older version of SAS, modify the function to return a vector.

proc iml;
start AdjustDist(A);    /* double centers matrix by subtracting row and column marginals */
   rowMean = A[:, ];
   colMean = rowMean`;  /* the same, by symmetry */
   grandMean = rowMean[:];
   A_Adj = A - rowMean - colMean + grandMean;
   return (A_Adj);
finish;
 
/* distance correlation: G. Szekely, M. Rizzo, and N. Bakirov, 2007, Annals of Statistics, 35(6) */
start DistCorr(x, y);
   DX = distance(x);    DY = distance(y); 
   DX = AdjustDist(DX); DY = AdjustDist(DY);
   V2XY = (DX # DY)[:];  /* mean of product of distances */
   V2X  = (DX # DX)[:];  /* mean of squared (adjusted) distance */
   V2Y  = (DY # DY)[:];  /* mean of squared (adjusted) distance */
 
   dCov = sqrt( V2XY );     /* distance covariance estimate */
   denom = sqrt(V2X * V2Y); /* product of std deviations */
   if denom > 0 then R2 = V2XY / denom;    /* R^2 = (dCor)^2 */
   else              R2 = 0;
   dCor = sqrt(R2);     
   T = nrow(DX)*V2XY;   /* test statistic p. 2783. Reject indep when T>=z */
 
   /* return list of results: */
   L = [#"dCov"=dCov,  #"dCor"=dCor,  #"dVarX"=sqrt(V2X), #"dVarY"=sqrt(V2Y), #"T"=T];
   /* or return a vector: L = dCov || dCor || sqrt(V2X) || sqrt(V2Y) || T; */
   return L;
finish;

Let's test the DistCorr function on two 4-element vectors. The following (X,Y) ordered pairs lie on the intersection of the unit circle and the coordinate axes. The Pearson correlation for these observations is 0 because there is no linear association. In contrast, the distance-based correlation is nonzero. The distance correlation detects a relationship between these points (namely, that they lie along the unit circle) and therefore the variables are not independent.

x = {1,0,-1, 0};
y = {0,1, 0,-1};
results = DistCorr(x, y);
/* get items from the results list */
dCov=results$"dCov";  dCor=results$"dCor";  dVarX=results$"dVarX"; dVarY=results$"dVarY";
/* or from vector: dCov=results[1];  dCor=results[2];  dVarX=results[3]; dVarY=results[4]; */
print dCov dCor dVarX dVarY;
Distance correlation for points on a circle

Examples of distance correlations

Let's look at a few other examples of distance correlations in simulated data. To make it easier to compare the Pearson correlation and the distance correlation, you can define two helper functions that return only those quantities.

Classic example: Y is a quadratic function of X

The first example is the classic one (mentioned in the first paragraph of this article) that shows that a Pearson correlation of 0 does not imply independence of variables. The vector X is distributed in [-1,1] and the vector Y is defined as X²:

/* helper functions */
start PearsonCorr(x,y);
   return( corr(x||y)[1,2] );
finish;
start DCorr(x,y);
   results = DistCorr(X, Y);
   return( results$"dCor" );  /* if DistCorr returns vector: return( results[2] ); */
finish;
 
x = do(-1, 1, 0.1)`;    /* X is in [-1, 1] */
y = x##2;               /* Y = X^2 */
PCor = PearsonCorr(x,y);
DCor = DCorr(x,y);
print PCor DCor;
Example where Pearson correlation equals 0 but distance correlation is nonzero

As promised, the Pearson correlation is zero but the distance correlation is nonzero. You can use the distance correlation as part of a formal hypothesis test to conclude that X and Y are not independent.
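
If you are curious what such a test might look like, the following statements are a minimal sketch of a simple permutation test, not the asymptotic test from the Szekely-Rizzo-Bakirov paper. The sketch reuses the DCorr module and the x and y vectors from the quadratic example; the number of permutations (B = 1000) is an arbitrary choice.

/* Sketch: permutation test of independence based on the distance correlation.
   Assumes the DistCorr and DCorr modules and the vectors x and y are defined. */
call randseed(4321);
obsDCor = DCorr(x, y);                 /* observed distance correlation */
B = 1000;                              /* number of permutations (arbitrary) */
permDCor = j(B, 1, .);
do i = 1 to B;
   perm = ranperm(nrow(y))`;           /* random permutation of the row indices */
   permDCor[i] = DCorr(x, y[perm, ]);  /* dCor after breaking the pairing */
end;
pValue = (permDCor >= obsDCor)[:];     /* proportion at least as extreme as observed */
print obsDCor pValue;

A small p-value indicates that the observed distance correlation is larger than you would expect to see if X and Y were independent.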

Distance correlation for multivariate normal data

You might wonder how the distance correlation compares with the Pearson correlation for bivariate normal data. Szekely et al. prove that for bivariate normal variables the distance correlation never exceeds the absolute value of the Pearson correlation parameter: dCor(X,Y) ≤ |ρ|. The following statements generate random samples from a bivariate normal distribution with Pearson correlation ρ for a grid of positive and negative ρ values.

N = 500;                   /* sample size */
mu = {0 0};  Sigma = I(2); /* parameters for bivariate normal distribution */
rho = do(-0.9, 0.9, 0.1);  /* grid of correlations */
PCor = j(1, ncol(rho), .); /* allocate vectors for results */
DCor = j(1, ncol(rho), .);
call randseed(54321);
do i = 1 to ncol(rho);     /* for each rho, simulate bivariate normal data */
   Sigma[1,2] = rho[i]; Sigma[2,1] = rho[i]; /* population covariance */
   Z = RandNormal(N, mu, Sigma);        /* bivariate normal sample */
   PCor[i] = PearsonCorr(Z[,1], Z[,2]); /* Pearson correlation */
   DCor[i] = DCorr(Z[,1], Z[,2]);       /* distance correlation */
end;

If you graph the Pearson and distance correlation against the parameter values, you obtain the following graph:

Graph of Pearson correlation and distance correlation for samples of bivariate normal data with correlation rho

You can see that the distance correlation is always positive. It is close to, but in many cases less than, the absolute value of the Pearson estimate. Nevertheless, it is comforting that the distance correlation is closely related to the Pearson correlation for correlated normal data.
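
If you want to create a similar plot yourself, the following statements are a minimal sketch: transpose the vectors into columns, write them to a SAS data set, and overlay two series plots. The data set name (CorrCompare), the variable names, and the plot options are arbitrary choices, not necessarily the code that produced the graph.

/* Sketch: write the estimates to a data set and plot them with PROC SGPLOT */
r = rho`;  PearsonEst = PCor`;  dCorEst = DCor`;   /* column vectors */
create CorrCompare var {"r" "PearsonEst" "dCorEst"};
append;
close CorrCompare;
 
submit;
   proc sgplot data=CorrCompare;
      series x=r y=PearsonEst / markers curvelabel="Pearson";
      series x=r y=dCorEst    / markers curvelabel="Distance";
      xaxis grid label="rho";
      yaxis grid label="Correlation estimate";
   run;
endsubmit;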

Distance correlation for data of different dimensions

As the last example, let's examine the most surprising property of the distance correlation, which is that it enables you to compute correlations between variables of different dimensions. In contrast, the Pearson correlation is defined only for univariate variables. The following statements generate two independent random normal samples with 1000 observations. The variable X is a bivariate normal sample. The variable Y is a univariate normal sample. The distance correlation for the sample is close to 0. (Because the samples were drawn independently, the distance correlation for the populations is zero.)

/* Correlation between a 2-D distribution and a 1-D distribution */
call randseed(12345, 1);          /* reset random number seed */
N = 1000;
mu = {0 0};
Sigma = { 1.0 -0.5, 
         -0.5  1.0};
X = RandNormal(N, mu, Sigma);     /* sample from 2-D distribution */
Y = randfun(N, "Normal", 0, 0.6); /* uncorrelated 1-D sample */
 
DCor = DCorr(X, Y);
print DCor;
Distance correlation for samples of different dimensions

Limitations of the distance correlation

The distance correlation is an intriguing idea. You can use it to test whether two variables (actually, sets of variables) are independent. As I see it, the biggest drawbacks of distance correlation are

  • The distance correlation is always positive because distances are always positive. Most analysts are used to seeing negative correlations when two variables demonstrate a negative linear relationship.
  • The distance correlation for a sample of size n must compute the n(n–1)/2 pairwise distances between observations. This implies that computing the distance correlation is an O(n^2) operation, as opposed to the Pearson correlation, which is a much faster O(n) operation. The implementation in this article explicitly forms the n x n distance matrix, which can become very large. For example, if n = 100,000 observations, each distance matrix requires more than 74 GB of RAM. There are ways to use less memory, but the distance correlation is still a relatively expensive computation.

Summary

In summary, this article discusses the Szekely-Rizzo-Bakirov distance-based covariance and correlation. The distance correlation can be used to create a statistical test of independence between two variables or sets of variables. The idea is interesting and appealing for small data sets. Unfortunately, the performance of the algorithm is quadratic in the number of observations, so the algorithm does not scale well to big data.

You can download the SAS/IML program that performs the computations and creates the graphs in this article. If you do not have SAS/IML, T. Billings (2016) wrote a SAS macro that uses PROC DISTANCE to compute the distance correlation between two vectors. Rizzo and Szekely implemented their method in the 'energy' package for R.

The post Distance correlation appeared first on The DO Loop.

3月 282018
 

Suppose you want to find observations in multivariate data that are closest to a numerical target value. For example, for the students in the Sashelp.Class data set, you might want to find the students whose (Age, Height, Weight) values are closest to the triplet (13, 62, 100). The way to do this is to compute a distance from each observation to the target. Unfortunately, there are many definitions of distance! Which distance should you use? This article describes and compares a few common distances and shows how to compute them in SAS.

Euclidean and other distances

The two most widely used distances are the Euclidean distance (called the L2 distance by mathematicians) and the "sum of absolute differences" distance, which is better known as the L1 distance or the taxicab (Manhattan) distance. In one dimension, these distances are equal. However, when you have multiple coordinates, the distances are different. The Euclidean and L1 distances between (x1, x2, ..., xn) and a target vector (t1, t2, ..., tn) are defined as follows:
L1 distance: |x1-t1| + |x2-t2| + ... + |xn-tn|
Euclidean distance: sqrt( (x1-t1)^2 + (x2-t2)^2 + ... + (xn-tn)^2 )

Both of these distances are supported in the SAS DATA step. You can use the EUCLID function to compute the Euclidean distance and the SUMABS function to compute the L1 distance. For example, the following DATA step computes the distance from each observation to the target value (Age, Height, Weight) = (13, 62, 100):

data Closest;
/* target (Age, Height, Weight) = (13, 62, 100) */
set Sashelp.Class;
EuclidDist = euclid(Age-13, Height-62, Weight-100);
L1Dist     = sumabs(Age-13, Height-62, Weight-100);
run;
 
/* sort by Euclidean distance */
proc sort data=Closest out=Euclid; by EuclidDist; run;
 
/* plot the distances for each observation */
title "Distance to Target Values";
proc sgplot data=Euclid;
   series x=Name y=EuclidDist / curvelabel="Euclidean";  *datalabel=Weight;
   series x=Name y=L1Dist     / curvelabel="L1";         *datalabel=Weight;
   yaxis grid label="Distance";
   xaxis grid discreteorder=data;
run;
Euclidean and L1 distances between observations and a target value

The graph shows the Euclidean and L1 distances from each student's data to the target value. The X axis is ordered by the Euclidean distance from each observation to the target value. If you add the DATALABEL=Weight or DATALABEL=Height options to the SERIES statements, you can see that the students who appear near the left side of the X axis have heights and weights that are close to the target values (Height, Weight) = (62, 100). The students who appear to the right are much taller/heavier or shorter/lighter than the target values. In particular, Joyce is the smallest student and Philip is the largest.

If you order the students by their L1 distances, you will obtain a different ordering, as shown in the sketch that follows. For example, in an L1 ranking, John and Henry would switch positions and Jeffrey would be ranked 7th instead of 12th.
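
A minimal sketch of that alternative ranking is to sort by the L1 distance and print the closest students (the output data set name L1Rank is an arbitrary choice):

/* Sketch: rank the students by the L1 distance instead of the Euclidean distance */
proc sort data=Closest out=L1Rank; by L1Dist; run;
 
proc print data=L1Rank(obs=8);
   var Name Age Height Weight L1Dist EuclidDist;
run;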

Distances that account for scale

There is a problem with the computations in the previous section: the variables are measured in different units, but the computations do not account for these differences. In particular, Age is an important factor in the distance computations because all student ages are within three years of the target age. In contrast, the heaviest student (Philip) is 50 pounds heavier than the target weight.

The distance formula needs to account for amount of variation within each variable. If you run PROC MEANS on the data, you will discover that the standard deviation of the Age variable is about 1.5, whereas the standard deviations for the Height and Weight variables are about 5.1 and 22.8, respectively. It follows that one year should be treated as a substantial difference, whereas one pound of weight should be treated as a small difference.
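
If you want to verify those values, a quick sketch is a call to PROC MEANS:

/* Sketch: examine the scale (standard deviation) of each variable */
proc means data=Sashelp.Class mean std min max;
   var Age Height Weight;
run;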

A way to correct for differences among the scales of the variables is to standardize them. In SAS, you can use PROC STDIZE to standardize the variables in a variety of ways. By default, the procedure centers each variable by subtracting the sample mean and scales by dividing by the sample standard deviation. The following call also writes the means and standard deviations to a data set, which is displayed by using PROC PRINT:

proc stdize data=Sashelp.Class out=StdClass outstat=StdIn method=STD;
   var Age Height Weight;
run;
 
proc print data=StdIn(obs=2);
run;

Notice that the target value is not part of the standardization! Only the data are used. However, you must convert the target value to the new coordinates before you can compute the standardized distances. You can standardize the target value by using the METHOD=IN option in PROC STDIZE, which tells the procedure to transform the target value by using the location and scale values in the StdIn data set, as follows:

data Target;
   Age=13; Height=62; Weight=100;
run;
 
proc stdize data=Target out=StdTarget method=in(StdIn);
   var Age Height Weight;
run;
 
proc print data=StdTarget; run;

The data and the target values are now in standardized coordinates. You can therefore repeat the earlier DATA step to compute the Euclidean and L1 distances in the new coordinates, as in the sketch that follows. The graph of the standardized distances is shown below:
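
Here is a minimal sketch of that computation. It reads the one-row StdTarget data set to obtain the standardized target values; the renamed variables tAge, tHeight, and tWeight and the data set name StdClosest are introduced only for this sketch.

/* Sketch: compute distances in standardized coordinates */
data StdClosest;
   if _N_ = 1 then
      set StdTarget(rename=(Age=tAge Height=tHeight Weight=tWeight));  /* target values */
   set StdClass;                                                       /* standardized data */
   EuclidDist = euclid(Age-tAge, Height-tHeight, Weight-tWeight);
   L1Dist     = sumabs(Age-tAge, Height-tHeight, Weight-tWeight);
run;
 
proc sort data=StdClosest out=StdEuclid; by EuclidDist; run;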

In the standardized coordinates "one unit" corresponds to one standard deviation from the mean in each variable. Thus Age is measured in units of 1.5 years, Height in units of 5.1 inches, and Weight in units of 22.8 pounds. Notice that some of the students have changed positions. Carol is still very similar to the target value, but Jeffrey is now more similar to the target value than previously thought. The smallest student (Joyce) and largest student (Philip) are still dissimilar (very distant) from the target.

Distances that account for correlation

In the previous section, different scales in the data are handled by using distances in a standardized coordinate system. This is an improvement over the unstandardized computations. However, there is one more commonly used distance computation, called the Mahalanobis distance. The Mahalanobis distance is similar to the standardized L2 distance but also accounts for correlations between the variables. I have previously discussed the meaning of Mahalanobis distance and have described how you can use the inverse Cholesky decomposition to uncorrelate variables. I won't repeat the arguments, but the following graph displays the Mahalanobis distance between each observation and the target value. Some students are in a different order than they were for the standardized distances:

If you would like to examine or modify any of my computations, you can download the SAS program that computes the various distances.
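
If you want a quick self-contained version of the Mahalanobis computation, the following SAS/IML statements are a minimal sketch that uses the sample covariance matrix of the data. The matrix and data set names are illustrative choices, not necessarily the code behind the graph.

/* Sketch: Mahalanobis distance from each observation to the target */
proc iml;
use Sashelp.Class;
read all var {Age Height Weight} into X[rowname=Name];
close Sashelp.Class;
 
target = {13 62 100};
S = cov(X);                                   /* sample covariance matrix */
D = X - repeat(target, nrow(X), 1);           /* deviations from the target */
MahalDist = sqrt( vecdiag(D * inv(S) * D`) ); /* Mahalanobis distances */
 
call sortndx(idx, MahalDist, 1);              /* rank observations by distance */
print (Name[idx])[label="Name"] (MahalDist[idx])[label="MahalDist" format=6.3];
quit;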

In summary, there are several reasonable definitions of distance between multivariate observations. The simplest is the raw Euclidean distance, which is appropriate when all variables are measured on the same scale. The next simplest is the standardized distance, which is appropriate if the scales of the variables are vastly different and the variables are uncorrelated. The third is the Mahalanobis distance, which becomes important if you want to measure distances in a way that accounts for correlation between variables.


The post Find the distance between observations and a target value appeared first on The DO Loop.