Statistical Programming

February 13, 2019
 

When you use maximum likelihood estimation (MLE) to find the parameter estimates in a generalized linear regression model, the Hessian matrix at the optimal solution is very important. The Hessian matrix indicates the local shape of the log-likelihood surface near the optimal value. You can use the Hessian to estimate the covariance matrix of the parameters, which in turn is used to obtain estimates of the standard errors of the parameter estimates. Sometimes SAS programmers ask how they can obtain the Hessian matrix at the optimal solution. This article describes three ways:

  • For some SAS regression procedures, you can store the model and use the SHOW HESSIAN statement in PROC PLM to display the Hessian.
  • Some regression procedures support the COVB option ("covariance of the betas") on the MODEL statement. You can compute the Hessian as the inverse of that covariance matrix.
  • The NLMIXED procedure can solve general regression problems by using MLE. You can use the HESS option on the PROC NLMIXED statement to display the Hessian.

The next section discusses the relationship between the Hessian and the estimate of the covariance of the regression parameters. Briefly, they are inverses of each other. You can download the complete SAS program for this blog post.

Hessians, covariance matrices, and log-likelihood functions

The Hessian at the optimal MLE value is related to the covariance of the parameters. The literature that discusses this fact can be confusing because the objective function in MLE can be defined in two ways. You can maximize the log-likelihood function, or you can minimize the NEGATIVE log-likelihood.

In statistics, the inverse of the Hessian matrix at the optimum is an estimate of the covariance matrix of the parameters, but the sign convention matters. A full-rank covariance matrix is always positive definite. If you maximize the log-likelihood, then the Hessian and its inverse are both negative definite. Therefore, statistical software often minimizes the negative log-likelihood function. Then the Hessian at the minimum is positive definite, and so is its inverse, which is an estimate of the covariance matrix of the parameters. Unfortunately, not every reference uses this convention.
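The following toy computation in PROC IML illustrates the sign convention. The 2 x 2 matrix is made up for illustration; it plays the role of the Hessian of the negative log-likelihood at the minimum:

proc iml;
/* H plays the role of the Hessian of the NEGATIVE log-likelihood at the minimum */
H = {2 0.5, 0.5 1};     /* hypothetical positive definite Hessian */
Cov = inv(H);           /* estimate of the covariance of the parameters */
Hmax = -H;              /* Hessian of the log-likelihood at the MAXIMUM */
Cov2 = inv(-Hmax);      /* negating recovers the same covariance estimate */
print Cov, Cov2;
quit;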

For details about the MLE process and how the Hessian at the solution relates to the covariance of the parameters, see the PROC GENMOD documentation. For a more theoretical treatment and some MLE examples, see the Iowa State course notes for Statistics 580.

Use PROC PLM to obtain the Hessian

I previously discussed how to use the STORE statement to save a generalized linear model to an item store, and how to use PROC PLM to display information about the model. Some procedures, such as PROC LOGISTIC, save the Hessian in the item store. For these procedures, you can use the SHOW HESSIAN statement to display the Hessian. The following call to PROC PLM continues the PROC LOGISTIC example from the previous post. (Download the example.) The call displays the Hessian matrix at the optimal value of the log-likelihood. It also saves the "covariance of the betas" matrix in a SAS data set, which is used in the next section.

/* PROC PLM provides the Hessian matrix evaluated at the optimal MLE */
proc plm restore=PainModel;
   show Hessian CovB;
   ods output Cov=CovB;
run;

Not every SAS procedure stores the Hessian matrix when you use the STORE statement. If you request a statistic from PROC PLM that is not available, you will get a message such as the following:

NOTE: The item store WORK.MYMODEL does not contain a Hessian matrix. The option in the SHOW statement is ignored.

Use the COVB option in a regression procedure

Many SAS regression procedures support the COVB option on the MODEL statement. As indicated in the previous section, you can use the SHOW COVB statement in PROC PLM to display the covariance matrix. A full-rank covariance matrix is positive definite, so the inverse matrix will also be positive definite. Therefore, the inverse matrix represents the Hessian at the minimum of the NEGATIVE log-likelihood function. The following SAS/IML program reads in the covariance matrix and uses the INV function to compute the Hessian matrix for the logistic regression model:

proc iml;
use CovB nobs p;                         /* open data; read number of obs (p) */
   cols = "Col1":("Col"+strip(char(p))); /* variable names are Col1 - Colp */
   read all var cols into Cov;           /* read COVB matrix */
   read all var "Parameter";             /* read names of parameters */
close;
 
Hessian = inv(Cov);                      /* Hessian and covariance matrices are inverses */
print Hessian[r=Parameter c=Cols F=BestD8.4];
quit;

You can see that the inverse of the COVB matrix is the same matrix that was displayed by using SHOW HESSIAN in PROC PLM. Be aware that the parameter estimates and the covariance matrix depend on the parameterization of the classification variables. The LOGISTIC procedure uses the EFFECT parameterization by default. However, if you instead use the REFERENCE parameterization, you will get different results. If you use a singular parameterization, such as the GLM parameterization, some rows and columns of the covariance matrix will contain missing values.
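For example, the following sketch refits the model with the REFERENCE parameterization and requests the covariance matrix. (This refits the model from scratch rather than using the item store; the PARAM=REF and COVB options are standard PROC LOGISTIC syntax.)

/* refit with reference coding; the COVB matrix (and hence its inverse) will differ */
proc logistic data=Neuralgia;
   class Sex Treatment / param=ref;
   model Pain(Event='Yes') = Sex Age Duration Treatment / covb;
run;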

Define your own log-likelihood function

SAS provides procedures for solving common generalized linear regression models, but you might need to use MLE to solve a nonlinear regression model. You can use the NLMIXED procedure to define and solve general maximum likelihood problems. The PROC NLMIXED statement supports the HESS and COV options, which display the Hessian and covariance of the parameters, respectively.

To illustrate how you can get the covariance and Hessian matrices from PROC NLMIXED, let's define a logistic model and see if we get results that are similar to PROC LOGISTIC. We shouldn't expect to get exactly the same values unless we use exactly the same optimization method, convergence options, and initial guesses for the parameters. But if the model fits the data well, we expect that the NLMIXED solution will be close to the LOGISTIC solution.

The NLMIXED procedure does not support a CLASS statement, but you can use another SAS procedure to generate the design matrix for the desired parameterization. The following program uses the OUTDESIGN= option in PROC LOGISTIC to generate the design matrix. Because PROC NLMIXED requires a numeric response variable, a simple DATA step encodes the response variable into a binary numeric variable. The call to PROC NLMIXED then defines the logistic regression model in terms of a binary log-likelihood function:

/* output design matrix and EFFECT parameterization */
proc logistic data=Neuralgia outdesign=Design outdesignonly;
   class Pain Sex Treatment;
   model Pain(Event='Yes')= Sex Age Duration Treatment;  /* OUTDESIGNONLY suppresses the model fit */
run;
/* PROC NLMIXED requires a numeric response */
data Design;
   set Design;
   PainY = (Pain='Yes');  
run;
 
ods exclude IterHistory;
proc nlmixed data=Design COV HESS;
   parms b0 -18 bSexF bAge bDuration bTreatmentA bTreatmentB 0;
   eta    = b0 + bSexF*SexF + bAge*Age + bDuration*Duration +
                 bTreatmentA*TreatmentA + bTreatmentB*TreatmentB;
   p = logistic(eta);       /* or 1-p to predict the other category */
   model PainY ~ binary(p);
run;

Success! The parameter estimates and the Hessian matrix are very close to those that are computed by PROC LOGISTIC. The covariance matrix of the parameters, which requires taking an inverse of the Hessian matrix, is also close, although there are small differences from the LOGISTIC output.

Summary

In summary, this article shows three ways to obtain the Hessian matrix at the optimum for an MLE estimate of a regression model. For some SAS procedures, you can store the model and use PROC PLM to obtain the Hessian. For procedures that support the COVB option, you can use PROC IML to invert the covariance matrix. Finally, if you can define the log-likelihood equation, you can use PROC NLMIXED to solve for the regression estimates and output the Hessian at the MLE solution.

The post 3 ways to obtain the Hessian at the MLE solution for a regression model appeared first on The DO Loop.

February 11, 2019
 

Have you ever run a regression model in SAS, only to realize later that you forgot to specify an important option or to run some statistical test? Or maybe you intended to generate a graph that visualizes the model, but you forgot? Years ago, your only option was to modify your program and rerun it. Current versions of SAS support a less painful alternative: you can use the STORE statement in many SAS/STAT procedures to save the model to an item store. You can then use the PLM procedure to perform many post-modeling analyses, including performing hypothesis tests, showing additional statistics, visualizing the model, and scoring the model on new data. This article shows four ways to use PROC PLM to obtain results from your regression model.

What is PROC PLM?

PROC PLM enables you to analyze a generalized linear model (or a generalized linear mixed model) long after you quit the SAS/STAT procedure that fits the model. PROC PLM was released with SAS 9.22 in 2010. This article emphasizes four features of PROC PLM:

  • You can use the SCORE statement to score the model on new data.
  • You can use the EFFECTPLOT statement to visualize the model.
  • You can use the ESTIMATE, LSMEANS, SLICE, and TEST statements to estimate parameters and perform hypothesis tests.
  • You can use the SHOW statement to display statistical tables such as parameter estimates and fit statistics.

For an introduction to PROC PLM, see "Introducing PROC PLM and Postfitting Analysis for Very General Linear Models" (Tobias and Cai, 2010). The documentation for the PLM procedure includes more information and examples.

To use PROC PLM you must first use the STORE statement in a regression procedure to create an item store that summarizes the model. The following procedures support the STORE statement: GEE, GENMOD, GLIMMIX, GLM, GLMSELECT, LIFEREG, LOGISTIC, MIXED, ORTHOREG, PHREG, PROBIT, SURVEYLOGISTIC, SURVEYPHREG, and SURVEYREG.

The example in this article uses PROC LOGISTIC to analyze data about pain management in elderly patients who have neuralgia. In the PROC LOGISTIC documentation, PROC LOGISTIC fits the model and performs all the post-fitting analyses and visualization. In the following program, PROC LOGISTIC fits the model and stores it to an item store named PainModel. In practice, you might want to store the model to a permanent libref (rather than WORK) so that you can access the model days or weeks later.

Data Neuralgia;
   input Treatment $ Sex $ Age Duration Pain $ @@;
   datalines;
P F 68  1 No  B M 74 16 No  P F 67 30 No  P M 66 26 Yes B F 67 28 No  B F 77 16 No
A F 71 12 No  B F 72 50 No  B F 76  9 Yes A M 71 17 Yes A F 63 27 No  A F 69 18 Yes
B F 66 12 No  A M 62 42 No  P F 64  1 Yes A F 64 17 No  P M 74  4 No  A F 72 25 No
P M 70  1 Yes B M 66 19 No  B M 59 29 No  A F 64 30 No  A M 70 28 No  A M 69  1 No
B F 78  1 No  P M 83  1 Yes B F 69 42 No  B M 75 30 Yes P M 77 29 Yes P F 79 20 Yes
A M 70 12 No  A F 69 12 No  B F 65 14 No  B M 70  1 No  B M 67 23 No  A M 76 25 Yes
P M 78 12 Yes B M 77  1 Yes B F 69 24 No  P M 66  4 Yes P F 65 29 No  P M 60 26 Yes
A M 78 15 Yes B M 75 21 Yes A F 67 11 No  P F 72 27 No  P F 70 13 Yes A M 75  6 Yes
B F 65  7 No  P F 68 27 Yes P M 68 11 Yes P M 67 17 Yes B M 70 22 No  A M 65 15 No
P F 67  1 Yes A M 67 10 No  P F 72 11 Yes A F 74  1 No  B M 80 21 Yes A F 69  3 No
;
 
title 'Logistic Model on Neuralgia';
proc logistic data=Neuralgia;
   class Sex Treatment;
   model Pain(Event='Yes')= Sex Age Duration Treatment;
   store PainModel / label='Neuralgia Study';  /* or use mylib.PainModel for permanent storage */
run;

The LOGISTIC procedure models the presence of pain based on a patient's medication (Drug A, Drug B, or placebo), gender, age, and duration of pain. After you fit the model and store it, you can use PROC PLM to perform all sorts of additional analyses, as shown in the subsequent sections.

Use PROC PLM to score new data

An important application of regression models is to predict the response variable for new data. The following DATA step defines three new patients. The first two are females who are taking Drug B. The third is a male who is taking Drug A:

/* 1. Use PLM to score future obs */
data NewPatients;
   input Treatment $ Sex $ Age Duration;
   datalines;
B F 63  5 
B F 79 16 
A M 74 12 
;
 
proc plm restore=PainModel;
   score data=NewPatients out=NewScore predicted LCLM UCLM / ilink; /* ILINK gives probabilities */
run;
 
proc print data=NewScore;
run;

The output shows the predicted pain level for the three patients. The younger woman is predicted to have a low probability (0.01) of pain. The model predicts a moderate probability of pain (0.38) for the older woman. The model predicts a 64% chance that the man will experience pain.

Notice that the PROC PLM statement does not use the original data. In fact, the procedure does not support a DATA= option but instead uses the RESTORE= option to read the item store. The PLM procedure cannot create plots or perform calculations that require the data because the data are not part of the item store.

Use PROC PLM to visualize the model

I've previously written about how to use the EFFECTPLOT statement to visualize regression models. The EFFECTPLOT statement has many options. However, because PROC PLM does not have access to the original data, the EFFECTPLOT statement in PROC PLM cannot add observations to the graphs.

Although the EFFECTPLOT statement is supported natively in the LOGISTIC and GENMOD procedures, it is not directly supported in other procedures such as GLM, MIXED, GLIMMIX, PHREG, or the SURVEY procedures. Nevertheless, because these procedures support the STORE statement, you can use the EFFECTPLOT statement in PROC PLM to visualize the models for these procedures. The following statement uses the EFFECTPLOT statement to visualize the probability of pain for female and male patients who are taking each drug treatment:

/* 2. Use PROC PLM to create an effect plot */
proc plm restore=PainModel;
   effectplot slicefit(x=Age sliceby=Treatment plotby=Sex);
run;

The graphs summarize the model. For both men and women, the probability of pain increases with age. At a given age, the probability of pain is lower for the non-placebo treatments, and the probability is slightly lower for the patients who use Drug B as compared to Drug A. These plots are shown at the mean value of the Duration variable.
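By default, the plots are displayed at the mean of the Duration variable. If you want to visualize the model at a different value, the EFFECTPLOT statement supports an AT option. The following sketch (the value 30 is arbitrary) fixes Duration at 30:

proc plm restore=PainModel;
   effectplot slicefit(x=Age sliceby=Treatment plotby=Sex) / at(Duration=30);
run;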

Use PROC PLM to compute contrasts and other estimates

One of the main purposes of PROC PLM is to perform postfit estimates and hypothesis tests. The simplest is a pairwise comparison that estimates the difference between two levels of a classification variable. For example, in the previous graph the probability curves for the Drug A and Drug B patients are close to each other. Is there a significant difference between the two effects? The following ESTIMATE statement estimates the (B vs A) effect. The EXP option exponentiates the estimate so that you can interpret the 'Exponentiated' column as the odds ratio between the drug treatments. The CL option adds confidence limits for the estimate of the odds ratio. The confidence interval for the odds ratio contains 1, so you cannot conclude that Drug B is significantly more effective than Drug A at reducing pain.

/* 3. Use PROC PLM to create contrasts and estimates */
proc plm restore=PainModel;
   /* 'Exponentiated' column is odds ratio between treatments */
   estimate 'Pairwise B vs A' Treatment 1 -1 / exp CL;
run;
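The ESTIMATE statement requires that you specify the coefficients of the contrast. For simple pairwise comparisons, the LSMEANS statement can be more convenient. The following sketch (assuming the ODDSRATIO option is supported for this logit-link model, as it is for the GLIMMIX-style LSMEANS syntax) requests all pairwise differences between the Treatment levels:

proc plm restore=PainModel;
   lsmeans Treatment / diff oddsratio cl;  /* all pairwise comparisons as odds ratios */
run;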

Use PROC PLM to display statistics from the analysis

One of the more useful features of PROC PLM is that you can use the SHOW statement to display tables of statistics from the original analysis. If you want to see the ParameterEstimates table again, you can do that (SHOW PARAMETERS). You can even display statistics that you did not compute originally, such as an estimate of the covariance of the parameters (SHOW COVB). Lastly, if you have the item store but have forgotten what program you used to generate the model, you can display the program (SHOW PROGRAM). The following statements demonstrate the SHOW statement. The results are not shown.

/* 4. Use PROC PLM to show statistics or the original program */
proc plm restore=PainModel;
   show Parameters COVB Program;
run;

Summary

In summary, the STORE statement in many SAS/STAT procedures enables you to store various regression models into an item store. You can use PROC PLM to perform additional postfit analyses on the model, including scoring new data, visualizing the model, hypothesis testing, and (re)displaying additional statistics. This technique is especially useful for long-running models, but it is also useful for confidential data because the data are not needed for the postfit analyses.

The post 4 reasons to use PROC PLM for linear regression models in SAS appeared first on The DO Loop.

January 28, 2019
 

This article shows how to use SAS to simulate data that fits a linear regression model that has categorical regressors (also called explanatory or CLASS variables). Simulating data is a useful skill for both researchers and statistical programmers. You can use simulation for answering research questions, but you can also use it to generate "fake" (or "synthetic") data for a presentation or blog post when the real data are confidential. I have previously shown how to use SAS to simulate data for a linear model and for a logistic model, but both articles used only continuous regressors in the model.

This discussion and SAS program are based on Chapter 11 in Simulating Data with SAS (Wicklin, 2013, p. 208-209). The main steps in the simulation are as follows. The comments in the SAS program indicate each step:

  1. Macro variables are used to define the number of continuous and categorical regressors. Another macro variable is used to specify the number of levels for each categorical variable. This program simulates eight continuous regressors (x1-x8) and four categorical regressors (c1-c4). Each categorical regressor in this simulation has three levels.
  2. The continuous regressors are simulated from a normal distribution, but you can use any distribution you want. The categorical levels are 1, 2, and 3, which are generated uniformly at random by using the "Integer" distribution. The discrete "Integer" distribution was introduced in SAS 9.4M5; for older versions of SAS, use the %RandBetween macro as shown in the article "How to generate random integers in SAS." You can also generate the levels non-uniformly by using the "Table" distribution, as shown in the short example after this list.
  3. The response variable, Y, is defined as a linear combination of some explanatory variables. In this simulation, the response depends linearly on the x1 and x8 continuous variables and on the levels of the C1 and C4 categorical variables. Noise is added to the model by using a normally distributed error term.
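As an aside to step 2, the following minimal sketch (with arbitrary probabilities) shows how the "Table" distribution generates levels with unequal probabilities:

/* sketch: non-uniform levels by using the "Table" distribution */
data TableDemo;
   call streaminit(54321);
   do i = 1 to 1000;
      level = rand("Table", 0.5, 0.3, 0.2);  /* P(1)=0.5, P(2)=0.3, P(3)=0.2 */
      output;
   end;
   drop i;
run;

proc freq data=TableDemo;
   tables level / nocum;
run;

The following program implements the three steps of the simulation: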
/* Program based on Simulating Data with SAS, Chapter 11 (Wicklin, 2013, p. 208-209) */
%let N = 10000;                  /* 1a. number of observations in simulated data */
%let numCont =  8;               /* number of continuous explanatory variables */
%let numClass = 4;               /* number of categorical explanatory variables */
%let numLevels = 3;              /* (optional) number of levels for each categorical variable */
 
data SimGLM; 
  call streaminit(12345); 
  /* 1b. Use macros to define arrays of variables */
  array x[&numCont]  x1-x&numCont;   /* continuous variables named x1, x2, x3, ... */
  array c[&numClass] c1-c&numClass;  /* CLASS variables named c1, c2, ... */
  /* the following statement initializes an array that contains the number of levels
     for each CLASS variable. You can hard-code different values such as (2, 3, 3, 2, 5) */
  array numLevels[&numClass] _temporary_ (&numClass * &numLevels);  
 
  do k = 1 to &N;                 /* for each observation ... */
    /* 2. Simulate value for each explanatory variable */ 
    do i = 1 to &numCont;         /* simulate independent continuous variables */
       x[i] = round(rand("Normal"), 0.001);
    end; 
    do i = 1 to &numClass;        /* simulate values 1, 2, ..., &numLevels with equal prob */
       c[i] = rand("Integer", numLevels[i]);        /* the "Integer" distribution requires SAS 9.4M5 */
    end; 
 
    /* 3. Simulate response as a function of certain explanatory variables */
    y = 4 - 3*x[1] - 2*x[&numCont] +                /* define coefficients for continuous effects */
       -3*(c[1]=1) - 4*(c[1]=2) + 5*c[&numClass]    /* define coefficients for categorical effects */
        + rand("Normal", 0, 3);                     /* normal error term */
    output;  
  end;  
  drop i k;
run;     
 
proc glm data=SimGLM;
   class c1-c&numClass;
   model y = x1-x&numCont c1-c&numClass / SS3 solution;
   ods select ModelANOVA ParameterEstimates;
quit;
[Figure: Parameter estimates for synthetic (simulated) data that follows a regression model]

The ModelANOVA table from PROC GLM (not shown) displays the Type 3 sums of squares and indicates that the significant terms in the model are x1, x8, c1, and c4.

The parameter estimates from PROC GLM are shown in the accompanying figure. You can see that each categorical variable has three levels, and you can use PROC FREQ to verify that the levels are uniformly distributed. I have highlighted parameter estimates for the significant effects in the model:

  • The estimates for the significant continuous effects are highlighted in red. The estimates for the coefficients of x1 and x8 are about -3 and -2, respectively, which are the parameter values that are specified in the simulation.
  • The estimates for the levels of the C1 CLASS variable are highlighted in green. They are close to (-3, -4, 0), which are the parameter values that are specified in the simulation.
  • The estimates for the Intercept and the C4 CLASS variable are highlighted in magenta. Notice that they seem to differ from the parameters in the simulation. As discussed previously, the simulation and PROC GLM use different parameterizations of the C4 effect. The simulation assigns Intercept = 4. The contribution of the first level of C4 is 5*1, the contribution for the second level is 5*2, and the contribution for the third level is 5*3. As explained in the previous article, the GLM parameterization reparameterizes the C4 effect as (4 + 15) + (5 - 15)*(C4=1) + (10 - 15)*(C4=2). The estimates are very close to these parameter values.

Although this program simulates a linear regression model, you can modify the program and simulate from a generalized linear model such as the logistic model. You just need to compute the linear predictor, eta (η), and then use the link function and the RAND function to generate the response variable, as shown in a previous article about how to simulate data from a logistic model.
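For example, the following minimal sketch simulates a binary response from a logistic model that has one continuous and one categorical regressor. The coefficients are hypothetical:

%let N = 1000;
data SimLogistic;
   call streaminit(12345);
   do k = 1 to &N;
      x = round(rand("Normal"), 0.001);       /* continuous regressor */
      c = rand("Integer", 3);                 /* categorical regressor with 3 levels */
      eta = -2 + 0.5*x + 1*(c=1) - 1*(c=2);   /* linear predictor; coefficients are hypothetical */
      mu = logistic(eta);                     /* apply the inverse link function */
      y = rand("Bernoulli", mu);              /* binary response */
      output;
   end;
   drop k;
run;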

In summary, this article shows how to simulate data for a linear regression model in the SAS DATA step when the model includes both categorical and continuous regressors. The program simulates arbitrarily many continuous and categorical variables. You can define a response variable in terms of the explanatory variables and their interactions.

The post Simulate data for a regression model with categorical and continuous variables appeared first on The DO Loop.

January 23, 2019
 

Recently I was asked to explain the result of an ANOVA analysis that I posted to a statistical discussion forum. My program included some simulated data for an ANOVA model and a call to the GLM procedure to estimate the parameters. I was asked why the parameter estimates from PROC GLM did not match the parameter values that were specified in the simulation. The answer is that there are many ways to parameterize the categorical effects in a regression model. SAS regression procedures support many different parameterizations, and each parameterization leads to a different set of parameter estimates for the categorical effects. The GLM procedure uses the so-called GLM-parameterization of classification effects, which sets to zero the coefficient of the last level of a categorical variable. If your simulation specifies a non-zero value for that coefficient, the parameters that PROC GLM estimates are different from the parameters in the simulation.

An example makes this clearer. The following SAS DATA step simulates 300 observations for a categorical variable C with levels 'C1', 'C2', and 'C3' in equal proportions. The simulation creates a response variable, Y, based on the levels of the variable C. The GLM procedure estimates the parameters from the simulated data:

data Have;
call streaminit(1);
do i = 1 to 100;
   do C = 'C1', 'C2', 'C3';
      eps = rand("Normal", 0, 0.2);
      /* In simulation, parameters are Intercept=10, C1=8, C2=6, and C3=1 
         This is NOT the GLM parameterization. */
      Y = 10 + 8*(C='C1') + 6*(C='C2') + 1*(C='C3') + eps;  /* C='C1' is a 0/1 binary variable */
      output;
   end;
end;
keep C Y;
run;
 
proc glm data=Have plots=none;
   class C;
   model Y = C / SS3 solution;
   ods select ParameterEstimates;
quit;

The output from PROC GLM shows that the parameter estimates are very close to the following values: Intercept=11, C1=7, C2=5, and C3=0. Although these are not the parameter values that were specified in the simulation, these estimates make sense if you remember the following:

  • The GLM parameterization sets to zero the coefficient of the last level of a categorical variable. Therefore, the estimate for 'C3' is 0.
  • The effect of the last level (1, in this simulation) is absorbed into the intercept: 10 + 1 = 11.
  • The coefficients of the other levels are measured relative to the last level: C1 = 8 - 1 = 7 and C2 = 6 - 1 = 5.

In other words, you can use the parameter values in the simulation to convert to the corresponding parameters for the GLM parameterization. In the following DATA step, the Y and Y2 variables contain exactly the same values, even though the formulas look different. The Y2 variable is simulated by using a GLM parameterization of the C variable:

data Have;
call streaminit(1);
refEffect = 1;
do i = 1 to 100;
   do C = 'C1', 'C2', 'C3';
      eps = rand("Normal", 0, 0.2);
      /* In simulation, parameters are Intercept=10, C1=8, C2=6, and C3=1 */
      Y  = 10 + 8*(C='C1') + 6*(C='C2') + 1*(C='C3') + eps;
      /* GLM parameterization for the same response: Intercept=11, C1=7, C2=5, C3=0 */
      Y2 = (10 + refEffect) + (8-refEffect)*(C='C1') + (6-refEffect)*(C='C2') + eps;
      diff = Y - Y2;         /* Diff = 0 when Y=Y2 */
      output;
   end;
end;
keep C Y Y2 diff;
run;
 
proc means data=Have;
   var Y Y2 diff;
run;

The output from PROC MEANS shows that the Y and Y2 variables are exactly equal. The coefficients for the Y2 variable are 11 (the intercept), 7, 5, and 0, which are the parameter values that are estimated by PROC GLM.

Of course, other parameterizations are possible. For example, you can create the simulation by using other parameterizations such as the EFFECT coding. (The EFFECT coding is the default coding in PROC LOGISTIC.) For the effect coding, parameter estimates of main effects indicate the difference of each level as compared to the average effect over all levels. The following statements show the effect coding for the variable Y3. The values of the Y3 variable are exactly the same as Y and Y2:

avgEffect = 5;   /* average effect for C is (8 + 6 + 1)/3 = 15/3 = 5 */
...
      /* EFFECT parameterization: Intercept=15, C1=3, C2=1; the C3 effect is -(3+1) = -4
         because the effect-coded estimates sum to zero */
      Y3 = (10 + avgEffect) + (8-avgEffect)*((C='C1') - (C='C3'))
                            + (6-avgEffect)*((C='C2') - (C='C3')) + eps;

In summary, when you write a simulation that includes categorical data, there are many equivalent ways to parameterize the categorical effects. When you use a regression procedure to analyze the simulated data, the procedure and simulation might use different parameterizations. If so, the estimates from the procedure might be quite different from the parameters in your simulation. This article demonstrates this fact by using the GLM parameterization and the EFFECT parameterization, which are two commonly used parameterizations in SAS. See the SAS/STAT documentation for additional details about the different parameterizations of classification variables in SAS.

The post Coding and simulating categorical variables in regression models appeared first on The DO Loop.

January 21, 2019
 

In machine learning and other model building techniques, it is common to partition a large data set into three segments: training, validation, and testing.

  • Training data is used to fit each model.
  • Validation data is a random sample that is used for model selection. These data are used to choose among candidate models by balancing the tradeoff between model complexity (complex models fit the training data well) and generality (overly complex models might not fit the validation data well). These data are potentially used several times during the model-building process.
  • Test data is a hold-out sample that is used to assess the final model and estimate its prediction error. It is used only at the end of the model-building process.

I've seen many questions about how to use SAS to split data into training, validation, and testing data. (A common variation uses only training and validation.) There are basically two approaches to partitioning data:

  • Specify the proportion of observations that you want in each role. For each observation, randomly assign it to one of the three roles. The number of observations assigned to each role is then a multinomial random variable with expected value N p_k, where N is the number of observations and p_k (k = 1, 2, 3) is the probability of assigning an observation to the k_th role. For this method, if you change the random number seed, you will usually get a different number of observations in each role because the group sizes are random variables.
  • Specify the number of observations that you want in each role and randomly allocate that many observations.

This article uses the SAS DATA step to accomplish the first task and uses PROC SURVEYSELECT to accomplish the second. I also discuss how to split data into only two roles: training and validation.

It is worth mentioning that many model-selection routines in SAS enable you to split data by using the PARTITION statement. Examples include the "SELECT" procedures (GLMSELECT, QUANTSELECT, HPGENSELECT...) and the ADAPTIVEREG procedure. However, be aware that the procedures might ignore observations that have missing values for the variables in the model.
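For example, the following sketch (the model and the selection method are chosen only for illustration) uses the PARTITION statement in PROC GLMSELECT to reserve 30% of the observations for validation and 10% for testing:

proc glmselect data=Sashelp.Heart plots=none seed=12345;
   class BP_Status;
   model Cholesterol = Weight Height BP_Status / selection=stepwise;
   partition fraction(validate=0.3 test=0.1);  /* remaining 60% are used for training */
run;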

Random partition into training, validation, and testing data

When you partition data into various roles, you can choose to add an indicator variable, or you can physically create three separate data sets. The following DATA step creates an indicator variable with values "Train", "Validate", and "Test". The specified proportions are 60% training, 30% validation, and 10% testing. You can change the values of the SAS macro variables to use your own proportions. The RAND("Table") function is an efficient way to generate the indicator variable.

data Have;             /* the data to partition  */
   set Sashelp.Heart;  /* for example, use Heart data */
run;
 
/* If propTrain + propValid = 1, then no observation is assigned to testing */
%let propTrain = 0.6;         /* proportion of training data */
%let propValid = 0.3;         /* proportion of validation data */
%let propTest = %sysevalf(1 - &propTrain - &propValid); /* remaining are used for testing */
 
/* Randomly assign each observation to a role; _ROLE_ is indicator variable */
data RandOut;
   array p[2] _temporary_ (&propTrain, &propValid);
   array labels[3] $ _temporary_ ("Train", "Validate", "Test");
   set Have;
   call streaminit(123);         /* set random number seed */
   /* RAND("table") returns 1, 2, or 3 with specified probabilities */
   _k = rand("Table", of p[*]); 
   _ROLE_ = labels[_k];          /* use _ROLE_ = _k if you prefer numerical categories */
   drop _k;
run;
 
proc freq data=RandOut order=freq;
   tables _ROLE_ / nocum;
run;

As shown by the output of PROC FREQ, the proportion of observations in each role is approximately the same as the specified proportions. For this random number seed, the proportions are 59.1%, 30.4%, and 10.6%. If you change the random number seed, you will get a different assignment of observations to roles and also a different proportion of data in each role.

The observant reader will notice that there are only two elements in the array of probabilities (p) that is used in the RAND("Table") call. This is intentional. The documentation for the RAND("Table") function states that if the sum of the specified probabilities is less than 1, then the quantity 1 – Σ pi is used as the probability of the last event. By specifying only two values in the p array, the same program works for partitioning the data into two pieces (training and validation) or three pieces (and testing).

Create random training, validation, and testing data sets

Some practitioners choose to create three separate data sets instead of adding an indicator variable to the existing data. The computation is exactly the same, but you can use the OUTPUT statement to direct each observation to one of three output data sets, as follows:

/* create a separate data set for each role */
data Train Validate Test;
array p[2] _temporary_ (&propTrain, &propValid);
set Have;
call streaminit(123);         /* set random number seed */
/* RAND("table") returns 1, 2, or 3 with specified probabilities */
_k = rand("Table", of p[*]);
if      _k = 1 then output Train;
else if _k = 2 then output Validate;
else                output Test;
drop _k;
run;
NOTE: The data set WORK.TRAIN has 3078 observations and 17 variables.
NOTE: The data set WORK.VALIDATE has 1581 observations and 17 variables.
NOTE: The data set WORK.TEST has 550 observations and 17 variables.

This example uses the same random number seed as the previous example. Consequently, the three output data sets have the same observations as are indicated by the partition variable (_ROLE_) in the previous example.

Specify the number of observations in each role

Instead of specifying a proportion, you might want to specify the exact number of observations that are randomly assigned to each role. The advantage of this technique is that changing the random number seed does not change the number of observations in each role (although it does change which observations are assigned to each role). The SURVEYSELECT procedure supports the GROUPS= option, which you can use to specify the number of observations.

The GROUPS= option requires that you specify integer values. For this example, the original data contains 5209 observations but 60% of 5209 is not an integer. Therefore, the following DATA step computes the number of observations as ROUND(N p) for the training and validation sets. These integer values are put into macro variables and used in the GROUPS= option on the PROC SURVEYSELECT statement. You can, of course, skip the DATA step and specify your own values such as groups=(3200, 1600, 409).

/* Specify the sizes of the train/validation/test data from proportions */
data _null_;
   if 0 then set sashelp.heart nobs=N;  /* N = total number of obs */
   nTrain = round(N * &propTrain);      /* size of training data */
   nValid = round(N * &propValid);      /* size of validation data */
   call symputx("nTrain", nTrain);      /* put integer into macro variable */
   call symputx("nValid", nValid);
   call symputx("nTest", N - nTrain - nValid);
run;
 
/* randomly assign observations to three groups */
proc surveyselect data=Have seed=12345 out=SSOut
     groups=(&nTrain, &nValid, &nTest); /* if no Test data, use  GROUPS=(&nTrain, &nValid) */
run;
 
proc freq data=SSOut order=freq;
   tables GroupID / nocum;           /* GroupID is name of indicator variable */
run;

The training, validation, and testing groups contain 3125, 1563, and 521 observations, respectively. These numbers are the closest integer approximations to 60%, 30% and 10% of the 5209 observations. Notice that the output from the SURVEYSELECT procedure uses the values 1, 2, and 3 for the GroupID indicator variable. You can use PROC FORMAT to associate those numbers with labels such as "Train", "Validate", and "Test".
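A minimal sketch of the PROC FORMAT approach follows:

proc format;
   value RoleFmt 1='Train'  2='Validate'  3='Test';
run;

proc freq data=SSOut order=freq;
   tables GroupID / nocum;
   format GroupID RoleFmt.;   /* display labels instead of 1, 2, 3 */
run;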

In summary, there are two basic programming techniques for randomly partitioning data into training, validation, and testing roles. One way uses the SAS DATA step to randomly assign each observation to a role according to proportions that you specify. If you use this technique, the size of each group is random. The other way is to use PROC SURVEYSELECT to randomly assign observations to roles. If you use this technique, you must specify the number of observations in each group.

The post Create training, validation, and test data sets in SAS appeared first on The DO Loop.

January 16, 2019
 

A quantile-quantile plot (Q-Q plot) is a graphical tool that compares a data distribution and a specified probability distribution. If the points in a Q-Q plot appear to fall on a straight line, that is evidence that the data can be approximately modeled by the target distribution. Although it is not necessary, some data analysts like to overlay a reference line to help "guide their eyes" as to whether the values in the plot fall on a straight line. This article describes three ways to overlay a reference line on a Q-Q plot. The first two lines are useful during the exploratory phase of data analysis; the third line visually represents the estimates of the location and scale parameters in the fitted model distribution. The three lines are:

  • A line that connects the 25th and 75th percentiles of the data and reference distributions
  • A least squares regression line
  • A line whose intercept and slope are determined by maximum likelihood estimates of the location and scale parameters of the target distribution.

If you need to review Q-Q plots, see my previous article that describes what a Q-Q plot is, how to construct a Q-Q plot in SAS, and how to interpret a Q-Q plot.

Create a basic Q-Q plot in SAS

Let me be clear: It is not necessary to overlay a line on a Q-Q plot. You can display only the points on a Q-Q plot and, in fact, that is the default behavior in SAS when you create a Q-Q plot by using the QQPLOT statement in PROC UNIVARIATE.

The following DATA step generates 97 random values from an exponential distribution with scale parameter σ = 2 and three artificial "outliers." The call to PROC UNIVARIATE creates a Q-Q plot, which is shown:

data Q(keep=y);
call streaminit(321);
do i = 1 to 97;
   y = round( rand("Expon", 2), 0.001);  /* Y ~ Exp(2), rounded to nearest 0.001 */
   output;
end;
do y = 10,11,15; output; end;   /* add outliers */
run;
 
proc univariate data=Q;
   qqplot y / exp grid;         /* plot data quantiles against Exp(1) */
   ods select QQPlot;
   ods output QQPlot=QQPlot;    /* for later use: save quantiles to a data set */
run;
[Figure: Q-Q plot in SAS without a reference line]

The vertical axis of the Q-Q plot displays the sorted values of the data; the horizontal axis displays evenly spaced quantiles of the standardized target distribution, which in this case is the exponential distribution with scale parameter σ = 1. Most of the points appear to fall on a straight line, which indicates that these (simulated) data might be reasonably modeled by using an exponential distribution. The slope of the line appears to be approximately 2, which is a crude estimate of the scale parameter (σ). The Y-intercept of the line appears to be approximately 0, which is a crude estimate of the location parameter (the threshold parameter, θ).

Although the basic Q-Q plot provides all the information you need to decide that these data can be modeled by an exponential distribution, some data sets are less clear. The Q-Q plot might show a slight bend or wiggle, and you might want to overlay a reference line to assess how severely the pattern deviates from a straight line. The problem is, what line should you use?

A reference line for the Q-Q plot

Cleveland (Visualizing Data, 1993, p. 31) recommends overlaying a line that connects the first and third quartiles. That is, let p25 and p75 be the 25th and 75th percentiles of the target distribution, respectively, and let y25 and y75 be the 25th and 75th percentiles of the ordered data values. Then Cleveland recommends plotting the line through the ordered pairs (p25, y25) and (p75, y75).

In SAS, you can use PROC MEANS to compute the 25th and 75th percentiles for the X and Y variables in the Q-Q plot. You can then use the DATA step or PROC SQL to compute the slope of the line that passes between the percentiles. The following statements analyze the Q-Q plot data that was created by using the ODS OUTPUT statement in the previous section:

proc means data=QQPlot P25 P75;
   var Quantile Data;        /* ODS OUTPUT created the variables Quantile (X) and Data (Y) */
   output out=Pctl P25= P75= / autoname;
run;
 
data _null_;
set Pctl;
slope = (Data_P75 - Data_P25) / (Quantile_P75 - Quantile_P25); /* dy / dx */
/* if desired, put point-slope values into macro variables to help plot the line */
call symputx("x1", Quantile_P25);
call symputx("y1", Data_P25);
call symput("Slope", putn(slope,"BEST5."));
run;
 
title "Q-Q Plot with Reference Line";
title2 "Reference Line through First and Third Quartiles";
title3 "Slope = &slope";
proc sgplot data=QQPlot;
   scatter x=Quantile y=Data;
   lineparm x=&x1 y=&y1 slope=&slope / lineattrs=(color=Green) legendlabel="Percentile Estimate";
   xaxis grid label="Exponential Quantiles"; yaxis grid;
run;
[Figure: Q-Q plot in SAS with reference line through first and third quartiles]

Because the line passes through the first and third quartiles, the slope of the line is robust to outliers in the tails of the data. The line often provides a simple visual guide to help you determine whether the central portion of the data matches the quantiles of the specified probability distribution.

Keep in mind that this is a visual guide. The slope and intercept for this line should not be used as parameter estimates for the location and scale parameters of the probability distribution, although they could be used as an initial guess for an optimization that estimates the location and scale parameters for the model distribution.

Regression lines as visual guides for a Q-Q plot

Let's be honest, when a statistician sees a scatter plot for which the points appear to be linearly related, there is a Pavlovian reflex to fit a regression line to the values in the plot. However, I can think of several reasons to avoid adding a regression line to a Q-Q plot:

  • The values in the Q-Q plot do not satisfy the assumptions of ordinary least squares (OLS) regression. For example, the points are not a random sample and there is no reason to assume that the errors in the Y direction are normally distributed.
  • In practice, the tails of the probability distribution rarely match the tails of the data distribution. In fact, the points to the extreme left and right of a Q-Q plot often exhibit a systematic bend away from a straight line. In an OLS regression, these extreme points will be high-leverage points that will unduly affect the OLS fit.

If you choose to ignore these problems, you can use the REG statement in PROC SGPLOT to add a reference line. Alternatively, you can use PROC REG in SAS (perhaps with the NOINT option if the location parameter is zero) to obtain an estimate of the slope:

proc reg data=QQPlot plots=NONE;
   model Data = Quantile / NOINT;  /* use NOINT when location parameter is 0 */
   ods select ParameterEstimates;
quit;
 
title2 "Least Squares Reference Line";
proc sgplot data=QQPlot;
   scatter x=Quantile y=Data;
   lineparm x=0 y=0 slope=2.36558 / lineattrs=(color=Red) legendlabel="OLS Estimate";
   xaxis grid label="Exponential Quantiles"; yaxis grid;
run;
[Figure: Q-Q plot in SAS with regression line (not recommended)]

For these data, I used the NOINT option to set the threshold parameter to 0. The zero-intercept line with slope 2.36558 is overlaid on the Q-Q plot. As expected, the outliers in the upper-right corner of the Q-Q plot have pulled the regression line upward, so the regression line has a steeper slope than the reference line based on the first and third quartiles. Because the tails of an empirical distribution often differ from the tails of the target distribution, the regression-based reference line can be misleading. I do not recommend its use.
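If you nevertheless choose to display an OLS line, you do not need to hard-code the slope (2.36558): the following sketch captures the ParameterEstimates table with ODS OUTPUT and puts the slope into a macro variable:

ods output ParameterEstimates=PE;
proc reg data=QQPlot plots=NONE;
   model Data = Quantile / NOINT;
quit;

data _null_;
   set PE;
   if Variable='Quantile' then call symputx("OLSslope", Estimate);
run;

proc sgplot data=QQPlot;
   scatter x=Quantile y=Data;
   lineparm x=0 y=0 slope=&OLSslope / lineattrs=(color=Red) legendlabel="OLS Estimate";
   xaxis grid label="Exponential Quantiles"; yaxis grid;
run;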

Maximum likelihood estimates

The previous sections describe two ways to overlay a reference line during the exploratory phase of the data analysis. The purpose of the reference line is to guide your eye and help you determine whether the points in the Q-Q plot appear to fall on a straight line. If so, you can move to the modeling phase.

In the modeling phase, you use a parameter estimation method to fit the parameters in the target distribution. Maximum likelihood estimation (MLE) is often the method-of-choice for estimating parameters from data. You can use the HISTOGRAM statement in PROC UNIVARIATE to obtain a maximum likelihood estimate of the scale parameter for the exponential distribution, which turns out to be 2.21387. If you specify the location and scale parameters in the QQPLOT statement, PROC UNIVARIATE will automatically overlay a line that represents the fitted values:

proc univariate data=Q;
   histogram y / exp;
   qqplot y / exp(threshold=0 scale=est) odstitle="Q-Q Plot with MLE Estimate" grid;
   ods select ParameterEstimates GoodnessOfFit QQPlot;
run;
[Figure: Parameter estimates and goodness-of-fit test for a maximum likelihood estimate of parameters in an exponential distribution]

The ParameterEstimates table shows the maximum likelihood estimate. The GoodnessOfFit table shows that there is no evidence to reject the hypothesis that these data came from an Exp(σ=2.21) distribution.

[Figure: Q-Q plot in SAS with line formed by using maximum likelihood estimates]

Notice the distinction between this line and the previous lines. This line is the result of fitting the target distribution to the data (MLE) whereas the previous lines were visual guides. When you display a Q-Q plot that has a diagonal line, you should state how the line was computed.

In conclusion, you can display a Q-Q plot without adding any reference line. If you choose to overlay a line, there are three common methods. During the exploratory phase of analysis, you can display a line that connects the 25th and 75th percentiles of the data and target distributions. (Some practitioners use an OLS regression line, but I do not recommend it.) During the modeling phase, you can use maximum likelihood estimation or some other fitting method to estimate the location and scale of the target distribution. Those estimates can be used as the intercept and slope, respectively, of a line on the Q-Q plot. PROC UNIVARIATE in SAS displays this line automatically when you fit a distribution.

The post Three ways to add a line to a Q-Q plot appeared first on The DO Loop.

January 9, 2019
 

Numbers don't lie, but sometimes they don't reveal the full story. Last week I wrote about the most popular articles from The DO Loop in 2018. The popular articles are inevitably about elementary topics in SAS programming or statistics because those topics have broad appeal. However, I also write about advanced topics, which are less popular but fill an important niche in the SAS community. Not everyone needs to know how to fit a Pareto distribution in SAS or how to compute distance-based measures of correlation in SAS. Nevertheless, these topics are interesting to think about.

I believe that learning should not stop when we leave school. If you, too, are a lifelong learner, the following topics deserve a second look. I've included articles from four different categories.

Data Visualization

  • Fringe plot: When fitting a logistic model, you can plot the predicted probabilities versus a continuous covariate or versus the empirical probability. You can use a fringe plot to overlay the data on the plot of predicted probabilities. The SAS developer of PROC LOGISTIC liked this article a lot, so look for fringe plots in a future release of SAS/STAT software!
  • Order variables in a correlation matrix or scatter plot matrix: When displaying a graph that shows many variables (such as a scatter plot matrix), you can make the graph more understandable by ordering the variables so that similar variables are adjacent to each other. The article uses single-link clustering to order the variables, as suggested by Hurley (2004).
  • Stacked band plot: You can use PROC SGPLOT to automatically create a stacked bar plot. However, when the bars represent an ordered categorical variable (such as months or years), you might want to create a stacked band plot instead. This article shows how to create a stacked band plot in SAS.

Statistics and Data Analysis

Random numbers and resampling methods

[Figure: Process flow diagram shows how to resample data to create a bootstrap distribution.]

Optimization

These articles are technical but provide tips and techniques that you might find useful. Choose a few topics that are unfamiliar and teach yourself something new in this New Year!

Do you have a favorite article from 2018 that I did not include on the list? Share it in a comment!

The post 10 posts from 2018 that deserve a second look appeared first on The DO Loop.

December 5, 2018
 

Recently a SAS programmer wanted to obtain a table of counts that was based on a histogram. I showed him how you can use the OUTHIST= option on the HISTOGRAM statement in PROC UNIVARIATE to obtain that information. For example, the following call to PROC UNIVARIATE creates a histogram for the MPG_City variable in the Sashelp.Cars data set. The histogram has 11 bins. The OUTHIST= option writes the counts for each bin to a SAS data set:

proc univariate data=Sashelp.Cars noprint;
   var MPG_City;
   histogram MPG_City / barlabel=count outhist=MidPtOut;
run;
 
proc print data=MidPtOut label;
   label _MIDPT_ = "Midpoint" _COUNT_="Frequency";
   var _MIDPT_ _COUNT_;
run;

Endpoints versus midpoints

As I've previously discussed, PROC UNIVARIATE supports two options for specifying the locations of bins. The MIDPOINTS option specifies that "nice" numbers (for example, multiples of 2, 5, or 10) are used for the midpoints of the bins; the ENDPOINTS option specifies that nice numbers are used for the endpoints of the bins. By default, midpoints are used, as shown in the previous section. The following call to PROC UNIVARIATE uses the ENDPOINTS option and writes the new bin counts to a data set. The histogram is not shown.

proc univariate data=Sashelp.Cars noprint;
   var MPG_City;
   histogram MPG_City / barlabel=count endpoints outhist=EndPtOut;
run;
 
proc print data=EndPtOut label;
   label _MINPT_ = "Left Endpoint" _COUNT_="Frequency";
   var _MINPT_ _COUNT_;
run;

Tabulating counts in the SAS/IML language

If you want to "manually" count the number of observations in each bin, you have a few choices. If you already know the bin width and anchor position for the bins, then you can use a DATA step array to accumulate the counts. You can also use PROC FORMAT to define a format to bin the observations and use PROC FREQ to tabulate the counts.
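For example, the following sketch (the five-unit bins are hand-chosen) defines a format that bins the MPG_City variable and uses PROC FREQ to tabulate the counts:

proc format;
   value MPGBin low-<15='[10,15)'  15-<20='[15,20)'  20-<25='[20,25)'
                25-<30='[25,30)'  30-<35='[30,35)'  35-high='[35,60]';
run;

proc freq data=Sashelp.Cars;
   tables MPG_City / nocum;
   format MPG_City MPGBin.;   /* bin the values by using the format */
run;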

The harder problem is when you do not have a prior set of "nice" values to use as the endpoints of bins. It is usually not satisfactory to use the minimum and maximum data values as endpoints of the binning intervals because that might result in intervals whose endpoints are long decimal values such as [3.4546667 4.0108333].

Fortunately, the SAS/IML language provides the GSCALE subroutine, which computes "nice" values from a vector of data and the number of bins. The GSCALE routine returns a three-element vector. The first element is the minimum value of the leftmost interval, the second element is the maximum value of the rightmost interval, and the third element is the bin width. For example, the following SAS/IML statements compute nice intervals for the data in the MPG_City variable:

proc iml;
use Sashelp.Cars;
   read all var "MPG_City" into X;
close;
 
/* GSCALE subroutine computes "nice" tick values: s[1]<=min(x); s[2]>=max(x) */
call gscale(s, x, 10);  /* ask for about 10 intervals */
print s[rowname={"Start" "Stop" "Increment"}];

The output from the GSCALE subroutine suggests that a good set of intervals to use for binning the data are [10, 15), [15, 20), ..., [55, 60]. These are the same endpoints that are generated by using the ENDPOINTS option in PROC UNIVARIATE. (Actually, the procedure uses half-open intervals for all bins, so it adds the extra interval [60, 65) to the histogram.)

I've previously shown how to use the BIN and TABULATE functions in SAS/IML to count the observations in a set of bins. The following statements use the values from the GSCALE routine to form evenly spaced cutpoints for the binning:

cutPoints = do(s[1], s[2], s[3]);    /* use "nice" cutpoints from GSCALE */
*cutPoints = do(s[1], s[2]+s[3], s[3]);  /* ALTERNATIVE: add additional cutpoint to match UNIVARIATE */
b = bin(x, cutPoints);               /* find bin for each obs */
call tabulate(bins, freq, b);        /* count how many obs in each bin */
binLabels = char(cutPoints[bins]);   /* use left endpoint as labels for bins */
print freq[colname = binLabels label="Count"];

Except for the last interval, the counts are the same as for the ENDPOINTS option in PROC UNIVARIATE. It is a matter of personal preference whether you want to treat the last interval as a closed interval or whether you want all intervals to be half open. If you want to exactly match PROC UNIVARIATE, you can modify the definition of the cutPoints variable, as indicated in the program comments.

Notice that the TABULATE routine only reports the bins that have nonzero counts. If you prefer to obtain counts for ALL bins—even bins with zero counts—you can use the TabulateLevels module, which I described in a previous blog post.

In summary, you can use PROC UNIVARIATE or SAS/IML to create a tabular representation of a histogram. Both procedures provide a way to obtain "nice" values for the bin endpoints. If you already know the endpoints for the bins, you can use other techniques in SAS to produce the table.

The post When is a histogram not a histogram? When it's a table! appeared first on The DO Loop.

October 3, 2018
 

It is sometimes necessary for researchers to simulate data with thousands of variables. It is easy to simulate thousands of uncorrelated variables, but more difficult to simulate thousands of correlated variables. For that, you can generate a correlation matrix that has special properties, such as a Toeplitz matrix or a first-order autoregressive (AR(1)) correlation matrix. I have previously written about how to generate a large Toeplitz matrix in SAS. This article describes three useful results for an AR(1) correlation matrix:

  • How to generate an AR(1) correlation matrix in the SAS/IML language
  • How to use a formula to compute the explicit Cholesky root of an AR(1) correlation matrix.
  • How to efficiently simulate multivariate normal variables with AR(1) correlation.

Generate an AR(1) correlation matrix in SAS

The AR(1) correlation structure is used in statistics to model observations that have correlated errors. (For example, see the documentation of PROC MIXED in SAS.) If Σ is an AR(1) correlation matrix, then its elements are constant along diagonals. The (i,j)th element of an AR(1) correlation matrix has the form Σ[i,j] = ρ^|i – j|, where ρ is a constant that determines the geometric rate at which correlations decay between successive time intervals. The exponent for each term is the L1 distance between the row number and the column number. As I have shown in a previous article, you can use the DISTANCE function in SAS/IML 14.3 to quickly evaluate functions that depend on the distance between two sets of points. Consequently, the following SAS/IML function computes the correlation matrix for a p x p AR(1) matrix:

proc iml;
/* return p x p matrix whose (i,j)th element is rho^|i - j| */
start AR1Corr(rho, p); 
   return rho##distance(T(1:p), T(1:p), "L1");
finish;
 
/* test on 10 x 10 matrix with rho = 0.8 */
rho = 0.8;  p = 10;
Sigma = AR1Corr(rho, p);
print Sigma[format=Best7.];

A formula for the Cholesky root of an AR(1) correlation matrix

Every covariance matrix has a Cholesky decomposition, which represents the matrix as the crossproduct of a triangular matrix with itself: Σ = R^T R, where R is upper triangular. In SAS/IML, you can use the ROOT function to compute the Cholesky root of an arbitrary positive definite matrix. However, the AR(1) correlation matrix has an explicit formula for the Cholesky root in terms of ρ. This explicit formula does not appear to be well known by statisticians, but it is a special case of a general formula developed by V. Madar (Section 5.1, 2016), who presented a poster at a Southeast SAS Users Group (SESUG) meeting a few years ago. An explicit formula means that you can compute the Cholesky root matrix, R, in a direct and efficient manner, as follows:

/* direct computation of Cholesky root for an AR(1) matrix */
start AR1Root(rho, p);
   R = j(p,p,0);                /* allocate p x p matrix */
   R[1,] = rho##(0:p-1);        /* formula for 1st row */
   c = sqrt(1 - rho**2);        /* scaling factor: c^2 + rho^2 = 1 */
   R2 = c * R[1,];              /* formula for 2nd row */
   do j = 2 to p;               /* shift elements in 2nd row for remaining rows */
      R[j, j:p] = R2[,1:p-j+1]; 
   end;
   return R;
finish;
 
R = AR1Root(rho, p);   /* compare to R = root(Sigma), which requires forming Sigma */
print R[L="Cholesky Root" format=Best7.];
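
As a quick sanity check (this step is my addition, not part of the original program), you can verify that the explicit formula agrees with the general-purpose ROOT function by using the Sigma matrix computed earlier:

/* the explicit root should match the general Cholesky root up to machine precision */
Rgen = root(Sigma);            /* general-purpose Cholesky root of Sigma */
maxDiff = max(abs(R - Rgen));  /* expect a value near machine epsilon */
print maxDiff;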

You can compute an AR(1) covariance matrix from the correlation matrix by multiplying the correlation matrix by a positive scalar, σ².
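
For example, a one-line sketch (sigma2 is an arbitrary variance chosen for illustration):

sigma2 = 4;                        /* variance sigma^2; arbitrary illustrative value */
Cov = sigma2 * AR1Corr(rho, p);    /* AR(1) covariance = sigma^2 times the correlation matrix */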

Efficient simulation of multivariate normal variables with AR(1) correlation

An efficient way to simulate data from a multivariate normal population with covariance Σ is to use the Cholesky decomposition to induce correlation among a set of uncorrelated normal variates. This is the technique used by the RandNormal function in SAS/IML software. Internally, the RandNormal function calls the ROOT function, which can compute the Cholesky root of an arbitrary positive definite matrix.

When there are thousands of variables, the Cholesky decomposition might take a second or more. If you call the RandNormal function thousands of times during a simulation study, you pay that one-second penalty during each call. For the AR(1) covariance structure, you can use the explicit formula for the Cholesky root to save a considerable amount of time. You also do not need to explicitly form the p x p matrix, Σ, which saves RAM. The following SAS/IML function is an efficient way to simulate N observations from a p-dimensional multivariate normal distribution that has an AR(1) correlation structure with parameter ρ:

/* simulate multivariate normal data from a population with AR(1) correlation */
start RandNormalAR1( N, Mean, rho );
    mMean = rowvec(Mean);
    p = ncol(mMean);
    U = AR1Root(rho, p);      /* use explicit formula instead of ROOT(Sigma) */
    Z = j(N,p);
    call randgen(Z,'NORMAL');
    return (mMean + Z*U);
finish;
 
call randseed(12345);
p = 1000;                  /* big matrix */
mean = j(1, p, 0);         /* mean of MVN distribution */
 
/* simulate data from MVN distribs with different values of rho */
v = do(0.01, 0.99, 0.01);  /* choose rho from list 0.01, 0.02, ..., 0.99 */
t0 = time();               /* time it! */
do i = 1 to ncol(v);
   rho = v[i];
   X = randnormalAR1(500, mean, rho);  /* simulate 500 obs from MVN with p vars */
end;
t_SimMVN = time() - t0;     /* total time to simulate all data */
print t_SimMVN;

Each iteration of the previous loop generates a sample that contains N=500 observations and p=1000 variables. Each sample is drawn from a multivariate normal distribution that has an AR(1) correlation, but with a different value of ρ for each sample, where ρ = 0.01, 0.02, ..., 0.99. On my desktop computer, this simulation of 99 correlated samples takes about 4 seconds. This is about 25% of the time for the same simulation that explicitly forms the AR(1) correlation matrix and calls RandNormal.
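
For reference, the following sketch shows the conventional approach that the timing compares against. The details of the loop are my reconstruction, but RANDNORMAL is the standard SAS/IML routine for this task:

/* conventional approach: form Sigma explicitly and let RANDNORMAL compute ROOT(Sigma) */
t0 = time();
do i = 1 to ncol(v);
   Sigma = AR1Corr(v[i], p);            /* explicitly form the p x p matrix (uses RAM) */
   X = randnormal(500, mean, Sigma);    /* Cholesky root computed internally */
end;
t_RandNormal = time() - t0;
print t_RandNormal;                     /* per the timing above, roughly 4x t_SimMVN */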

In summary, the AR(1) correlation matrix is an easy way to generate a symmetric positive definite matrix. You can use the DISTANCE function in SAS/IML 14.3 to create such a matrix, but for some applications you might only require the Cholesky root of the matrix. The AR(1) correlation matrix has an explicit Cholesky root that you can use to speed up simulation studies such as generating samples from a multivariate normal distribution that has an AR(1) correlation.

The post Fast simulation of multivariate normal data with an AR(1) correlation structure appeared first on The DO Loop.

9月 262018
 

A radial basis function is a scalar function that depends on the distance to some point, called the center point, c. One popular radial basis function is the Gaussian kernel φ(x; c) = exp(-||x – c||² / (2σ²)), which uses the squared distance from a vector x to the center c to assign a weight. The weighted sum of Gaussian kernels, Σᵢ wᵢ φ(x; cᵢ), arises in many applications in statistics, including kernel density estimation, kernel smoothing, and machine learning algorithms such as support vector machines. It is therefore important to be able to efficiently evaluate a radial basis function and compute a weighted sum of several such kernel functions.

One of the many useful features of the SAS/IML language is its ability to compactly represent matrix and vector expressions. The expression ||x – c|| looks like the distance between two vectors, but in the SAS/IML language the DISTANCE function can handle multiple sets of vectors (a short sketch follows this list):

  • The DISTANCE function can compute the distance between two vectors of arbitrary dimensions. Thus when x and c are both d-dimensional row vectors, you can compute the distance by using r = DISTANCE(x, c). The result is a scalar distance.
  • The DISTANCE function can compute the distance between multiple points and a center. Thus when x is an m x d matrix that contains m points, you can compute the m distances between the points and c by using r = DISTANCE(x, c). Again, the syntax is the same, but now r is an m x 1 vector of distances.
  • The DISTANCE function in SAS/IML 14.3 can compute the distance between multiple points and multiple centers. Thus when x is an m x d matrix that contains m points and c is a p x d matrix that contains p centers, you can compute the m*p distances between the points and the centers by using r = DISTANCE(x, c). The syntax is the same, but now r is an m x p matrix of distances.
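
To make the three cases concrete, here is a small sketch (the data values and dimensions are arbitrary choices for illustration):

proc iml;
x = {1 2, 3 4, 5 6};        /* m=3 points in d=2 dimensions */
c = {0 0};                  /* one center */
r1 = distance(x[1,], c);    /* scalar: distance from one point to one center */
r2 = distance(x, c);        /* 3 x 1 vector: each point to the single center */
centers = {0 0, 1 1};       /* p=2 centers (requires SAS/IML 14.3) */
r3 = distance(x, centers);  /* 3 x 2 matrix: r3[i,j] = ||x[i,] - centers[j,]|| */
print r1, r2, r3;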

A SAS/IML function that evaluates a Gaussian kernel function

The following SAS/IML statements define a Gaussian kernel function. Notice that the function is very compact! To test the function, define one center at C = (2.3, 3.2). Because SAS/IML is a matrix language, you can evaluate the Gaussian kernel on a grid of integer coordinates (x,y) where x is an integer in the range [1,5] and y is in the range [1,8]. Let Z be the matrix of the 40 ordered pairs. The following call evaluates the Gaussian kernel at the grid of points:

proc iml;
/* Radial basis function (Gaussian kernel). If z is m x d and c is n x d, this function
   returns the mxn matrix of values exp( -||z[i,] - c[j,]||**2 / (2*sigma**2) ) */
start GaussKernel(z, c, sigma=1);
   return exp( -distance(z,c)##2 / (2*sigma**2) );
finish;
 
/* test on small data: Z is a 5 x 8 grid and C = {2.3 3.2} */
xPts = 1:5; yPts = 1:8;
Z = expandgrid(xPts, yPts);              /* expand into (8*5) x 2 matrix */
C = {2.3 3.2};                           /* the center */ 
phi = GaussKernel(Z, C);                 /* phi is 40 x 1 vector */
 
print Z phi;                             /* print in expanded form */
phi_Grid = shapecol(phi, ncol(yPts));    /* reshape into grid (optional) */
print phi_Grid[c=(char(xPts)) r=(char(yPts)) F=4.2];

The table shows the Gaussian kernel evaluated at the grid points. The columns represent the values at the X locations and the rows indicate the Y locations. The function is largest at the grid point (x,y) = (2,3) because (2,3) is the grid point closest to the center (2.3, 3.2). The largest value is 0.94. Notice that the function is essentially zero at points that are more than 3 units from the center, which you would expect from a Gaussian distribution with σ = 1.

You can use the HEATMAPCONT subroutine to make a heat map of the function values. However, notice that in the matrix the rows increase in the downward direction, whereas in the usual Cartesian coordinate system the Y direction increases upward. Consequently, you need to reverse the rows and the Y-axis labels when you create a heat map:

start PlotValues( v, xPts, yPts );
   G = shapecol(v, ncol(yPts));      /* reshape vector into grid */
   M =  G[nrow(G):1, ];              /* flip Y axis (rows) */
   yRev = yPts[, ncol(yPts):1];      /* reverse the Y-axis labels */
   call heatmapcont(M) xvalues=xPts yValues=yRev;
finish;
run PlotValues(phi, xPts, yPts);
Evaluate Gaussian kernel on grid of points. Visualize with a heat map.

Sums of radial basis functions

Locations of 86 US cities with population greater than 200,000

Often the "centers" are the locations of some resource such as a warehouse, a hospital, or an ATM. Let's use the locations of 86 large US cities, which I used in a previous article about spatial data analysis. A graph of the locations of the cities is shown to the right. The locations are in a standardized coordinate system, so they do not directly correspond to longitudes and latitudes.

If there are multiple centers, the GaussKernel function returns a column for every center. Many applications require a weighted sum of the columns. You can achieve a weighted sum by using a matrix-vector product A*w, where w is a column vector of weights. If you want an unweighted sum, you can use the SAS/IML subscript reduction operator to sum across the columns: A[,+].
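
As a minimal sketch (the names are hypothetical: A stands for an m x p matrix of kernel evaluations, such as the phi matrix computed below, and w is a p x 1 vector of weights):

/* hypothetical sketch: weighted vs. unweighted sums of the columns of A */
w = j(ncol(A), 1, 1/ncol(A));   /* e.g., equal weights that sum to 1 */
wSum = A * w;                   /* m x 1 vector: weighted sum of the columns */
uSum = A[ ,+];                  /* m x 1 vector: unweighted sum across the columns */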

For example, the following statements evaluate the Gaussian kernel function at each value in a grid (the Z matrix) and for each of 86 cities (the C matrix). The result is a 3726 x 86 matrix of values. You can use the subscript reduction operator to sum the kernel evaluations over the cities, as shown:

use BigCities;
   read all var {x y} into C;     /* C = (x,y) locations of centers */
   read all var "City";
close;
 
/* Z = a regular grid in (x,y) coordinates that contains the data */
XGridPts = round( do(-0.4, 0.4, 0.01), 0.001);
YGridPts = round( do(-0.2, 0.25, 0.01), 0.001);
Z = expandgrid( XGridPts, YGridPts );  /* 3,726 points on an 81 x 46 grid */
 
phi = GaussKernel(Z, C, 0.025);   /* evaluate kernel at grid points; use a smaller bandwidth */
sumPhi = phi[,+];                 /* for each grid point, sum the kernel evaluations over all cities */
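
The heat map that follows was presumably produced by the PlotValues helper defined earlier; a call such as the following (my reconstruction, not shown in the original program) reproduces it:

run PlotValues(sumPhi, XGridPts, YGridPts);   /* heat map of the summed kernel values */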
Sum of 86 Gaussian kernels evaluated on a regular grid

The resulting heat map shows blobs centered at each large city in the data. Locations near isolated cities (such as Oklahoma City) are lighter in color than locations near multiple nearby cities (such as southern California and the New York area) because the image shows the superposition of the kernel functions. At points that are far from any large city, the sum of the Gaussian kernel functions is essentially zero.

In summary, if you work with algorithms that use radial basis functions such as Gaussian kernels, you can use the SAS/IML language to evaluate these functions. By using the matrix features of the language and the fact that the DISTANCE function supports matrices as arguments, you can quickly and efficiently evaluate weighted sums of these kernel functions.

The post Radial basis functions and Gaussian kernels in SAS appeared first on The DO Loop.