Did you add "learn something new" to your list of New Year's resolutions? Last week, I wrote about the most popular articles from The DO Loop in 2019. The most popular articles are about elementary topics in SAS programming or univariate statistics because those topics have broad appeal.

Advanced topics and multivariate statistics are less popular but no less important. If you want to learn something new, check out this "Editor's Choice" list of articles that will broaden your statistical knowledge and enhance your SAS programming skills. I've grouped the articles into three broad categories:

### Regression statistics

### High-dimensional analyses and visualization

### Low-dimensional data visualization

The tips and techniques in these articles are useful, so read a few articles today and teach yourself something new in this New Year!

Do you have a favorite article from 2019 that did not make either list? Share it in a comment!

The post 10 posts from 2019 that deserve a second look appeared first on The DO Loop.

Many SAS procedures can automatically create a graph that overlays multiple prediction curves and their prediction limits. This graph (sometimes called a "fit plot" or a "sliced fit plot") is useful when you want to visualize a model in which a continuous response variable depends on one continuous explanatory variable and one categorical (classification) variable. You can use the EFFECTPLOT statement in PROC PLM to create similar visualizations of other kinds of regression models. This article shows three ways to create a sliced fit plot: directly from the procedure, by using the EFFECTPLOT statement in PROC PLM, and by writing the predictions to a data set and using PROC SGPLOT to graph the results.

The data for this example is from the PROC LOGISTIC documentation. The response variable is the presence or absence of pain in seniors who are split into two treatment groups (A and B) and a placebo group (P).

### Score new data by using the ILINK option

There are many reasons to use PROC PLM, but an important one is to score new observations. Given information about new patients, you can use PROC PLM to predict the probability of pain if those patients are given a specific treatment. The following DATA step defines the characteristics of four patients who will receive Treatment B. The call to PROC PLM scores the new data and uses the ILINK option to predict the probability of pain:

```sas
/* Use PLM to score new patients */
data NewPatients;
   input Treatment $ Sex $ Age Duration;
   drop Duration;
   datalines;
B M 67 15
B F 73  5
B M 74 12
B F 79 16
;

/* predictions on the DATA scale */
proc plm restore=PainModel noprint;
   score data=NewPatients out=ScoreILink
         predicted lclm uclm / ilink;   /* ILINK gives probabilities */
run;

proc print data=ScoreILink; run;
```

The Predicted column contains probabilities in the interval [0, 1]. The 95% prediction limits for the predictions are given by the LCLM and UCLM columns. For example, the prediction interval for the 67-year-old man is approximately [0.03, 0.48].

These values and intervals are transformations of analogous quantities on the linear scale. The logit transformation maps the predicted probabilities to the linear estimates. The inverse logit (logistic) transformation maps the linear estimates to the predicted probabilities.
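To make the two scales concrete, here is a small Python sketch (illustrative only; the linear-scale values are hypothetical, not the actual PROC PLM output) of the logit and inverse-logit (logistic) transforms:

```python
import math

def logistic(x):
    """Inverse logit: map a linear-scale estimate to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    """Logit: map a probability back to the linear scale."""
    return math.log(p / (1.0 - p))

# Hypothetical linear-scale prediction and 95% limits for one patient
xbeta, lower, upper = -1.27, -3.33, 0.80

prob = logistic(xbeta)                       # prediction on the data scale
limits = (logistic(lower), logistic(upper))  # prediction interval on the data scale
print(round(prob, 3), tuple(round(v, 3) for v in limits))

# The two transforms are inverses of each other
assert abs(logit(logistic(xbeta)) - xbeta) < 1e-12
```

Because the logistic function is monotone increasing, transforming the endpoints of the linear-scale interval gives the endpoints of the data-scale interval.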

### Linear estimates and the logistic transformation

The linear scale is important because effects are additive on this scale. If you are testing the difference of means between groups, the tests are performed on the linear scale. For example, the ESTIMATE, LSMEANS, and LSMESTIMATE statements in SAS perform hypothesis testing on the linear estimates. Each of these statements supports the ILINK option, which enables you to display predicted values on the data scale.

To demonstrate the connection between the predicted values on the linear and data scale, the following call to PROC PLM scores the same data according to the same model. However, this time the ILINK option is omitted, so the predictions are on the linear scale.

```sas
/* predictions on the LINEAR scale */
proc plm restore=PainModel noprint;
   score data=NewPatients
         out=ScoreXBeta(rename=(Predicted=XBeta LCLM=LowerXB UCLM=UpperXB))
         predicted lclm uclm;   /* ILINK not used, so linear predictor */
run;

proc print data=ScoreXBeta; run;
```

I have renamed the variables that PROC PLM creates for the estimates on the linear scale. The XBeta column shows the predicted values, and the LowerXB and UpperXB columns show the prediction interval for each patient. The XBeta column shows the values you would obtain if you took the parameter estimates table from PROC LOGISTIC and applied those estimates to the observations in the NewPatients data.

To demonstrate that the linear estimates are related to the estimates in the previous section, the following SAS DATA step uses the logistic (inverse logit) transformation to convert the linear estimates into predicted probabilities:

```sas
/* Use the logistic (inverse logit) transformation to convert the linear
   estimates (XBeta) into probability estimates in [0,1], which is the data
   scale. You can use the logit transformation to go the other way. */
data LinearToData;
   set ScoreXBeta;   /* predictions on the linear scale */
   PredProb = logistic(XBeta);
   LCLProb  = logistic(LowerXB);
   UCLProb  = logistic(UpperXB);
run;

proc print data=LinearToData;
   var Treatment Sex Age PredProb LCLProb UCLProb;
run;
```

The transformation of the linear estimates gives the same values as the estimates that were obtained by using the ILINK option in the previous section.

### Summary

In summary, there are two different scales for predicting values for a generalized linear model. When you report predicted values, it is important to specify the scale you are using. The data scale makes intuitive sense because it is the same scale as the response variable. You can use the ILINK option in SAS to get predicted values and prediction intervals on the data scale.

The post Predicted values in generalized linear models: The ILINK option in SAS appeared first on The DO Loop.

SAS/STAT software contains a number of so-called HP procedures for training and evaluating predictive models. ("HP" stands for "high performance.") A popular HP procedure is HPLOGISTIC, which enables you to fit logistic models on Big Data. A goal of the HP procedures is to fit models quickly. Inferential statistics such as standard errors, hypothesis tests, and p-values are less important for Big Data because, if the sample is big enough, all effects are significant and all hypothesis tests are rejected! Accordingly, many of the HP procedures do not support the same statistical tests as their non-HP cousins (for example, PROC LOGISTIC).

A SAS programmer recently posted an interesting question on the SAS Support Community. She is using PROC HPLOGISTIC for variable selection on a large data set. She obtained a final model, but she also wanted to estimate the covariance matrix of the parameters. PROC HPLOGISTIC does not support that statistic, but PROC LOGISTIC does (the COVB option on the MODEL statement). She asked whether it is possible to get the covariance matrix for the final model from PROC LOGISTIC, given that PROC HPLOGISTIC has already fit the model. Furthermore, PROC LOGISTIC supports computing and graphing odds ratios, so is it possible to get those statistics, too?

It is an intriguing question. The answer is "yes," although PROC LOGISTIC still has to perform some work. The main idea is that you can tell PROC LOGISTIC to use the parameter estimates found by PROC HPLOGISTIC.

### What portion of a logistic regression takes the most time?

The main computational burden in logistic regression is threefold:

• Reading the data and levelizing the CLASS variables in order to form a design matrix. This is done once.
• Forming the sum of squares and crossproducts matrix (SSCP, also called the X`X matrix). This is done once.
• Maximizing the likelihood function. The parameter estimates require running a nonlinear optimization. During the optimization, you need to evaluate the likelihood function, its gradient, and its Hessian many times. This is an expensive operation, and it must be done for each iteration of the optimization method.
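To see why the third step dominates, here is a toy Python sketch (an illustration of the quantities involved, not HPLOGISTIC's actual implementation) of what one likelihood evaluation costs: a full pass over the data to accumulate the loglikelihood, its gradient, and its Hessian.

```python
import math

def loglik_grad_hess(beta, X, y):
    """Loglikelihood, gradient, and Hessian for a logistic model.
    Each call requires a full pass over the data; the optimizer
    repeats this work at every iteration."""
    p = len(beta)
    ll = 0.0
    grad = [0.0] * p
    hess = [[0.0] * p for _ in range(p)]
    for xi, yi in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, xi))
        mu = 1.0 / (1.0 + math.exp(-eta))   # predicted probability
        ll += yi * math.log(mu) + (1 - yi) * math.log(1 - mu)
        w = mu * (1.0 - mu)
        for j in range(p):
            grad[j] += (yi - mu) * xi[j]
            for k in range(p):
                hess[j][k] -= w * xi[j] * xi[k]
    return ll, grad, hess

# Tiny check: intercept-only model, four observations.
# At beta=0 every mu is 0.5, so ll = 4*log(0.5), grad = [1.0], hess = [[-1.0]].
X = [[1.0], [1.0], [1.0], [1.0]]
y = [1, 0, 1, 1]
ll, grad, hess = loglik_grad_hess([0.0], X, y)
```

For n observations and p parameters, each evaluation is O(n·p²) work, repeated at every iteration of the optimizer.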

The amount of time required by each step depends on the number of observations, the number of effects in the model, the number of levels in the classification variables, and the number of computational threads available on your system. You can use the Timing table in the PROC HPLOGISTIC output to determine how much time is spent in each step for your data/model combination. For example, the following results are for a simulated data set that has 500,000 observations, 30 continuous variables, and four CLASS variables (each with 3 levels), with a main-effects model, running on a desktop PC with four cores. The iteration history is not shown, but this model converged in six iterations, which is relatively fast.

```sas
ods select PerformanceInfo IterHistory Timing;
proc hplogistic data=simLogi OUTEST;
   class c1-c&numClass;
   model y(event='1') = x1-x&numCont c1-c&numClass;
   ods output ParameterEstimates=PE;
   performance details;
run;
```

For these data and model, the Timing table tells you that the HPLOGISTIC procedure fit the model in 3.68 seconds. Of that time, about 17% was spent reading the data, levelizing the CLASS variables, and accumulating the SSCP matrix. The bulk of the time was spent evaluating the loglikelihood function and its derivatives during the six iterations of the optimization process.
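A back-of-envelope breakdown of those timings (simple arithmetic on the numbers quoted above):

```python
total_seconds = 3.68   # total fit time reported in the Timing table
setup_share = 0.17     # reading data, levelizing CLASS vars, SSCP accumulation

setup = total_seconds * setup_share
optimization = total_seconds - setup   # loglikelihood/derivative evaluations
per_iteration = optimization / 6       # the model converged in 6 iterations
print(round(setup, 2), round(optimization, 2), round(per_iteration, 2))
```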

### Reduce time in an optimization by improving the initial guess

If you want to improve the performance of the model fitting, about the only thing you can control is the number of iterations required to fit the model. Both HPLOGISTIC and PROC LOGISTIC support an INEST= option that enables you to provide an initial guess for the parameter estimates. If the guess is close to the final estimates, the optimization method will require fewer iterations to converge.

Here's the key idea: If you provide the parameter estimates from a previous run, then you don't need to run the optimization algorithm at all! You still need to read the data, form the SSCP matrix, and evaluate the loglikelihood function at the (final) parameter estimates. But you don't need any iterations of the optimization method because you know that the estimates are optimal.
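A toy Newton's-method sketch in Python illustrates the idea (an intercept-only logistic model that mimics the logic, not the procedure's actual optimizer): starting the optimization at the previous solution means the convergence test passes immediately, which is exactly what INEST= plus MAXITER=0 exploits.

```python
import math

def newton_intercept_logistic(y, beta0, tol=1e-8, maxiter=25):
    """Newton's method for an intercept-only logistic model.
    Returns (estimate, number of Newton steps actually performed)."""
    b = beta0
    for it in range(maxiter):
        mu = 1.0 / (1.0 + math.exp(-b))   # fitted probability
        grad = sum(y) - len(y) * mu       # score function
        if abs(grad) < tol:
            return b, it                  # converged after `it` Newton steps
        hess = -len(y) * mu * (1.0 - mu)
        b -= grad / hess
    return b, maxiter

y = [1, 0, 1, 1]   # MLE of the intercept is logit(3/4) = log(3)

# Cold start: several Newton steps are needed.
b_hat, iters_cold = newton_intercept_logistic(y, 0.0)

# Warm start at the previous solution: the gradient is already ~0,
# so no iterations are performed -- the analogue of INEST= with MAXITER=0.
_, iters_warm = newton_intercept_logistic(y, b_hat)
print(iters_cold, iters_warm)   # iters_warm is 0
```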

How much time can you save by using this trick? Let's find out. Notice that the previous call used the ODS OUTPUT statement to output the final parameter estimates. (It also used the OUTEST option to add the ParmName variable to the output.) You can run PROC TRANSPOSE to convert the ParameterEstimates table into a data set that can be read by the INEST= option on PROC HPLOGISTIC or PROC LOGISTIC. For either procedure, set the option MAXITER=0 to bypass the optimization process, as follows:

```sas
/* create INEST= data set from ParameterEstimates */
proc transpose data=PE out=inest(type=EST) label=_TYPE_;
   label Estimate=PARMS;
   var Estimate;
   id ParmName;
run;

ods select PerformanceInfo IterHistory Timing;
proc hplogistic data=simLogi INEST=inest MAXITER=0;
   class c1-c&numClass;
   model y(event='1') = x1-x&numCont c1-c&numClass;
   performance details;
run;
```

The time required to read the data, levelize, and form the SSCP matrix is unchanged. However, the time spent evaluating the loglikelihood and its derivatives is about 1/(1+NumIters) of the time from the first run, where NumIters is the number of optimization iterations (6, for these data). The second call to the HPLOGISTIC procedure still has to evaluate the loglikelihood function to form statistics such as the AIC and BIC, and it must evaluate the Hessian to compute standard errors of the estimates. But the total time is greatly decreased by skipping the optimization step.
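The expected fraction is easy to compute (illustrative arithmetic using NumIters = 6 from the example above):

```python
num_iters = 6                    # optimization iterations in the original fit
fraction = 1 / (1 + num_iters)   # share of likelihood-evaluation time that remains
print(round(fraction, 3))        # roughly 0.14: about an 86% reduction in that step
```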

### Use PROC HPLOGISTIC estimates to jump-start PROC LOGISTIC

You can use the same trick with PROC LOGISTIC. You can use the INEST= and MAXITER=0 options to greatly reduce the total time for the procedure. At the same time, you can request statistics such as COVB and ODDSRATIO that are not available in PROC HPLOGISTIC, as shown in the following example:

```sas
proc logistic data=simLogi INEST=inest
     plots(only MAXPOINTS=NONE)=oddsratio(range=clip);
   class c1-c&numClass;
   model y(event='1') = x1-x&numCont c1-c&numClass / MAXITER=0 COVB;
   oddsratio c1;
run;
```

The PROC LOGISTIC step takes about 4.5 seconds. It produces odds ratios and plots for the model effects and displays the covariance matrix of the betas (COVB). By using the parameter estimates that were obtained by PROC HPLOGISTIC, it avoids the expensive optimization iterations.

You can also use the STORE statement in PROC LOGISTIC to save the model to an item store. You can then use PROC PLM to create additional graphs and to run additional post-fit analyses.

In summary, PROC LOGISTIC can compute statistics and hypothesis tests that are not available in PROC HPLOGISTIC. It is possible to fit a model by using PROC HPLOGISTIC and then use the INEST= and MAXITER=0 options to pass the parameter estimates to PROC LOGISTIC. This enables PROC LOGISTIC to skip the optimization iterations, which saves substantial computational time.

You can download the SAS program that creates the tables and graphs in this article. The program contains a DATA step that simulates logistic data of any size.

The post Jump-start PROC LOGISTIC by using parameter estimates from PROC HPLOGISTIC appeared first on The DO Loop.

The background is from An Introduction to Generalized Linear Models by A. J. Dobson. The example is described as follows:

The data are from an experiment to promote the recovery of stroke patients. There were three experimental groups:

A was a new occupational therapy intervention;
B was the existing stroke rehabilitation program conducted in the same hospital where A was conducted;
C was the usual care regime for stroke patients provided in a different hospital.

There were eight patients in each experimental group. The response variable was a measure of functional ability, the Barthel index; higher scores correspond to better outcomes, and the maximum score is 100. Each patient was assessed weekly over the eight weeks of the study.

```sas
data table11_1;
   input Subject Group $ week1-week8;
   datalines;
 1 A  45  45  45  45  80  80  80  90
 2 A  20  25  25  25  30  35  30  50
 3 A  50  50  55  70  70  75  90  90
 4 A  25  25  35  40  60  60  70  80
 5 A 100 100 100 100 100 100 100 100
 6 A  20  20  30  50  50  60  85  95
 7 A  30  35  35  40  50  60  75  85
 8 A  30  35  45  50  55  65  65  70
 9 B  40  55  60  70  80  85  90  90
10 B  65  65  70  70  80  80  80  80
11 B  30  30  40  45  65  85  85  85
12 B  25  35  35  35  40  45  45  45
13 B  45  45  80  80  80  80  80  80
14 B  15  15  10  10  10  20  20  20
15 B  35  35  35  45  45  45  50  50
16 B  40  40  40  55  55  55  60  65
17 C  20  20  30  30  30  30  30  30
18 C  35  35  35  40  40  40  40  40
19 C  35  35  35  40  40  40  45  45
20 C  45  65  65  65  80  85  95 100
21 C  45  65  70  90  90  95  95 100
22 C  25  30  30  35  40  40  40  40
23 C  25  25  30  30  30  30  35  40
24 C  15  35  35  35  40  50  65  65
;
run;
```

Here a linear model is used to study the relation between Group and Time. The model is

Y_{i,j,k} = \alpha_{i} + \beta_{i} t_{k} + e_{i,j,k}

where Y_{i,j,k} is the score at time t_{k} (k = 1 to 8) for patient j (j = 1 to 8) in group i (i = 1 for Group A, 2 for B, and 3 for C). For the regression it is convenient to rearrange the data from wide to long:

```sas
data table11_1long;
   set table11_1;
   array week(8) week1-week8;
   do time=1 to 8;
      score = week(time);
      output;
   end;
   drop week1-week8;
run;
```

We also need to convert the categorical variable into dummy variables:

```sas
data table11_3a;
   set table11_1long;
   if group='B' then do; a2=1; a3=0; end;
   if group='C' then do; a2=0; a3=1; end;
   if group='A' then do; a2=0; a3=0; end;
run;
```

Ignoring, for now, the dependence among the scores at different times, we first assume they are independent. Including the interaction of Time and Group, the model can be fit by:

```sas
proc glm data=table11_3a;
   model score = time a2 a3 time*a2 time*a3 / solution ss3;
run; quit;
```

This model includes the interaction of Group and Time, so it is a saturated model. It is equivalent to regressing Score on Time separately within each Group.
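The wide-to-long rearrangement done by the DATA step's array and DO loop can be sketched in Python (illustrative only; the analysis itself is in SAS), using the first subject from each of groups A and B:

```python
# Wide records: one row per subject with week1-week8 scores (from table11_1)
wide = [
    {"Subject": 1, "Group": "A", "weeks": [45, 45, 45, 45, 80, 80, 80, 90]},
    {"Subject": 9, "Group": "B", "weeks": [40, 55, 60, 70, 80, 85, 90, 90]},
]

# One output row per (subject, time), as the DATA step's DO loop produces
long_rows = [
    {"Subject": r["Subject"], "Group": r["Group"], "time": t + 1, "score": s}
    for r in wide
    for t, s in enumerate(r["weeks"])
]

print(len(long_rows))   # 2 subjects x 8 weeks = 16 rows
print(long_rows[0])
```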
So, it can be equally achieved by:

```sas
proc reg data=table11_1long;
   by group;
   model score = time;
run;
```

Sometimes an exploratory analysis (also called data reduction or data summary) is required. It consists of summarizing the response profiles for each subject by a small number of descriptive statistics, based on the assumption that measurements on the same subject are independent. Here the intercept and slope are used as the summary statistics:

```sas
proc reg data=table11_1long;
   by subject;
   model score = time;
   ods output parameterestimates = table11_4(keep=subject variable estimate stderr);
run; quit;

proc tabulate data=table11_4;
   var estimate stderr;
   class variable subject / order=data;
   table subject='Subject'*sum='',
         (variable='')*(estimate stderr)*f=6.3 / rts=20 row=float;
run;
```

This produces the following summary table:

| Subject | Intercept Estimate | Intercept Std Error | time Estimate | time Std Error |
|---|---|---|---|---|
| 1 | 30.000 | 7.289 | 7.500 | 1.443 |
| 2 | 15.536 | 4.099 | 3.214 | 0.812 |
| 3 | 39.821 | 3.209 | 6.429 | 0.636 |
| 4 | 11.607 | 3.387 | 8.393 | 0.671 |
| 5 | 100.000 | 0.000 | 0.000 | 0.000 |
| 6 | 0.893 | 5.304 | 11.190 | 1.050 |
| 7 | 15.357 | 4.669 | 7.976 | 0.925 |
| 8 | 25.357 | 1.971 | 5.893 | 0.390 |
| 9 | 38.571 | 3.522 | 7.262 | 0.698 |
| 10 | 61.964 | 2.236 | 2.619 | 0.443 |
| 11 | 14.464 | 5.893 | 9.702 | 1.167 |
| 12 | 26.071 | 2.147 | 2.679 | 0.425 |
| 13 | 48.750 | 8.927 | 5.000 | 1.768 |
| 14 | 10.179 | 3.209 | 1.071 | 0.636 |
| 15 | 31.250 | 1.948 | 2.500 | 0.386 |
| 16 | 34.107 | 2.809 | 3.810 | 0.556 |
| 17 | 21.071 | 2.551 | 1.429 | 0.505 |
| 18 | 34.107 | 1.164 | 0.893 | 0.231 |
| 19 | 32.143 | 1.164 | 1.607 | 0.231 |
| 20 | 42.321 | 3.698 | 7.262 | 0.732 |
| 21 | 48.571 | 6.140 | 7.262 | 1.216 |
| 22 | 24.821 | 1.885 | 2.262 | 0.373 |
| 23 | 22.321 | 1.709 | 1.845 | 0.338 |
| 24 | 13.036 | 4.492 | 6.548 | 0.890 |

The last part is an analysis of variance for the intercepts and slopes. Table11_4 contains only the four variables 'subject', 'variable', 'estimate', and 'stderr', while table11_1long contains the variables 'time' and 'group'.
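As a spot check of the summary table, the following Python sketch (illustrative; the data values come from table11_1 above) fits the least-squares line for Subject 1 and reproduces the tabulated intercept and slope:

```python
def ols(x, y):
    """Simple least-squares fit y = a + b*x; returns (intercept, slope)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sxy / sxx
    return ybar - b * xbar, b

# Subject 1 from table11_1: weekly Barthel scores over weeks 1-8
time = list(range(1, 9))
score = [45, 45, 45, 45, 80, 80, 80, 90]
intercept, slope = ols(time, score)
print(intercept, slope)   # 30.0 and 7.5, matching the first row of the table
```

The same calculation, repeated for each subject, is what the BY-subject PROC REG step performs.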
We need to combine these two data sets:

```sas
proc sql;
   create table table11_5 as
   select table11_4.*, table11_1long.group
   from table11_4, table11_1long
   where table11_4.subject = table11_1long.subject;
quit;
```

Then the analysis of the intercepts can be done by:

```sas
proc genmod data=table11_5;
   class group (ref='A' param=ref);
   where variable='Intercept';
   model estimate = group;
run; quit;
```

The analysis of the slopes can be done by:

```sas
proc genmod data=table11_5;
   class group (ref='A' param=ref);
   where variable='time';
   model estimate = group;
run; quit;
```

Here we use PROC GENMOD because it allows us to use categorical variables directly and lets us choose the reference level. We do not use PROC GLM because it offers no choice of reference level in the regression. In the book, the author uses PROC REG. If we use PROC REG, we need dummy variables rather than the categorical variable, so we first combine table11_4 with table11_3a:

```sas
proc sql;
   create table table11_5 as
   select distinct table11_4.*, x.a2 as a2, x.a3 as a3
   from table11_4, table11_3a as x
   where table11_4.subject = x.subject;
quit;

proc reg data=table11_5;
   where variable='Intercept';
   model estimate = a2 a3;
run; quit;

proc reg data=table11_5;
   where variable='time';
   model estimate = a2 a3;
run; quit;
```

This gives the same result as PROC GENMOD. As we know, the independence assumption for repeated measures is usually not suitable. A more appropriate model for these data should account for the dependence of the scores across the eight times; a Generalized Estimating Equations (GEE) model would be more reasonable.

### Linear Regression with Categorical Predictors and Its Interactions

The data set we use is elemapi2. The variable mealcat is the percentage of free meals in 3 categories (mealcat=1, 2, 3); collcat is another three-level categorical variable.

1: Regression with One Categorical Predictor

Since mealcat is categorical, we convert it to dummy variables for linear regression.
That is, mealcat1=1 if mealcat=1, else mealcat1=0; mealcat2=1 if mealcat=2, else mealcat2=0; mealcat3=1 if mealcat=3, else mealcat3=0. The regression of api00 on mealcat2 and mealcat3 (with mealcat1 as the reference) is:

```sas
proc reg data=elemdum;
   model api00 = mealcat2 mealcat3;
   plot predicted.*mealcat;
run;
```

In the output, the intercept is 805.72. Since mealcat=1 is the reference group, the mean of api00 at mealcat=1 is 805.72. The coefficient for mealcat2 is the difference between mealcat=2 and mealcat=1, so the mean of api00 at mealcat=2 is 805.72 - 166.32 = 639.40. The coefficient for mealcat3 is the difference between mealcat=3 and mealcat=1, so the mean of api00 at mealcat=3 is 805.72 - 301.34 = 504.38. The result can be verified by:

```sas
proc means data=elemdum;
   var api00;
   class mealcat;
run;
```

| mealcat | 1 | 2 | 3 |
|---|---|---|---|
| mean of api00 | 805.72 | 639.40 | 504.38 |

This is the same as the ANOVA result. Since PROC GLM automatically chooses the last category as the reference, it shows the same information as above.

2: Regression with Two Categorical Variables

1) Now we consider two categorical variables, mealcat and collcat. The regression is:

```sas
proc reg data=elemdum;
   model api00 = mealcat1 mealcat2 collcat1 collcat2;
run; quit;
```

Consider the 3×3 grid of cells defined by the two variables. With respect to mealcat, mealcat=3 is the reference; with respect to collcat, collcat=3 is the reference. So cell 9 is the reference cell, and the intercept is the predicted value for this cell. The coefficient for mealcat2 is the difference between cell 8 and cell 9; it is also the difference between cell 5 and cell 6, and the difference between cell 2 and cell 3. We can explain mealcat1 in the same way. The coefficient for collcat2 is the difference between cell 6 and cell 9; it is also the difference between cell 5 and cell 8, and the difference between cell 4 and cell 7. We can explain collcat1 in the same way. But if you calculate the mean for each cell, it will be different from the predicted value here. Why?
Because the model contains only main effects, it imposes the restriction that the differences between certain cells are equal, as stated above.

2) Next we can use ANOVA to get the same result:

```sas
proc glm data=elemdum;
   class mealcat collcat;
   model api00 = mealcat collcat / solution ss3;
run;
```

The result is the same as with dummy variables.

3: Regression with Interactions of Categorical Variables

1) As shown above, the main-effects model assumes that the differences between certain cells are equal. To remove this restriction, we now consider interactions of the two categorical variables. First we need to create the dummy interactions, that is, the interactions of mealcat and collcat. The SAS code is:

```sas
data elemdum;
   set sasreg.elemapi2 (keep=api00 some_col mealcat collcat);
   array mealdum(3) mealcat1-mealcat3;
   array colldum(3) collcat1-collcat3;
   array mxc(3,3) m1xc1-m1xc3 m2xc1-m2xc3 m3xc1-m3xc3;
   do i=1 to 3;
      do j=1 to 3;
         mealdum(i) = (mealcat=i);
         colldum(j) = (collcat=j);
         mxc(i,j) = mealdum(i)*colldum(j);
      end;
   end;
   drop i j;
run;
```

We can check the indicators with PROC FREQ:

```sas
proc freq data=elemdum;
   table mealcat*collcat
         * m1xc1*m1xc2*m1xc3*m2xc1*m2xc2*m2xc3*m3xc1*m3xc2*m3xc3
         / nopercent nocum list;
run;
```

The LIST option displays two-way to n-way tables in a list format rather than as crosstabulation tables. Now we add the dummy variables and interactions to the regression, again with mealcat=3 and collcat=3 as the reference:

```sas
proc reg data=elemdum;
   model api00 = mealcat1 mealcat2 mealcat3
                 collcat1 collcat2 collcat3
                 m1xc1--m3xc3;
run;
```

From the output we can see that (a) some variables have no coefficient estimates because they are at the reference level, and (b) the interactions m2xc1 and m2xc2 are not significant. One important question is what the interaction means. In the model without an interaction, the coefficient Bmealcat1 is the difference between group mealcat=1 and the reference group mealcat=3. With interactions, it is more complicated.
The presence of an interaction implies that the difference between mealcat=1 and mealcat=3 depends on the level of collcat. For example, the interaction term Bm1xc1 is the extent to which the difference between mealcat=1 and mealcat=3 changes when collcat=1 compared to collcat=3. That is, Bm1xc1 = (c1 - c7) - (c3 - c9), where the c's denote the cell means. So, if you want to test the effect of collcat=1 versus collcat=3 when mealcat=1, you need to test whether (Intercept + Bmealcat1 + Bcollcat1 + Bm1xc1) - (Intercept + Bmealcat1) = Bcollcat1 + Bm1xc1 is zero or not. The test is significant, indicating that the difference between collcat levels 1 and 3 is significant for the mealcat=1 group.

2) The same result can be obtained by ANOVA:

```sas
proc glm data=elemdum;
   class mealcat collcat;
   model api00 = mealcat collcat mealcat*collcat / solution;
   lsmeans mealcat*collcat / slice=collcat;
run;
```

Finally, we can verify the result by calculating the average of each cell with PROC TABULATE, or with PROC SQL:

```sas
proc sql;
   create table temp as
   select mean(api00) as avg, collcat, mealcat
   from elemdum
   group by mealcat, collcat
   order by mealcat, collcat;
quit;
```

### Regression Diagnostics

The following questions are from the webbook "Regression with Stata" from UCLA ATS. The original questions require Stata; here I try to use SAS to answer them.

Question 1: The following data set (davis) consists of the measured weight, measured height, reported weight, and reported height of some 200 people. We build a model to predict measured weight from reported weight, reported height, and measured height. Draw a leverage vs. residual-squared plot after the regression. Explain what you see in the graph and use other SAS commands to identify the problematic observation(s). What do you think the problem is, and what is your solution?
Answer 1: First we do the linear regression in SAS:

```sas
proc reg data=sasreg.davis;
   model measwt = measht reptwt reptht;
   output out=reg1davis h=lev r=r rstudent=rstudent p=p
          cookd=cookd dffits=dffits;
run;
```

1) Then we draw the graph the question asks for: leverage vs. normalized residual squared.

```sas
data rsquare;
   set reg1davis;
   rsquare = r**2;
run;

proc means data=rsquare;
   var rsquare;
   output out=rsquare1 mean=means std=stds;
run;

data davis1;
   set rsquare1;
   rsquarestd = 0;
   do until (eof);
      set rsquare end=eof nobs=ntotal;
      rsquarestd = (rsquare - means) / stds;
      output;
   end;
run;

goptions reset=all;
axis1 label=(r=0 a=90);
symbol1 pointlabel=("#measwt") v=plus i=none c=blue height=1;
proc gplot data=davis1;
   plot lev*rsquarestd / vaxis=axis1;
run; quit;
```

From the graph we can see one point (measwt=166) that is far away from the other points; we will study it further.

2) Next we draw the scatter plot of measwt, measht, reptwt, and reptht. Again the measwt=166 point stands out.

3) Usually, when a studentized residual is greater than 3, the corresponding observation merits further study.

```sas
proc univariate data=reg1davis;
   var rstudent;
run;
```

We can see there is one studentized residual greater than 3. Printing the corresponding observation:

```sas
proc print data=reg1davis;
   var measwt rstudent;
   where abs(rstudent) > 3;
run;
```

It also shows that the measwt=166 observation is an outlier.

4) Leverage: generally, a point with leverage greater than (2k+2)/n should be carefully examined, where k is the number of predictors and n is the number of observations. In our question, it equals (2*3+2)/181 = 0.0442.

5) Cook's D: the higher the Cook's D, the more influential the point. The conventional cut-off point is 4/n.
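Both rules of thumb are simple arithmetic; a quick Python check mirrors the numbers quoted above:

```python
def leverage_cutoff(k, n):
    """Rule-of-thumb leverage threshold (2k+2)/n."""
    return (2 * k + 2) / n

def cooks_cutoff(n):
    """Conventional Cook's D threshold 4/n."""
    return 4 / n

# The davis regression uses 3 predictors and n = 181 observations.
print(round(leverage_cutoff(3, 181), 4))   # 0.0442, as computed in the text
print(round(cooks_cutoff(181), 4))         # 0.0221
```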
We list the points with Cook's D above the cut-off:

```sas
proc print data=reg1davis;
   var measwt measht reptwt reptht cookd;
   where cookd > (4/181);
run;
```

6) DFBETA: we can consider a more specific measure of influence by assessing how much a coefficient changes when a data point is deleted. This is more computationally intensive than a summary statistic such as Cook's D. In SAS, we use the ODS OUTPUT OutputStatistics statement to produce the DFBETAs for each of the predictors. A DFBETA value in excess of 2/sqrt(n) merits further investigation.

```sas
proc reg data=sasreg.davis;
   model measwt = measht reptwt reptht / influence;
   ods output outputstatistics=measwtdfbetas;
   id measwt;
run; quit;

proc print data=measwtdfbetas (obs=12);
   var measwt dfb_measht dfb_reptwt dfb_reptht;
run;
```

Based on this information, we conclude that the measwt=166 observation is an outlier. Probably we need to delete this point before fitting the regression model.

Question 2: Using the data from the last exercise, what measure would you use if you want to know how much change an observation would make on a coefficient for a predictor? For example, how much change would there be for the coefficient of the predictor reptht if we omit observation 12 from our regression analysis? What other measures would you use to assess the influence of an observation on the regression? What are the cut-off values for them?

Answer 2: DFBETA is the measure of how much change an observation makes on a coefficient for a predictor. The output above shows the DFBETAs for observation 12. The value of dfb_reptht is 24.2546, which means that by being included in the analysis (as compared to being excluded), the coefficient of reptht increases by 24.2546 standard errors; that is, the parameter increases by 24.2546 * 0.04197 ≈ 1.018. A DFBETA value in excess of 2/sqrt(n) merits further investigation.

Question 3: The following data file is called bbwt and is from Weisberg's Applied Regression Analysis.
It consists of the body weights and brain weights of some 60 animals. We want to predict brain weight from body weight, that is, a simple linear regression of brain weight against body weight. Show what you have to do to verify the linearity assumption. If you think it violates the linearity assumption, show some possible remedies that you would consider.

Answer 3: 1) First we test the linearity assumption. Usually we use a scatter plot to study the relation. Here we draw two plots:

```sas
proc reg data=sasreg.bbwt;
   model brainwt = bodywt;
   plot residual.*predicted.;
   plot brainwt*bodywt;
run; quit;
```

The first is the residual vs. predicted plot; the second is the scatter plot of brainwt vs. bodywt.

2) Another way to examine the relationship is to overlay a loess regression line, as below. SAS will output four graphs, one for each loess smoothing parameter.

```sas
proc loess data=sasreg.bbwt;
   model brainwt = bodywt / smooth = .1 .2 .3 .4;
   ods output outputstatistics=results;
run;

proc sort data=results;
   by smoothingparameter bodywt;
run;

goptions reset=all;
symbol1 v=dot i=none c=black;
symbol2 v=none i=join c=blue;
symbol3 v=none i=r c=red;
proc gplot data=results;
   by smoothingparameter;
   plot depvar*bodywt pred*bodywt depvar*bodywt / overlay;
run; quit;
```

Here we show the graph with smoothingparameter=.1. The graph shows that the relation is not linear.

3) Both graphs show that the linearity assumption does not hold, so we need to transform the data. Here we use a log transformation for both brainwt and bodywt:

```sas
data lbbwt;
   set sasreg.bbwt;
   lbrainwt = log(brainwt);
   lbodywt  = log(bodywt);
run;

proc reg data=lbbwt;
   model lbrainwt = lbodywt;
   plot residual.*predicted.;
   plot lbrainwt*lbodywt;
run; quit;
```

After the transformation, the relationship is much closer to linear.

Question 4: We did a regression analysis using the data file elemapi2. Continuing with that analysis, draw an added-variable plot (avplot) here.
Explain what an avplot is and what type of information you would get from the plot. If the variable full were put in the model, would it be a significant predictor?

Answer 4: A group of points can be jointly influential. An added variable plot is an attractive graphical method for revealing multiple influential points for a predictor. What we are looking for in an added variable plot are points that can exert substantial change on the regression line. From the avplot for the variable full, we can see the observation with school number 211 very low on the left. Deleting it would flatten the regression line a lot; in other words, it would decrease the coefficient of full considerably.

Question 5: The data set wage is from a national sample of 6000 households with a male head earning less than $15,000 annually in 1966. The data were classified into 39 demographic groups for analysis. We tried to predict the average hours worked by average age of respondent and average yearly non-earned income. Both predictors are significant. Now if we add ASSET to our predictor list, neither NEIN nor ASSET is significant. Can you explain why?
Answer 5: NEIN and ASSET are highly correlated, so neither is significant when both are used in the model. The problem here is multicollinearity.
proc corr data=sasreg.wage;
var age nein asset;
run;

We can also use the VIF values to double-check for multicollinearity.
proc reg data=sasreg.wage;
model hrs = age nein asset / vif;
run;

Question 6:  Continue to use the previous data set. This time we want to predict the average hourly wage by average percent of white respondents. Carry out the regression analysis and list the SAS commands that you can use to check for heteroscedasticity. Explain the result of your test(s).
Now we want to build another model to predict the average percent of white respondents by the average hours worked. Repeat the analysis you performed on the previous regression model. Explain your results.
Answer 6:  1)  First we run the linear regression and then draw the plot of residuals vs. predicted values. The graph does not show any special pattern in the residuals.
proc reg data=sasreg.wage;
model rate = race / spec;
output out=reg1wage r=r p=p rstudent=rstudent;
run;

goptions reset=all;
proc gplot data=reg1wage;
plot r*p;
run;

Next we can run the White test, a test for heteroscedasticity. The White test tests the null hypothesis that the variance of the residuals is homogeneous. Therefore, if the p-value is very small, we reject the null hypothesis and accept the alternative that the variance is not homogeneous. We use the / spec option on the MODEL statement to obtain the White test.

The p-value is greater than .05, so we cannot reject the null hypothesis; that is, we cannot reject that the variances are homogeneous.
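For the second model in this question (predicting the average percent of white respondents from the average hours worked), a minimal sketch mirroring the analysis above, assuming the variables are named race and hrs as in the earlier models:

```sas
/* Fit the second model and request the White test via / spec */
proc reg data=sasreg.wage;
model race = hrs / spec;
output out=reg2wage r=r p=p;
run;
quit;

/* Residual vs. predicted plot for the second model */
goptions reset=all;
proc gplot data=reg2wage;
plot r*p;
run;
quit;
```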

2) From the residual plot we can see the residuals fanning out. The White test confirms this, since the p-value is less than .05. So we conclude that the variance is not homogeneous.

Question 7:  We have a data set (tree) that consists of volume, diameter and height of some objects. Someone did a regression of volume on diameter and height. Explain what tests you can use to detect model specification errors and if there is any, your solution to correct it.
Answer 7: A model specification error can occur when one or more relevant variables are omitted from the model or one or more irrelevant variables are included in the model. If relevant variables are omitted from the model, the common variance they share with included variables may be wrongly attributed to those variables, and the error term is inflated. On the other hand, if irrelevant variables are included in the model, the common variance they share with included variables may be wrongly attributed to them. Model specification errors can substantially affect the estimate of regression coefficients.
The link test is used to check model specification. If a model is well specified, no new independent variable should turn out significant except by chance. A common method is to run a new regression of y on the fitted value (fv) and the squared fitted value (fv2). For a well-specified model, fv should be significant and fv2 should not. We show the result below:
proc reg data=sasreg.tree;
model vol = dia height;
output out=reg1tree (keep= vol dia height fv) predicted=fv;
run;

data reg1square;
set reg1tree;
fv2=fv**2;
run;

proc reg data=reg1square;
model vol = fv fv2;
run;
quit;

The output above shows that both fv and fv2 are significant; that is, our model vol ~ dia height is not well specified.
This is reasonable, since we know volume should be proportional to the square of the diameter times the height. So we next add an independent variable, dia2, the square of the diameter.
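This refit step can be sketched as follows (the data set names tree2, reg2tree, and reg2square are our own choices):

```sas
/* Create the squared-diameter term and refit the model */
data tree2;
set sasreg.tree;
dia2 = dia**2;
run;

proc reg data=tree2;
model vol = dia dia2 height;
output out=reg2tree predicted=fv;
run;
quit;

/* Repeat the link test on the new model */
data reg2square;
set reg2tree;
fv2 = fv**2;
run;

proc reg data=reg2square;
model vol = fv fv2;
run;
quit;
```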

Now we can see the model is well specified.

Introduction:

If a linear regression has only a single independent variable (IV), then a scatter plot is enough to show the relation between the IV and the DV (dependent variable). However, in multiple regression, although we can draw a scatter plot of the DV against each IV, this is not enough, since it does not take into account the effect of the other IVs in the model.

A partial regression plot attempts to show the effect of adding a variable to a model that already contains one or more independent variables. It is also called an added variable plot.

A partial regression plot for predictor X_i can be formed in these 3 steps:
1: Compute the residuals from the regression of the DV against all IVs except X_i;
2: Compute the residuals from the regression of X_i against the remaining IVs;
3: Plot the residuals from step 1 against the residuals from step 2.

Velleman and Welsch list the following useful properties for this plot:
1. The least squares linear fit to this plot has slope Beta_i and intercept zero.
2. The residuals from the least squares linear fit to this plot are identical to the residuals from the least squares fit of the original model (Y against all the independent variables including X_i).
3. The influences of individual data values on the estimation of a coefficient are easy to see in this plot.
4. It is easy to see many kinds of failures of the model or violations of the underlying assumptions (nonlinearity, heteroscedasticity, unusual patterns).

Implementation in SAS:

Here I use the dataset called crime. The regression model is crime ~ pctmetro pctwhite poverty single

The first way to draw the added variable plot is to use the PARTIAL option in PROC REG. That is:

proc reg data=sasreg.crime;
model crime = pctmetro pctwhite poverty single / partial;
run;
quit;
However, a graph drawn this way is rendered in the SAS listing output and is difficult to read and customize.

The second way follows the steps in the introduction above: we compute the residuals from the two regressions ourselves and then draw the graph by hand.

/*  Calculate Regression Residuals Separately  */
proc reg data=sasreg.crime;
model crime = pctmetro pctwhite poverty  ;
output out=result1 r=rcrime;
model  single  = pctmetro pctwhite poverty  ;
output out=result2 r=rsingle;
run;
quit;
/*  Sort the data since we need to merge them  */
proc sort data=result1; by crime; run;
proc sort data=result2; by crime; run;
/*  Merge the data  */
data result;
merge result1 result2;
by crime;
label rcrime="Residuals of crime v.s. pctmetro pctwhite poverty";
label rsingle="Residuals of single v.s. pctmetro pctwhite poverty";
run;

goptions reset=all;
symbol1 pointlabel=("#state") v=circle i=rl c=blue;
axis1 label=(r=0 a=90);
proc gplot data=result;
plot rcrime*rsingle / vaxis=axis1;
run;
quit;

In the same way, we can draw the partial regression plot for each IV in the model. It gives the same result as method 1, while giving us many more options to make the graph clear.

From Wikipedia: Partial regression plots are related to, but distinct from, partial residual plots. Partial regression plots are most commonly used to identify data points with high leverage and influential data points that might not have high leverage. Partial residual plots are most commonly used to identify the nature of the relationship between Y and Xi (given the effect of the other independent variables in the model).

These questions are from the web book 'Regression with Stata' from UCLA ATS. The original questions are meant to be answered with Stata; here I try to answer them with SAS.

Linear Regression

Question 1:  Use data set elemapi2. Make five graphs of api99: histogram, kdensity plot, boxplot, symmetry plot and normal quantile plot.
1:histogram / kernel density / normal density

2: kernel density

3: boxplot

4: QQ plot

5: normal probability plot
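A minimal sketch of these graphs, assuming the data set is available as sasreg.elemapi2 as elsewhere in these exercises (SAS has no dedicated symmetry-plot statement, so that plot is not shown here):

```sas
/* Histogram with normal and kernel density overlays,
   plus Q-Q and normal probability plots */
proc univariate data=sasreg.elemapi2;
var api99;
histogram api99 / normal kernel;
qqplot api99 / normal(mu=est sigma=est);
probplot api99 / normal(mu=est sigma=est);
run;

/* Boxplot */
proc sgplot data=sasreg.elemapi2;
vbox api99;
run;
```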

Question 2: What is the correlation between api99 and meals?
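A minimal PROC CORR step for this question, assuming the sasreg.elemapi2 data set used elsewhere in these exercises:

```sas
/* Pairwise correlation of api99 and meals */
proc corr data=sasreg.elemapi2;
var api99 meals;
run;
```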

From the output we can see that the correlation of api99 and meals is negative; the value is -.908. That is, as the percentage of free meals increases, the api99 value decreases.

Question 3: Regress api99 on meals. What does the output tell you?
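A minimal sketch of the regression, assuming the sasreg.elemapi2 data set:

```sas
/* Simple linear regression of api99 on meals */
proc reg data=sasreg.elemapi2;
model api99 = meals;
run;
quit;
```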

The p-value is < .0001; that is, meals is significant in the regression. The coefficient is negative, which means that as meals increases by one unit, the value of api99 decreases by 4.187.

Question 4: Create and list the fitted (predicted) values
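One way to create and list the fitted values, sketched here with the output data set name pred1 chosen for illustration:

```sas
/* Save the predicted values from the regression */
proc reg data=sasreg.elemapi2;
model api99 = meals;
output out=pred1 predicted=fv;
run;
quit;

/* List the first 20 fitted values */
proc print data=pred1 (obs=20);
var api99 meals fv;
run;
```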

The first 20 predicted values are:

Question 5: Graph meals and api99 with and without the regression line.

With Regression Line:

Without Regression Line (in SAS, with the / noline option on the PLOT statement)
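Both graphs can be requested in one step, sketched here using the traditional PLOT statement of PROC REG and its NOLINE option:

```sas
proc reg data=sasreg.elemapi2;
model api99 = meals;
plot api99*meals;            /* scatter plot with the regression line */
plot api99*meals / noline;   /* scatter plot without the line */
run;
quit;
```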

Question 6: Look at the correlations among the variables api99 meals ell avg_ed using the corr command. Explain how these commands are different. Make a scatterplot matrix for these variables and relate the correlation results to the scatterplot matrix.
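A minimal sketch of both steps, assuming the sasreg.elemapi2 data set; PROC SGSCATTER's MATRIX statement is one way to produce the scatterplot matrix:

```sas
/* Correlation matrix of the four variables */
proc corr data=sasreg.elemapi2;
var api99 meals ell avg_ed;
run;

/* Scatterplot matrix for the same variables */
proc sgscatter data=sasreg.elemapi2;
matrix api99 meals ell avg_ed;
run;
```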