These questions are from the webbook 'Regression of Stata' of ucla ats. The original questions require to use Stata to answer. Here I try to use SAS to answer these question.
Regression Diagnostics Question 1: The following data set (davis) consists of measured weight, measured height, reported weight and reported height of some 200 people. We tried to build a model to predict measured weight by reported weight, reported height and measured height. Draw an leverage v.s. residual square plot after the regression. Explain what you see in the graph and try to use other SAS commands to identify the problematic observation(s). What do you think the problem is and what is your solution?
Answer 1: First we do the linear regression in SAS:
proc reg data=sasreg.davis;
model measwt = measht reptwt reptht ;
output out=reg1davis h=lev r=r rstudent=rstudent p=p cookd=cookd dffits=dffits;
1) Then we draw the graph the question give: leverage v.s. normalized residual square.
proc means data=rsquare;
output out=rsquare1 mean=means std=stds;
do until (eof);
set rsquare end=eof nobs=ntotal;
axis1 label=(r=0 a=90);
symbol1 pointlabel=("#measwt") v=plus i=none c=blue height=1;
proc gplot data=davis1;
plot lev*rsquarestd / vaxis=axis1;
The corresponding graph is:
From the graph above we can see there is a point (measwt=166 is far away from the other points.) we will study it furthermore.
2) Next we drat the scatter plot of measwt measht reptwt reptht.
Also we can see the measwt=166 point stands out.
3) Usually when the student residual is greater than 3, then the corresponding data merit further study.
proc univariate data=reg1davis;
We can see there is one student residual greater than 3. Print the corresponding measwt, it is:
proc print data=reg1davis;
var measwt rstudent;
where abs(rstudent) > 3;
It also shows that measwt=166 is an outlier.
4) Leverage: Generally, a point with leverage greater than (2k+2)/n should be carefully examined, where k is the number of predictors and n is the number of observations. In our question, it equals to (2*3+2)/181=0.0442.
5) Cook’s D: the higher the Cook's D is, the more influential the point is. The conventional cut-off point is 4/n. We list the points with Cook’s D above the cut-off point:
proc print data=reg1davis;
var measwt measht reptwt reptht cookd;
where cookd > (4/181);
6) DFBETA: We can consider more specific measure of influence by access how much a coefficient will change by deleting that data point. This is more computationally intensive than the summary statistics such as CookD. In SAS, we need to use the ods output OutStatistics statement to produce the DFBETAs for each of the predictors. A DFBETA value in excess of 2/sqrt(n) merits further investigation.
proc reg data=sasreg.davis ;
model measwt = measht reptwt reptht / influence;
ods output outputstatistics=measwtdfbetas;
proc print data=measwtdfbetas (obs=12);
var measwt dfb_measht dfb_reptwt dfb_reptht;
Based on these information above, we will say that measwt=166 is an outlier. Probably we need to delete this point for the regression model.
Question 2: Using the data from the last exercise, what measure would you use if you want to know how much change an observation would make on a coefficient for a predictor? For example, show how much change would it be for the coefficient of predictor reptht if we omit observation 12 from our regression analysis? What are the other measures that you would use to assess the influence of an observation on regression? What are the cut-off values for them?
Answer 2: DFBETA is the measure of how much change an observations would make on a coefficient for a predictor. From above graph we show the DFBETA for observation 12.
The value of dfb_reptht is 24.2546, which means that by being included in the analysis (as compared to being excluded), the coefficient of reptht will increase by 24.2546 standard errors. That is, the parameter will increase by 24.2514*.04197=1.0178.
A DFBETA value in excess of 2/sqrt(n) merits further investigation.
Question 3: The following data file is called bbwt and it is from Weisberg's Applied Regression Analysis. It consists of the body weights and brain weights of some 60 animals. We want to predict the brain weight by body weight, that is, a simple linear regression of brain weight against body weight. Show what you have to do to verify the linearity assumption. If you think that it violates the linearity assumption, show some possible remedies that you would consider.
Answer 3: 1) first we will do the linear assumption test. Usually we use scatter plot to study the relation. Here we draw two plots.
proc reg data=sasreg.bbwt;
model brainwt = bodywt;
The first is residual v.s. predicted plot:
The second is scatter plot of brainwt v.s. bodywt.
2) A way to modify the regression is to use lowess regression line as below. SAS will output 4 graphs for the corresponding 4 lowess smoothing parameter.
proc loess data=sasreg.bbwt;
model brainwt = bodywt / smooth = .1 .2 .3 .4;
ods output outputstatistics=results;
proc sort data=results;
by smoothingparameter bodywt;
symbol1 v=dot i=none c=black;
symbol2 v=none i=join c=blue;
symbol3 v=none i=r c=red;
proc gplot data=results;
plot depvar*bodywt pred*bodywt depvar*bodywt / overlay;
Here we show one graph with smoothingparameter=.1. From the graph it shows the relation is not linear.
3) From both graph it shows the linear assumption is not good. So we need to make some transformations of the data. Here we use log-transformation for both brainwt and bodywt.
proc reg data=lbbwt;
model lbrainwt = lbodywt;
Now we can see after transformation they are more linear than before.
Question 4: We did a regression analysis using the data file elemapi2. Continuing with the analysis we did, draw an additional variable plot (avplot) here. Explain what an avplot is and what type of information you would get from the plot. If variable full were put in the model, would it be a significant predictor?
Answer: A group of points can be jointly influential. An additional variable plot is an attractive graphic method to present multiple influential points on a predictor. What we are looking for in an additional variable plot are those points that can exert substantial change to the regression line.
From the avplot we can see for the variable full, the observation with sch00l number 211 is very low on the left. Deleting it will flatten the regression a lot. In other words, it will decrease the coefficient for variable full a lot.
Question 5: The data set wage is from a national sample of 6000 households with a male head earning less than $15,000 annually in 1966. The data were classified into 39 demographic groups for analysis. We tried to predict the average hours worked by average age of respondent and average yearly non-earned income. Both predictors are significant. Now if we add ASSET to our predictors list, neither NEIN nor ASSET is significant. Can you explain why?
Answer 5: The reason is NEIN and ASSET are highly correlated which result in they are not significant when both are used in the model. So the problem here is Multicollinearity.
proc corr data=sasreg.wage;
var age nein asset;
We can also the VIF value to double check multicollinearity.
proc reg data=sasreg.wage;
model hrs = age nein asset / vif;
Question 6: Continue to use the previous data set. This time we want to predict the average hourly wage by average percent of white respondents. Carry out the regression analysis and list the SAS commands that you can use to check for heteroscedasticity. Explain the result of your test(s).Now we want to build another model to predict the average percent of white respondents by the average hours worked. Repeat the analysis you performed on the previous regression model. Explain your results. Answer 6: 1) First we do the linear regression and then draw the plot of residuals v.s. predicted value. From the graph it doesn’t show any special pattern of the residuals.
proc reg data=sasreg.wage;
model rate = race / spec;
output out=reg1wage r=r p=p rstudent=rstudent;
proc gplot data=reg1wage;
Next we can do the White Test. A test for heteroscedasticity, the White test. The White test tests the null hypothesis that the variance of the residuals is homogenous. Therefore, if the p-value is very small, we would have to reject the hypothesis and accept the alternative hypothesis that the variance is not homogenous. We use the / spec option on the model statement to obtain the White test.
p-value is greater than .05. So we couldn’t reject the null hypothesis. That is, we couldn’t reject that the variance are homogeneous.
2) From the residual plot we can see the residuals blast out. Also, the white test verify this since p-value is less than .05. So we think the variance is not homogenous.
Question 7: We have a data set (tree) that consists of volume, diameter and height of some objects. Someone did a regression of volume on diameter and height. Explain what tests you can use to detect model specification errors and if there is any, your solution to correct it. Answer 7: A model specification error can occur when one or more relevant variables are omitted from the model or one or more irrelevant variables are included in the model. If relevant variables are omitted from the model, the common variance they share with included variables may be wrongly attributed to those variables, and the error term is inflated. On the other hand, if irrelevant variables are included in the model, the common variance they share with included variables may be wrongly attributed to them. Model specification errors can substantially affect the estimate of regression coefficients.Link test is used for model specification. If a model a well specified, one shouldn't get any new independent variables that is significant. For model specification, a common method is to make a new regression of y depends on the fitted value (fv) and squared fitted value (fv2). For the well specified model, fv should be significant and fv2 should not. We will show the result as below:
proc reg data=sasreg.tree;
model vol = dia height;
output out=reg1tree (keep= vol dia height fv) predicted=fv;
proc reg data=reg1square;
model vol = fv fv2;
From the graph above, it shows both fv and fv2 are significant. That is, our model of vol~dia height is not well specified. It's reasonable since we know volume should be proportional to the square of diameter and height. So, we next add a independent variable dia2 as square of diameter.
Now we can see the model is well specified.