linear regression

November 10, 2021
 

If you are thinking that nobody in their right mind would implement a Calculator API Service with a machine learning model, then yes, you’re probably right. But curiosity is in my DNA, it sometimes works this way, and machine learning is fun. I challenged myself to do it, not to promote such an experiment into production, but simply to complete a self-challenge that can be easily achieved with the resources provided by SAS Viya, particularly SAS Model Studio and SAS Intelligent Decisioning.

So, first, let’s define the purpose of my challenge:

deploy a basic calculator API service capable of executing the following operations for two input decimal numbers: addition, subtraction, multiplication, and division

The challenge must be executed under these restrictions:

  • Usage of a machine learning model as the “compute-engine” of the calculator
  • Development under the low code / no code paradigm
  • The end-to-end setup and execution process should not take more than a couple of hours

Use the following tools in SAS Viya:

  • Model Studio
  • Intelligent Decisioning
  • Simple web app (set up not covered)

The plan

The steps I identified to complete my challenge are:
Step 1 - Create a machine learning model representing the compute-engine of my calculator (Model Studio)
Step 2 - Determine how to process other mathematical operations
Step 3 - Embed the needed logic into the decision to perform the calculator operations (Intelligent Decisioning)
Step 4 - Publish the artifact created as an API service (web app created outside of this post)

Step 1. Create a machine learning model as the compute-engine

Our first step is to create a model. We start with the addition operation and build from there. We’ll perform the addition by normal means of adding two numbers. Next, we’ll apply some extra logic which will perform subtraction, multiplication, and division. The diagram below represents the process:

A machine learning model is built from a data set from which it learns what to do. I want my model to learn the addition of two numbers. I created a training data set in Excel with 100 records, each containing two random numbers between 0 and 1 and their sum. The image below displays the general setup:
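If you prefer to build the training data in SAS rather than Excel, the following DATA step is a minimal sketch that produces an equivalent table; the variable names X1, X2, and Target are illustrative, not the names used in the original project.

/* Hedged sketch of the training data: 100 rows, two uniform random operands, and their sum */
data train_add;
   call streaminit(12345);          /* fix the seed so the data set is reproducible */
   do i = 1 to 100;
      X1 = rand("Uniform");         /* first operand, between 0 and 1 */
      X2 = rand("Uniform");         /* second operand, between 0 and 1 */
      Target = X1 + X2;             /* the sum the model should learn */
      output;
   end;
   drop i;
run;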

The algorithm / model I chose for my compute engine is linear regression. Linear regression is a simple machine learning algorithm based on the following formula:

y = a + b·X1 + c·X2 + d·X3 + … + n·Xn

Where:

  • y is the result of the model execution – the result of the addition operation
  • X1, X2, X3, …, Xn are the input variables – for the calculator, there will only be X1 and X2 as operands
  • a, b, c, d, …, n are the parameters the machine learning process determines to create the model
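Because the training data encodes Target = X1 + X2 exactly, the learning process should recover an intercept (a) near 0 and coefficients (b and c) near 1. As a hedged stand-in for the Model Studio pipeline, an ordinary least-squares fit on the sketched training data above illustrates the same idea:

/* Plain least-squares fit as a stand-in for the Model Studio pipeline;
   the estimates should be approximately a = 0, b = 1, c = 1 */
proc reg data=train_add plots=none;
   model Target = X1 X2;
run;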

So, with the training data set created, I open a new machine learning project in SAS Viya Model Studio, select the data set from which the algorithm will learn, assign the target variable, add a linear regression node and a test node, and click “Run pipeline”. Note: if following along in your own environment, make sure to use Selection Method = Adaptive LASSO and toggle Suppress intercept = On in the linear regression node. The resulting model resembles the following:

You can learn more about model creation in the How to Deploy Models in SAS tutorial on the SAS Users YouTube channel.

Once the pipeline completes and I review the initial results, the model seems to behave properly; but when I test specific operands whose result should be zero, I realize the model falls short:

These are the operations with zero as the result. Perhaps the algorithm hasn’t learned from the proper data, so I replace the first seven records of my initial data set with the following operations involving zeros:

Running the pipeline again and letting the magic work, voilà! The model has learned to handle the zeros and properly sums two input numbers. When I check the results, I verify that the values calculated (predicted) by the model match the original values in the training and test data sets. So now, I am sure my new model is ready for use as my calculator compute engine.

Now that I have my compute engine (model) ready, it’s time to use it. We know it can perform the sum operation perfectly, so how do we perform the rest of the operations? We’ll take the sum model, move it into SAS Intelligent Decisioning, and add rules to handle the rest of the operations.

First, let’s explore the logic that builds the operations from our original model. This is where mathematics comes into play (Step 2). After exploring the operations, we'll look at the decision model where we'll define the logic to run them (Step 3).

Addition

Execute the model with the two input numbers, with no additional logic.

Subtraction

By changing the sign of the second input, the model does the rest.

That’s a simple enough solution for subtraction, but how do we handle multiplication and division? Let’s take a look.

Multiplication and division

How can I perform a multiplication or a division operation if my compute engine only executes the addition operation? Here we can apply the magic of logarithmic properties stating:

  • log (x*y) = log(x) + log(y)
  • log (x/y) = log(x) – log(y)

Following this logic, if I want to multiply two numbers, I calculate the logarithm of each one and perform the addition operation in my compute engine. I follow this up by applying the exponential function to reverse the logarithm. The image below outlines the workflow.

For the division, it is the same process, but changing the sign of the second logarithm to the subtraction operation.
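To make the wrapping logic concrete, here is a minimal DATA step sketch of how the four operations reduce to a single addition; the addition performed by the model appears inline as “+”, and the variable names and values are purely illustrative.

/* Hedged sketch: every operation is expressed as one addition of two numbers */
data ops_sketch;
   a = 6; b = 3;                            /* example operands */
   sum        = a + b;                      /* addition: send the operands as-is */
   difference = a + (-b);                   /* subtraction: flip the sign of the second operand */
   product    = exp( log(a) + log(b) );     /* multiplication: log(x*y) = log(x) + log(y) */
   quotient   = exp( log(a) - log(b) );     /* division: log(x/y) = log(x) - log(y) */
   put sum= difference= product= quotient=;
run;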

Additional Logic

There are also certain cases requiring special treatment. For example, a division by zero generates an error, a logarithm of zero or a negative number cannot be calculated, and the multiplication of two numbers is zero if at least one of them is zero.
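In the decision, these special cases become rules that run before the model is called. The sketch below expresses them as DATA step logic purely for illustration; the operation names and the flag variable are assumptions, not the names used in the actual rule sets.

/* Hedged sketch of the guard rules that run before the compute engine */
data guard_sketch;
   a = 0; b = 5; operation = "multiply";    /* example inputs */
   error_flag = 0;
   if operation = "divide" and b = 0 then error_flag = 1;                /* division by zero */
   else if operation = "multiply" and (a = 0 or b = 0) then result = 0;  /* zero product shortcut */
   else if operation in ("multiply", "divide") and (a <= 0 or b <= 0)
      then error_flag = 1;                                               /* log() needs positive operands */
   put error_flag= result=;
run;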

Let's now build the calculator in SAS Intelligent Decisioning, including the operations, the model, and the extra logic.

Step 3 - Embed the needed logic into the decision to perform the calculator operations

The diagram below represents the Decision flow for our calculator example. Each node is numbered and followed by a definition.

0 - Overall decision flow - definition doc found here on GitHub
1 - Determine if zero is a factor for multiplication or division operations - definition doc found here on GitHub
2 - Decision handles value of previous step - set Variable = Decision_Internal_EndProcess = True
3 - Process calculations based on operation value - definition doc found here on GitHub
4 - Calculator linear regression model created earlier in the post - model definition file found here on GitHub
5 - Additional logic to finalize processing on multiplication and division operations - definition doc found here on GitHub

Step 4 - Publish the final artifact created as an API service

After completing all the work on the decision, click on the Publish button and the Calculator is ready to be consumed via an API.

A colleague of mine created a simple web application which calls models using SAS Viya microservice APIs. I'll use this web app to display the results of my calculator. For brevity, I won't cover the details of the app. If you'd like to see how to create a web app using SAS APIs, I recommend the Build a web application using SAS Compute Server series on the SAS Communities.

The app allows me to choose my decision flow, add my operands and indicate an operation as seen below.

I tested with several operand and operation combinations and they all checked out. It worked!

Final Thoughts

I can consider my self-challenge solved. Through this example we accomplished the following:

  • The Calculator API Service can perform the four operations based on a Machine Learning Model.
  • I created a simple Machine Learning Model to perform the addition of two decimal numbers from a 100-record data set.
  • The model and the extra logic needed to perform the operations were developed under the low code / no code paradigm.
  • I used the visual interface to generate the model and the extra logic, in conjunction with the expression builder, to apply the logarithm and exponential operations.
  • The overall process has taken no more than a couple of hours.

Apart from the usefulness of this API, my principal takeaways of this self-challenge are:

  • In this case, building my data set to obtain the exact behavior I wanted for my model was quite straightforward.
  • Building the model through the Graphical User Interface was easy and fast.
  • Having the capacity to embed the models with extra logic under the low code / no code paradigm provides “supercharged intelligence” to the model.
  • The publishing feature of the whole artifact as an API service is great, providing instant value to the consumers.
  • SAS Viya is a great platform as it provides all the pieces needed to satisfy your analytical business needs as well as your “curiosity needs”.

 

How I used a SAS ML model and Intelligent Decisioning to build a calculator was published on SAS Users.

August 11, 2021
 

One of the benefits of using the SWEEP operator is that it enables you to "sweep in" columns (add effects to a model) in any order. This article shows that if you use the SWEEP operator, you can compute a SSCP matrix and use it repeatedly to estimate any linear regression model that uses those effects.

This post relates to a previous post, in which I showed that you never need to use a permutation matrix to rearrange the order of rows and columns of a matrix. As an application, I used the sum of squares and crossproducts (SSCP) matrix, which is used to estimate the regression coefficients in a least-squares regression model. Being able to permute the rows and columns of the SSCP matrix efficiently means that you can solve the linear regression problem very quickly for any ordering of the columns of the design matrix. But if you use the SWEEP operator, you do not need to permute the SSCP matrix at all!

The SSCP matrix in PROC REG

Before discussing the SWEEP operator, let's review a little-used feature of PROC REG in SAS. The REG procedure is interactive, which means that you can interactively add or delete effects from a model as part of the model-building process. However, if you are going to explore several different models, you must use the VAR statement to specify all variables that you might conceivably want to use BEFORE the first RUN statement. Why? Because when PROC REG encounters the RUN statement it builds the SSCP matrix for the variables that you have specified. As stated in the documentation, for each subsequent model, "the procedure selects the appropriate crossproducts from the [SSCP] matrix." In other words, PROC REG re-uses that SSCP matrix for every MODEL statement.

Let's look at an example. The following call to PROC REG computes the SSCP matrix for five variables and the intercept effect. It uses that SSCP matrix to compute the parameter estimates:

proc reg data=sashelp.cars plots=NONE;
   ods select ParameterEstimates;
   VAR MSRP EngineSize Horsepower MPG_Highway Weight;            /* vars in the SSCP */
   FULL:    model MSRP = EngineSize Horsepower MPG_Highway Weight;  /* uses the SSCP */
   run;

Because PROC REG is an interactive procedure, the procedure does not exit when it encounters the RUN statement. It remembers the SSCP matrix until the procedure exits. If you submit additional MODEL statements, PROC REG will use the SSCP matrix to estimate the new parameters, as long as the variables were previously specified. For example, you can specify the same model, but list the variables in a different order. Or you can estimate a reduced model that contains only a subset of the variables, as follows:

   PERMUTE: model MSRP = Horsepower Weight EngineSize MPG_Highway;
   run;
   PARTIAL: model MSRP = Weight Horsepower;
run;
quit;

The output for the "PERMUTE" model is not shown, because the estimates are the same as for the "FULL" model, but in a different order. It is important to note that PROC REG estimates the second and third models without re-reading the data. It simply re-uses the original SSCP matrix. As I have previously shown, computing the SSCP matrix is often 90% of the work of computing regression estimates, which means that the second and third models are computed very quickly.

The next section shows how the SWEEP operator can quickly compute the parameter estimates for any model and for any order of the variables. There is no need to physically permute the rows and columns of the SSCP matrix.

The SWEEP operator

You can use the SAS/IML language to illuminate how PROC REG can compute one SSCP matrix and re-use it for all subsequent models. The following SAS/IML statements read the data and compute the SSCP matrix. Following Goodnight (1979), the last column contains the response variable, but this is merely a convenience.

proc iml;
varNames = {'EngineSize' 'Horsepower' 'MPG_Highway' 'Weight'};
use sashelp.cars;
   read all var varNames into X;
   read all var 'MSRP' into Y;
close;
/* add intercept column and response variable */
X = j(nrow(X), 1, 1) || X;   /* the design matrix */
k = ncol(X);                 /* number of effects */
 
varNames = 'Intercept' || varNames || 'MSRP';
XY =  X || Y;
SSCP = XY` * XY;             /* the only SSCP that we'll need */
 
/* by using the sweep operator (1st row=mean, last col=param est) */
S = sweep(SSCP, 1:k); 
rCol = nrow(SSCP);           /* the response column */
b = S[1:k, rCol];            /* parameter estimates for regression coefficients */
print b[r=varNames];

The parameter estimates are the same as computed by PROC REG. They were obtained by sweeping the columns of the SSCP matrix in the order 1, 2, ..., 5. You can think of the vector v=1:k as the "identity permutation" of the rows. Then the SWEEP operator for the variables in their original order is the operation S = sweep(SSCP, v).

The SWEEP function in SAS/IML enables you to specify any vector of indices as the second argument. In this way, you can sweep the columns of the SSCP matrix in any order. For example, the "PERMUTE" model from the PROC REG example corresponds to sweeping the SSCP matrix in the order v = {1 3 5 2 4}. As the following example shows, you get the same parameter estimates because the models are the same:

/* sweep original SSCP in the order of v */
v = {1 3 5 2 4};
pVarNames = varNames[,v];    /* effect names in permuted order */
S = sweep(SSCP, v);          /* sweep the specified rows in the specified order */
b = S[v, rCol];              /* get the parameter estimates in new order */
print b[r=pVarNames];

As expected, the parameter estimates are the same, but the order of the parameters is different.

You can also estimate parameters for any model that uses a subset of the columns. For example, the model that has the explanatory variables Weight and Horsepower (the fifth and third effects) is computed from the SSCP matrix by sweeping the columns in the order {1 5 3}, as follows:

/* Perform a partial sweep for the model {1 5 3}. No need to form a new
   design matrix or to extract a portion of the SCCP. */
v = {1 5 3};
pVarNames = varNames[,v];    /* effect names for this model */
S = sweep(SSCP, v);          /* sweeps in the variables in the order of v */
b = S[v, rCol];              /* get the parameter estimates */
print b[r=pVarNames];

The output matches the "PARTIAL" model from PROC REG.

Summary

In a previous post, I showed an efficient way to permute the order of rows and columns of a matrix. But the current article shows that you do not need to permute the elements of an SSCP matrix if you use the SWEEP operator. The SWEEP operator enables you to specify the order in which you want to add effects to the model. From a design matrix, you can compute the SSCP matrix. (This is most of the work!) You can then quickly compute any model that uses the columns of the design matrix in any order.

This useful property of the SWEEP operator is among the many features discussed in Goodnight, J. (1979), "A Tutorial on the SWEEP Operator," The American Statistician. For other references, see the end of my previous article about the SWEEP operator.

The post More on the SWEEP operator for least-square regression models appeared first on The DO Loop.

July 14, 2021
 

In a previous article, I discussed various ways to solve a least-square linear regression model. I discussed the SWEEP operator (used by many SAS regression routines), the LU-based methods (SOLVE and INV in SAS/IML), and the QR decomposition (CALL QR in SAS/IML). Each method computes the estimates for the regression coefficients, b, by using the normal equations (X`X) b = X`y, where X is a design matrix for the data.

This article describes a QR-based method that does not use the normal equations but works directly with the overdetermined system X b = y. It then compares the performance of the direct QR method to the computational methods that use the normal equations.

The QR solution of an overdetermined system

As shown in the previous article, you can use the QR algorithm to solve the normal equations. However, if you search the internet for "QR algorithm and least squares," you find many articles that show how you can use the QR decomposition to directly solve the overdetermined system X b = y. How does the direct QR method compare to the methods that use the normal equations?

Recall that X is an n x m design matrix, where n > m and X is assumed to be full rank of m. For simplicity, I will ignore column pivoting. If you decompose X = QRL, the orthogonal matrix Q is n x n, but the matrix RL is not square. ("L" stands for "long.") However, RL is the vertical concatenation of a square triangular matrix and a rectangular matrix of zeros:
\({\bf R_L} = \begin{bmatrix} {\bf R} \\ {\bf 0} \end{bmatrix}\)
If you let Q1 be the first m columns of Q and let Q2 be the remaining (n-m) columns, you get a partitioned matrix equation:
\(\begin{bmatrix} {\bf Q_1} & {\bf Q_2} \end{bmatrix} \begin{bmatrix} {\bf R} \\ {\bf 0} \end{bmatrix} {\bf b} = {\bf y}\)
If you multiply both sides by Q` (the inverse of the orthogonal matrix, Q), you find out that the important matrix equation to solve is \({\bf R b} = {\bf Q_1^{\prime} y}\). The vector \({\bf Q_1^{\prime} y}\) is the first m rows of the vector \({\bf Q^{\prime} y}\). The QR call in SAS/IML enables you to obtain the triangular R matrix and the vector Q` y directly from the data matrix and the observed vector. The following program uses the same design matrix as for my previous article. Assuming X has rank m, the call to the QR subroutine returns the m x m triangular matrix, R, and the vector Q` y. You can then extract the first m rows of that vector and solve the triangular system, as follows:

/* Use PROC GLMSELECT to write a design matrix */
proc glmselect data=Sashelp.Class outdesign=DesignMat;
   class Sex;
   model Weight = Height Sex Height*Sex/ selection=none;
run;
 
proc iml;
use DesignMat;
   read all var {'Intercept' 'Height' 'Sex_F' 'Height_Sex_F'} into X;
   read all var {'Weight'} into Y;
close;
 
/* The QR algorithm can work directly with the design matrix and the observed responses. */
call QR(Qty, R, piv, lindep, X, , y);   /* return Q`*y and R (and piv) */
m = ncol(X);
c = QTy[1:m];                           /* we only need the first m rows of Q`*y */
b = trisolv(1, R, c, piv);              /* solve triangular system */
print b[L="Direct QR" F=D10.4];

This is the same least-squares solution that was found by using the normal equations in my previous article.

Compare the performance of least-squares solutions

How does this direct method compare with the methods that use the normal equations? You can download a program that creates simulated data and runs each algorithm to estimate the least-squares regression coefficients. The simulated data has 100,000 observations; the number of variables is chosen to be m={10, 25, 50, 75, 100, 250, 500}. The program uses SAS/IML 15.1 on a desktop PC to time the algorithms. The results are shown below:

The most obvious feature of the graph is that the "Direct QR" method that is described in this article is not as fast as the methods that use the normal equations. For 100 variables and 100,000 observations, the "Direct QR" call takes more than 12 seconds on my PC. (It's faster on a Linux server). The graph shows that the direct method shown in this article is not competitive with the normal-equation-based algorithms when using the linear algebra routines in SAS/IML 15.1.

The graph shows that the algorithms that use the normal equations are relatively faster. For the SAS/IML calls on my PC, you can compute the regression estimates for 500 variables in about 2.6 seconds. The graph has a separate line for the time required to form the normal equations (which you can think of as forming the X`X matrix). Most of the time is spent computing the normal equations; only a fraction of the time is spent actually solving the normal equations. The following table shows computations on my PC for the case of 500 variables:

The table shows that it takes about 2.6 seconds to compute the X`X matrix and the vector X`y. After you form the normal equations, solving them is very fast. For this example, the SOLVE and INV methods take only a few milliseconds to solve a 500 x 500 system. The QR algorithms take 0.1–0.2 seconds longer. So, for this example, forming the normal equations accounts for more than 90% of the total time.

Another article compares the performance of the SOLVE and INV routines in SAS/IML.

SAS regression procedures

These results are not the best that SAS can do. SAS/IML is a general-purpose tool. SAS regression procedures like PROC REG are optimized to compute regression estimates even faster. They also use the SWEEP operator, which is faster than the SOLVE function. For more than 20 years, SAS regression procedures have used multithreaded computations to optimize the performance of regression computations (Cohen, 2002). More recently, SAS Viya added the capability for parallel processing, which can speed up the computations even more. And, of course, they compute much more than only the coefficient estimates! They also compute standard errors, p-values, related statistics (MSE, R square,....), diagnostic plots, and more.

Summary

This article compares several methods for obtaining least-squares regression estimates. It uses simulated data where the number of observations is much greater than the number of variables. It shows that methods that use the normal equations are faster than a "Direct QR" method, which does not use the normal equations. When you use the normal equations, most of the time is spent actually forming the normal equations. After you have done that, the time required to solve the system is relatively fast.

You can download the SAS program that computes the tables and graphs in this article.

The post Compare computational methods for least squares regression appeared first on The DO Loop.

July 12, 2021
 

In computational statistics, there are often several ways to solve the same problem. For example, there are many ways to solve for the least-squares solution of a linear regression model. A SAS programmer recently mentioned that some open-source software uses the QR algorithm to solve least-squares regression problems and asked how that compares with SAS. This article shows how to estimate a least-squares regression model by using the QR method in SAS. It shows how to use the QR method in an efficient way.

Solving the least-squares problem

Before discussing the QR method, let's briefly review other ways to construct a least-squares solution to a regression problem. In a regression problem, you have an n x m data matrix, X, and an n x 1 observed vector of responses, y. When n > m, this is an overdetermined system and typically there is no exact solution. Consequently, the goal is to find the m x 1 vector of regression coefficients, b, such that the predicted values (X b) are as close as possible to the observed values. If b is a least-squares solution, then b minimizes the squared norm || X b - y ||², where ||v||² is the sum of the squares of the components of the vector v.

Using calculus, you can show that the solution vector, b, satisfies the normal equations (X`X) b = X`y. The normal equations have a unique solution when the crossproduct matrix X`X is nonsingular. Most numerical algorithms for least-squares regression start with the normal equations, which have nice numerical properties that can be exploited.

Creating a design matrix

The first step of solving a regression problem is to create the design matrix. For continuous explanatory variables, this is easy: You merely append a column of ones (the intercept column) to the matrix of the explanatory variables. For more complicated regression models that contain manufactured effects such as classification effects, interaction effects, or spline effects, you need to create a design matrix. The GLMSELECT procedure is the best way to create a design matrix for fixed effects in SAS. The following call to PROC GLMSELECT writes the design matrix to the DesignMat data set. The call to PROC REG estimates the regression coefficients:

proc glmselect data=Sashelp.Class outdesign=DesignMat;  /* write design matrix */
   class Sex;
   model Weight = Height Sex Height*Sex/ selection=none;
run;
/* test: use the design matrix to estimate the regression coefficients */
proc reg data=DesignMat plots=none;
   model Weight = Height Sex_F Height_Sex_F;
run;

The goal of the rest of this article is to reproduce the regression estimates by using various other linear algebra operations. For each method, we want to produce the same estimates: {-141.10, 3.91, 23.73, -0.49}.

Linear regression: A linear algebraic approach

I've previously discussed the fact that most SAS regression procedures use the sweep operator to construct least-squares solutions. Another alternative is to use the SOLVE function in SAS/IML:

proc iml;
use DesignMat;
   read all var {'Intercept' 'Height' 'Sex_F' 'Height_Sex_F'} into X;
   read all var {'Weight'} into Y;
close;
 
/* form normal equations */
XpX = X`*X;
Xpy = X`*y;
 
/* an efficient numerical solution: solve a particular RHS */
b = solve(XpX, Xpy);
print b[L="SOLVE" F=D10.4];

The SOLVE function is very efficient and gives the same parameter estimates as the SWEEP operator (which was used by PROC REG). Using the SOLVE function on the system A*b=z is mathematically equivalent to using the matrix inverse to find b = A⁻¹z. However, the INV function explicitly forms the inverse matrix, whereas the SOLVE function does not. Consequently, the SOLVE function is faster and more efficient than using the following SAS/IML statement:

/* a less efficient numerical solution */
Ainv = inv(XpX);       /* explicitly form the (m x m) inverse matrix */
b = Ainv*Xpy;          /* apply the inverse matrix to RHS */
print b[L="INV" F=D10.4];

For a performance comparison of the SOLVE and INV functions, see the article, "Solving linear systems: Which technique is fastest?"

An introduction to the QR method for solving linear equations

There are many ways to solve the normal equations. The SOLVE (and INV) functions use the LU decomposition. An alternative is the QR algorithm, which is slower but can be more accurate for ill-conditioned systems. You can apply the QR decomposition to the normal equations or to the original design matrix. This section discusses the QR decomposition of the design matrix. A subsequent article discusses decomposing the data matrix directly.

Let's see how the QR algorithm solves the normal equations. You can decompose the crossproduct matrix as the product of an orthogonal matrix, Q, and an upper-triangular matrix, R. If A = X`X, then A = QR. (I am ignoring column pivoting, which is briefly discussed below.) The beauty of an orthogonal matrix is that the transpose equals the inverse. This means that you can reduce the system A b = (X` y) to a triangular system R b = Q` (X` y), which is easily solved by using the TRISOLV function in SAS/IML.

The SAS/IML version of the QR algorithm is known as "QR with column pivoting." Using column pivoting improves solving rank-deficient systems and provides better numerical accuracy. However, when the algorithm uses column pivoting, you are actually solving the system AP = QR, where P is a permutation matrix. This potentially adds some complexity to dealing with the QR algorithm. Fortunately, the TRISOLV function supports column pivoting. If you use the TRISOLV function, you do not need to worry about whether pivoting occurred or not.

Using the QR method in SAS/IML

Recall from the earlier example that it is more efficient to use the SOLVE function than the INV function. The reason is that the INV function explicitly constructs an m x m matrix, which then is multiplied with the right-hand side (RHS) vector to obtain the answer. In contrast, the SOLVE function never forms the inverse matrix. Instead, it directly applies transformations to the RHS vector.

The QR call in SAS/IML works similarly. If you do not supply a RHS vector, v, then the QR call returns the full m x m matrix, Q. You then must multiply Q` v yourself. If you do supply the RHS vector, then the QR call returns Q` v without ever forming Q. This second method is more efficient.

See the documentation of the QR call for the complete syntax. The following call uses the inefficient method in which the Q matrix is explicitly constructed:

/* It is inefficient to factor A=QR and explicitly solve the equation R*b = Q`*c */
call QR(Q, R, piv, lindep, XpX);       /* explicitly form the (m x m) matrix Q */
c = Q`*Xpy;                            /* apply the inverse Q matrix to RHS */
b = trisolv(1, R, c, piv);             /* equivalent to b = inv(R)*Q`*c */
print b[L="General QR" F=D10.4];

In contrast, the next call is more efficient because it never explicitly forms the Q matrix:

/* It is more efficient to solve for a specific RHS */
call QR(c, R, piv, lindep, XpX, ,Xpy); /* returns c = Q`*Xpy */
b = trisolv(1, R, c, piv);             /* equivalent to b = inv(R)*Q`*c */
print b[L="Specific QR" F=D10.4];

The output is not shown, but both calls produce estimates for the regression coefficients that are exactly the same as for the earlier examples. Notice that the TRISOLV function takes the pivot vector (which represents a permutation matrix) as the fourth argument. That means you do not need to worry about whether column pivoting occurred during the QR algorithm.

QR applied to the design matrix

As mentioned earlier, you can also apply the QR algorithm to the design matrix, X, and the QR algorithm will return the least-square solution without ever forming the normal equations. This is shown in a subsequent article, which also compares the speed of the various methods for solving the least-squares problem.

Summary

This article discusses three ways to solve a least-squares regression problem. All start by constructing the normal equations: (X`X) b = X` y. The solution of the normal equations (b) is the vector that minimizes the squared differences between the predicted values, X b, and the observed responses, y. This article discusses

  • the SWEEP operator, which is used by many SAS regression procedures
  • the SOLVE and INV functions, which use the LU factorization
  • the QR call, which implements the QR algorithm with column pivoting

A follow-up article compares the performance of these methods and of another QR algorithm, which does not use the normal equations.

The post The QR algorithm for least-squares regression appeared first on The DO Loop.

March 29, 2021
 

A previous article discusses how to interpret regression diagnostic plots that are produced by SAS regression procedures such as PROC REG. In that article, two of the plots indicate influential observations and outliers. Intuitively, an observation is influential if its presence changes the parameter estimates for the regression by "more than it should." Various researchers have developed statistics that give a precise meaning to "more than it should," and the formulas for several of the popular influence statistics are included in the PROC REG documentation.

This article discusses the Cook's D and leverage statistics. Specifically, how can you use the information in the diagnostic plots to identify which observations are influential? I show two techniques for identifying the observations. The first uses a DATA step and a formula to identify influential observations. The second technique uses the ODS OUTPUT statement to extract the same information directly from a regression diagnostic plot.

These are not the only regression diagnostic plots that can identify influential observations:

  • The DFBETAS statistic estimates the effect that deleting each observation has on the estimates for the regression coefficients.
  • The DFFITS statistic is a measure of how the predicted value changes when an observation is deleted. It is closely related to the Cook's D statistic.

The data and model

As in the previous article, let's use a model that does NOT fit the data very well, which makes the diagnostic plots more interesting. The following DATA step adds a quadratic effect to the Sashelp.Thick data and also adds a variable that is used in a subsequent section to merge the data with the Cook's D and leverage statistics.

data Thick2;
set Sashelp.Thick;
North2 = North**2;   /* add quadratic effect */
Observation = _N_;   /* for merging with other data sets */
run;

Graphs and labels

Rather than create the entire panel of diagnostic plots, you can use the PLOTS(ONLY)= option to create only the graphs for Cook's D statistic and for the studentized residuals versus the leverage. In the following call to PROC REG, the LABEL suboption requests that the influential observations be labeled. By default, the labels are the observation numbers. You can use the ID statement in PROC REG to specify a variable to use for the labels.

proc reg data=Thick2  plots(only label) =(CooksD RStudentByLeverage);
   model Thick = North North2 East; /* can also use INFLUENCE option */
run;

The graph of the Cook's D statistic is shown above. The PROC REG documentation states that the horizontal line is the value 4/n, where n is the number of observations in the analysis. For these data, n = 75. According to this statistic, observations 1, 4, 8, 63, and 65 are influential.

The second graph is a plot of the studentized residual versus the leverage statistic. The PROC REG documentation states that the horizontal lines are the values ±2 and the vertical line is at the value 2p/n, where p is the number of parameters, including the intercept. For this model, p = 4. The observations 1, 4, 8, 10, 16, 63, and 65 are shown in this graph as potential outliers or potential high-leverage points.

From the graphs, we know that certain observations are influential. But how can we highlight those influential observations in plots, print them, or otherwise analyze them? And how can we automate the process so that it works for any model on any data set?

Manual computation of influential observations

There are two ways to determine which observations have large residuals or are high-leverage or have a large value for the Cook's D statistic. The traditional way is to use the OUTPUT statement in PROC REG to output the statistics, then identify the observations by using the same cutoff values that are shown in the diagnostic plots. For example, the following DATA step lists the observations whose Cook's D statistic exceeds the cutoff value 4/n ≈ 0.053.

/* manual identification of influential observations */
proc reg data=Thick2  plots=none;
   model Thick = North North2 East; /* can also use INFLUENCE option */
   output out=RegOut predicted=Pred student=RStudent cookd=CookD H=Leverage;
quit;
 
%let p = 4;  /* number of parameters in the model, including the intercept */
%let n = 75; /* Number of Observations Used */
title "Influential (Cook's D)";
proc print data=RegOut;
   where CookD > 4/&n;
   var Observation East North Thick CookD;
run;

This technique works well. However, it assumes that you can easily write a formula to identify the influential observations. This is true for regression diagnostics. However, you can imagine a more complicated graph for which "special" observations are not so easy to identify. As shown in the next section, you can leverage (pun intended) the fact that SAS already identified the special observations for you.

Leveraging the power of the ODS OUTPUT statement

Did you know that you can create a data set from any SAS graphic? Many SAS programmers use ODS OUTPUT to save a table to a SAS data set, but the same technique enables you to save the data underlying any ODS graph. There is a big advantage of using ODS OUTPUT to get to the data in a graph: SAS has already done the work to identify and label the important points in the graph. You don't need to know any formulas!

Let's see how this works by extracting the observations whose Cook's D statistic exceeds the cutoff value. First, generate the graphs and use ODS OUTPUT to save the data underlying both plots, as follows:

/* Let PROC REG do the work. Use ODS OUTPUT to capture information */
ods exclude all;
proc reg data=Thick2  plots(only label) =(CooksD RStudentByLeverage);
   model Thick = North North2 East; 
   ods output CooksDPlot=CookOut         /* output the data for the Cook's D graph */
              RStudentByLeverage=RSOut;  /* output the data for the outlier-leverage plot */
quit;
ods exclude none;
 
/* you need to look at the data! */
proc print data=CookOut(obs=12); 
   where Observation > 0;
run;

When you output the data from a graph, you have to look at how the data are structured. Sometimes the data set has strange variable names or extra observations. In this case, the data begins with three extra observations for which the Cook's D statistic is missing. For the curious, fake observations are often used to set axes ranges or to specify the order of groups in a plot.

Notice that the CookOut data set includes a variable named Observation, which you can use to merge the CookOut data and the original data.

From the structure of the CookOut data set, you can infer that the influential observations are those for which the CooksDLabel variable is nonmissing (excepting the fake observations at the top of the data). Therefore, the following DATA step merges the output data sets and the original data. In the same DATA step, you can create other useful variables, such as a binary variable that indicates which observations have a large Cook's D statistic:

data All;
merge Thick2 CookOut RSOut;
by Observation;
/* create a variable that indicates whether the obs has a large Cook's D stat */
CooksDInf = (^missing(CooksD) & CooksDLabel^=.);
label CooksDInf = "Influential (Cook's D)";
run;
 
proc print data=All noobs;
   where CooksDInf > 0; 
   var Observation East North Thick;
run;
 
proc sgplot data=All;
   scatter x=East y=North / group=CooksDInf datalabel=CooksDLabel 
                  markerattrs=(symbol=CircleFilled size=10);
run;

The output from PROC PRINT (not shown) confirms that observations 1, 4, 8, 63, and 65 have a large Cook's D statistic. The scatter plot shows that the influential observations are located at extreme values of the explanatory variables.

Outliers and high-leverage points

The process to extract or visualize the outliers and high-leverage points is similar. The RSOut data set contains the relevant information. You can do the following:

  1. Look at the names of the variables and the structure of the data set.
  2. Merge with the original data by using the Observation variable.
  3. Use one or more variables to identify the special observations.

The following call to PROC PRINT gives you an overview of the data and its structure:

/* you need to look at the data! */
proc print data=RSOut(obs=12) noobs; 
   where Observation > 0;
   var Observation RStudent HatDiagonal RsByLevIndex outLevLabel RsByLevGroup;
run;

For the RSOut data set, the indicator variable is named RsByLevIndex, which has the value 1 for ordinary observations and the value 2, 3, or 4 for influential observations. The meaning of each index value is shown in the RsByLevGroup variable, which has the corresponding values "Outlier," "Leverage," and "Outlier and Leverage" (or a blank string for ordinary observations). You can use these values to identify the outliers and influential observations. For example, you can print all influential observations or you can graph them, as shown in the following statements:

proc print data=All noobs;
   where Observation > 0 & RsByLevIndex > 1;
   var Observation East North Thick RsByLevGroup;
run;
 
proc sgplot data=All;
   scatter x=East y=North / markerattrs=(size=9);        /* ordinary points are not filled */
   scatter x=East y=North / group=RsByLevGroup nomissinggroup /* special points are filled */
           datalabel=OutLevLabel markerattrs=(symbol=CircleFilled size=10);
run;

The PROC PRINT output confirms that we can select the noteworthy observations. In the scatter plot, the color of each marker indicates whether the observation is an outlier, a high-leverage point, both, or neither.

Summary

It is useful to identify and visualize outliers and influential observations in a regression model. One way to do this is to manually compute a cutoff value and create an indicator variable that indicates the status of each observation. This article demonstrates that technique for the Cook's D statistic. However, an alternative technique is to take advantage of the fact that SAS can create graphs that label the outliers and influential observations. You can use the ODS OUTPUT statement to capture the data underlying any ODS graph. To visualize the noteworthy observations, you can merge the original data and the statistics, indicator variables, and label variables.

Although it's not always easy to decipher the variable names and the structure of the data that comes from ODS graphics, this technique is very powerful. Its use goes far beyond the regression example in this article. The technique enables you to incorporate any SAS graph into a second analysis or visualization.

The post Identify influential observations in regression models appeared first on The DO Loop.

March 24, 2021
 

When you fit a regression model, it is useful to check diagnostic plots to assess the quality of the fit. SAS, like most statistical software, makes it easy to generate regression diagnostics plots. Most SAS regression procedures support the PLOTS= option, which you can use to generate a panel of diagnostic plots. Some procedures (most notably PROC REG and PROC LOGISTIC) support dozens of graphs that help you to evaluate the fit of the model, to determine whether the data satisfy various assumptions, and to identify outliers and high-leverage points. Diagnostic plots are most useful when the size of the data is not too large, such as less than 5,000 observations.

This article shows how to interpret diagnostic plots for a model that does not fit the data. For an ill-fitting model, the diagnostic plots should indicate the lack of fit.

This article discusses the eight plots in the DiagnosticsPanel plot, which is produced by several SAS regression procedures. To make the discussion as simple as possible, this article uses PROC REG to fit an ordinary least squares model to the data.

The eight plots can be classified into five groups:

  1. A plot of the observed versus predicted responses. (Center of panel.)
  2. Two graphs of residual values versus the predicted responses. (Upper left of panel.)
  3. Two graphs that indicate outliers, high-leverage observations, and influential observations. (Upper right of panel.)
  4. Two graphs that assess the normality of the residuals. (Lower left of panel.)
  5. A plot that compares the cumulative distributions of the centered predicted values and the residuals. (Bottom of panel.)

This article also includes graphs of the residuals plotted against the explanatory variables.

Create a model that does not fit the data

This section creates a regression model that (intentionally) does NOT fit the data. The data are 75 locations and measurements of the thickness of a coal seam. A naive model might attempt to fit the thickness of the coal seam as a linear function of the East-West position (the East variable) and the South-North position (the North variable). However, even a cursory glance at the data (see Figure 2 in the documentation for the VARIOGRAM procedure) indicates that the thickness is not linear in the North-South direction, so the following DATA step creates a quadratic effect. The subsequent call to PROC REG fits the model to the data and uses the PLOTS= option to create a panel of diagnostic plots. By default, PROC REG creates a diagnostic panel and a panel of residual plots. If you want to add a loess smoother to the residual plots, you can use the SMOOTH suboption of the RESIDUALPLOT option, as follows:

data Thick2;
set Sashelp.Thick;
North2 = North**2;   /* add quadratic effect */
run;
 
proc reg data=Thick2
         plots =(DiagnosticsPanel ResidualPlot(smooth));
   model Thick = North North2 East;
quit;

The panel of diagnostic plots is shown. The panel of residual plots is shown later in this article. To guide the discussion, I have overlaid colored boxes around certain graphs. You can look at the graphs in any order, but I tend to look at them in the order indicated by the numbers in the panel.

1. The predicted versus observed response

The graph in the center (orange box) shows the quality of the predictive model. The graph plots the observed response versus the predicted response for each observation. For a model that fits the data well, the markers will be close to the diagonal line. Markers that are far from the line indicate observations for which the predicted response is far from the observed response. That is, the residual is large.

For this model, the cloud of markers is not close to the diagonal, which indicates that the model does not fit the data. You can see several markers that are far below the diagonal. These observations will have large negative residuals, as shown in the next section.

2. The residual and studentized residual plots

Two residual plots in the first row (purple box) show the raw residuals and the (externally) studentized residuals for the observations.

The first graph is a plot of the raw residuals versus the predicted values. Ideally, the graph should not show any pattern. Rather, it should look like a set of independent and identically distributed random variables. You can use this graph for several purposes:

  1. If the residuals seem to follow a curve (as in this example), it indicates that the model is not capturing all the "signal" in the response variable. You can try to add additional effects to the model to see if you can eliminate the pattern.
  2. If the residuals are "fan shaped," it indicates that the residuals do not have constant variance. Fan-shaped residuals are an example of heteroscedasticity. The canonical example of a fan-shaped graph is when small (in magnitude) residuals are associated with observations that have small predicted values. Similarly, large residuals are associated with observations that have large predicted values. Heteroscedasticity does not invalidate the model's ability to make predictions, but it violates the assumptions behind inferential statistics such as p-values, confidence intervals, and significance tests.
  3. Detect autocorrelation. If the residuals are not randomly scattered, it might indicate that they are not independent. A time series can exhibit autocorrelation; spatial data can exhibit spatial correlations.

The second residual graph often looks similar to the plot of the raw residuals. The vertical coordinates in the second graph (labeled "RStudent") are externally studentized residuals. The PROC REG documentation includes the definition of a studentized residual. For a studentized residual, the variance for the i_th observation is estimated without including the i_th observation. If the magnitude of the studentized residual exceeds 2, you should examine the observation as a possible outlier. Thus, the graph includes horizontal lines at ±2. For these data, five observations have large negative residuals.

3. Influential observations and high-leverage points

The two graphs in the upper right box (green) enable you to investigate outliers, influential observations, and high-leverage points. One graph plots the studentized residuals versus the leverage value for each observation. As mentioned previously, the observations whose studentized residuals exceed ±2 can be considered outliers. The leverage statistic attempts to identify influential observations. The leverage statistic indicates how far an observation is from the centroid of the data in the space of the explanatory variables. Observations far from the centroid are potentially influential in fitting the regression model. "Influential" means that deleting that observation can result in a relatively large change in the parameter estimates.

Markers to the right of the vertical line are high-leverage points. Thus, the three lines in the graph separate the observations into four categories: (1) neither an outlier nor a high-leverage point; (2) an outlier, but not a high-leverage point; (3) a high-leverage point, but not an outlier; and (4) an outlier and a high-leverage point.

If there are high-leverage points in your data, you should be careful interpreting the model. Least squares models are not robust, so the parameter estimates are affected by high-leverage points.

The needle plot of Cook's D second graph (second row, last column) is another plot that indicates influential observations. Needles that are higher than the horizontal line are considered influential.

A subsequent article includes details about how to use the Cook's D plot to identify influential observations in your data.

4. Normality of residuals

The graphs in the lower left (red box) indicate whether the residuals for the model are normally distributed. Normally distributed residuals are one of the assumptions of regression that are used to derive inferential statistics. The first plot is a normal quantile-quantile plot (Q-Q plot) of the residuals. If the residuals are approximately normal, the markers should be close to the diagonal line. The following table gives some guidance about how to interpret deviations from a straight line:

Shape of Q-Q Plots
Point Pattern Interpretation
all but a few points fall on a line outliers in the data
left end of pattern is below the line long tail on the left
left end of pattern is above the line short tail on the left
right end of pattern is below the line short tail on the right
right end of pattern is above the line long tail on the right

For the Q-Q plot of these residuals, the markers are below the diagonal line at both ends of the pattern. That means that the distribution of residuals is negatively skewed: it has a long tail on the left and a short tail on the right. This is confirmed by looking at the histogram of residuals. For this model, the residuals are not normal; the overlay of the normal density curve does not fit the histogram well.

The spread plot

The last graph (blue box) is the residual-fit spread plot, which I have covered in a separate article. For these data, you can conclude that the three explanatory variables account for a significant portion of the variation in the model.

Residual plots with smoothers

Lastly, the PLOTS=RESIDUALPLOT(SMOOTH) option on the PROC REG statement creates a panel that includes the residuals plotted against the value of each explanatory variable. The SMOOTH suboption overlays a loess smoother, which is often helpful in determining whether the residuals contain a signal or are randomly distributed. The panel for these data and model is shown below:

The residuals plotted again the East variable shows a systematic pattern. It indicates that the model does not adequately capture the dependence of the response on the East variable. An analyst who looks at this plot might decide to add a quadratic effect in the East direction, or perhaps to model a fully quadratic response surface by including the North*East interaction term.

Summary

This article created a regression model that does not fit the data. The regression diagnostic panel detects the shortcomings in the regression model. The diagnostic panel also shows you important information about the data, such as outliers and high-leverage points. The diagnostic plot can help you evaluate whether the data and model satisfy the assumptions of linear regression, including normality and independence of errors.

A subsequent article will describe how to use the diagnostic plots to identify influential observations.

The post An overview of regression diagnostic plots in SAS appeared first on The DO Loop.

March 24, 2021
 

When you fit a regression model, it is useful to check diagnostic plots to assess the quality of the fit. SAS, like most statistical software, makes it easy to generate regression diagnostics plots. Most SAS regression procedures support the PLOTS= option, which you can use to generate a panel of diagnostic plots. Some procedures (most notably PROC REG and PROC LOGISTIC) support dozens of graphs that help you to evaluate the fit of the model, to determine whether the data satisfy various assumptions, and to identify outliers and high-leverage points. Diagnostic plots are most useful when the size of the data is not too large, such as less than 5,000 observations.

This article shows how to interpret diagnostic plots for a model that does not fit the data. For an ill-fitting model, the diagnostic plots should indicate the lack of fit.

This article discusses the eight plots in the DiagnosticsPanel plot, which is produced by several SAS regression procedures. To make the discussion as simple as possible, this article uses PROC REG to fit an ordinary least squares model to the data.

The eight plots can be classified into five groups:

  1. A plot of the observed versus predicted responses. (Center of panel.)
  2. Two graphs of residual values versus the predicted responses. (Upper left of panel.)
  3. Two graphs that indicate outliers, high-leverage observations, and influential observations. (Upper right of panel.)
  4. Two graphs that assess the normality of the residuals. (Lower left of panel.)
  5. A plot that compares the cumulative distributions of the centered predicted values and the residuals. (Bottom of panel.)

This article also includes graphs of the residuals plotted against the explanatory variables.

Create a model that does not fit the data

This section creates a regression model that (intentionally) does NOT fit the data. The data are 75 locations and measurements of the thickness of a coal seam. A naive model might attempt to fit the thickness of the coal seam as a linear function of the East-West position (the East variable) and the South-North position (the North variable). However, even a cursory glance at the data (see Figure 2 in the documentation for the VARIOGRAM procedure) indicates that the thickness is not linear in the North-South direction, so the following DATA step creates a quadratic effect. The subsequent call to PROC REG fits the model to the data and uses the PLOTS= option to create a panel of diagnostic plots. By default, PROC REG creates a diagnostic panel and a panel of residual plots. If you want to add a loess smoother to the residual plots, you can use the SMOOTH suboption of the RESIDUALPLOT option, as follows:

data Thick2;
set Sashelp.Thick;
North2 = North**2;   /* add quadratic effect */
run;
 
proc reg data=Thick2
         plots =(DiagnosticsPanel ResidualPlot(smooth));
   model Thick = North North2 East;
quit;

The panel of diagnostic plots is shown. The panel of residual plots is shown later in this article. To guide the discussion, I have overlaid colored boxes around certain graphs. You can look at the graphs in any order, but I tend to look at them in the order indicated by the numbers in the panel.

1. The observed versus predicted response

The graph in the center (orange box) shows the quality of the predictive model. The graph plots the observed response versus the predicted response for each observation. For a model that fits the data well, the markers will be close to the diagonal line. Markers that are far from the line indicate observations for which the predicted response is far from the observed response. That is, the residual is large.

For this model, the cloud of markers is not close to the diagonal, which indicates that the model does not fit the data. You can see several markers that are far below the diagonal. These observations will have large negative residuals, as shown in the next section.
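
If you want this central graph by itself, for example to customize it, one simple approach (a sketch, not shown in the original article) is to write the predicted values to a data set and plot them against the observed response with PROC SGPLOT, overlaying the line of perfect agreement:

proc reg data=Thick2 plots=none;
   model Thick = North North2 East;
   output out=Fit p=Pred;                 /* save the predicted values */
quit;
 
proc sgplot data=Fit;
   scatter x=Pred y=Thick;
   lineparm x=0 y=0 slope=1 / clip;       /* line of perfect agreement */
   xaxis label="Predicted Value";
   yaxis label="Observed Value";
run;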

2. The residual and studentized residual plots

Two residual plots in the first row (purple box) show the raw residuals and the (externally) studentized residuals for the observations.

The first graph is a plot of the raw residuals versus the predicted values. Ideally, the graph should not show any pattern. Rather, it should look like a set of independent and identically distributed random variables. You can use this graph for several purposes:

  1. If the residuals seem to follow a curve (as in this example), it indicates that the model is not capturing all the "signal" in the response variable. You can try to add additional effects to the model to see if you can eliminate the pattern.
  2. If the residuals are "fan shaped," it indicates that the residuals do not have constant variance. Fan-shaped residuals are an example of heteroscedasticity. The canonical example of a fan-shaped graph is when small (in magnitude) residuals are associated with observations that have small predicted values. Similarly, large residuals are associated with observations that have large predicted values. Heteroscedasticity does not invalidate the model's ability to make predictions, but it violates the assumptions behind inferential statistics such as p-values, confidence intervals, and significance tests.
  3. Detect autocorrelation. If the residuals are not randomly scattered, it might indicate that they are not independent. A time series can exhibit autocorrelation; spatial data can exhibit spatial correlations.
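
If you prefer a numerical check to accompany the graphs, one possibility (a sketch, not part of the original analysis) is the SPEC option on the MODEL statement, which performs White's test. The null hypothesis is that the errors are homoscedastic and the first and second moments of the model are correctly specified, so a small p-value is evidence of heteroscedasticity or misspecification:

proc reg data=Thick2 plots=none;
   model Thick = North North2 East / spec;   /* White's test of first and second moment specification */
quit;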

The second residual graph often looks similar to the plot of the raw residuals. The vertical coordinates in the second graph (labeled "RStudent") are externally studentized residuals. The PROC REG documentation includes the definition of a studentized residual. For a studentized residual, the variance for the i_th observation is estimated without including the i_th observation. If the magnitude of the studentized residual exceeds 2, you should examine the observation as a possible outlier. Thus, the graph includes horizontal lines at ±2. For these data, five observations have large negative residuals.

3. Influential observations and high-leverage points

The two graphs in the upper right box (green) enable you to investigate outliers, influential observations, and high-leverage points. One graph plots the studentized residuals versus the leverage value for each observation. As mentioned previously, the observations whose studentized residuals exceed ±2 can be considered outliers. The leverage statistic attempts to identify influential observations. The leverage statistic indicates how far an observation is from the centroid of the data in the space of the explanatory variables. Observations far from the centroid are potentially influential in fitting the regression model. "Influential" means that deleting that observation can result in a relatively large change in the parameter estimates.

Markers to the right of the vertical line are high-leverage points. Thus, the three lines in the graph separate the observations into four categories: (1) neither an outlier nor a high-leverage point; (2) an outlier, but not a high-leverage point; (3) a high-leverage point, but not an outlier; and (4) an outlier and a high-leverage point.

If there are high-leverage points in your data, you should be careful interpreting the model. Least squares models are not robust, so the parameter estimates are affected by high-leverage points.

The second graph in this group (second row, last column of the panel) is a needle plot of Cook's D statistic, which is another way to identify influential observations. Needles that are higher than the horizontal reference line indicate observations that are considered influential.

A subsequent article includes details about how to use the Cook's D plot to identify influential observations in your data.
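
If you want the numbers behind these two graphs, you can write the relevant statistics to a data set and apply the usual rules of thumb yourself. The following sketch (not shown in the original article) saves the externally studentized residuals, the leverage values, and Cook's D, and flags observations by the conventional cutoffs of 2, 2p/n, and 4/n, respectively:

proc reg data=Thick2 plots=none;
   model Thick = North North2 East;
   output out=RegDiag rstudent=RStudent h=Leverage cookd=CookD;
quit;
 
data Flagged;
   set RegDiag;
   p = 4;                                 /* number of parameters, including the intercept */
   n = 75;                                /* number of observations */
   Outlier     = (abs(RStudent) > 2);     /* studentized residual rule of thumb */
   HighLev     = (Leverage > 2*p/n);      /* common leverage cutoff */
   Influential = (CookD > 4/n);           /* common Cook's D cutoff */
   if Outlier or HighLev or Influential;  /* keep only the flagged observations */
run;
 
proc print data=Flagged noobs;
   var North East Thick RStudent Leverage CookD Outlier HighLev Influential;
run;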

4. Normality of residuals

The graphs in the lower left (red box) indicate whether the residuals for the model are normally distributed. Normally distributed residuals are one of the assumptions of regression that are used to derive inferential statistics. The first plot is a normal quantile-quantile plot (Q-Q plot) of the residuals. If the residuals are approximately normal, the markers should be close to the diagonal line. The following table gives some guidance about how to interpret deviations from a straight line:

Shape of Q-Q Plots
  Point pattern                               Interpretation
  All but a few points fall on a line         Outliers in the data
  Left end of pattern is below the line       Long tail on the left
  Left end of pattern is above the line       Short tail on the left
  Right end of pattern is below the line      Short tail on the right
  Right end of pattern is above the line      Long tail on the right

For the Q-Q plot of these residuals, the markers are below the diagonal line at both ends of the pattern. That means that the distribution of residuals is negatively skewed: it has a long tail on the left and a short tail on the right. This is confirmed by looking at the histogram of residuals. For this model, the residuals are not normal; the overlay of the normal density curve does not fit the histogram well.
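
To supplement the visual assessment, you can write the residuals to a data set and use PROC UNIVARIATE to request formal tests of normality along with a histogram and Q-Q plot. This is a sketch of one possible follow-up, not an analysis from the original article:

proc reg data=Thick2 plots=none;
   model Thick = North North2 East;
   output out=ResidOut r=Residual;             /* save the raw residuals */
quit;
 
proc univariate data=ResidOut normal;          /* NORMAL requests tests of normality */
   var Residual;
   histogram Residual / normal;                /* overlay a normal density curve */
   qqplot Residual / normal(mu=est sigma=est); /* compare to a fitted normal */
run;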

The spread plot

The last graph (blue box) is the residual-fit spread plot, which I have covered in a separate article. For these data, you can conclude that the three explanatory variables account for a significant portion of the variation in the model.

Residual plots with smoothers

Lastly, the PLOTS=RESIDUALPLOT(SMOOTH) option on the PROC REG statement creates a panel that includes the residuals plotted against the value of each explanatory variable. The SMOOTH suboption overlays a loess smoother, which is often helpful in determining whether the residuals contain a signal or are randomly distributed. The panel for these data and model is shown below:

The plot of the residuals against the East variable shows a systematic pattern. It indicates that the model does not adequately capture the dependence of the response on the East variable. An analyst who looks at this plot might decide to add a quadratic effect in the East direction, or perhaps to model a fully quadratic response surface by including the North*East interaction term.
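
For example, the following sketch fits a fully quadratic response surface. This is an illustration of one possible next step, not a model that the article recommends. Because PROC REG does not construct crossed or polynomial effects on the MODEL statement, the additional terms are computed in a DATA step:

data Thick3;
   set Sashelp.Thick;
   North2    = North**2;      /* quadratic effect in the North direction */
   East2     = East**2;       /* quadratic effect in the East direction */
   NorthEast = North*East;    /* interaction term */
run;
 
proc reg data=Thick3 plots=(DiagnosticsPanel ResidualPlot(smooth));
   model Thick = North East North2 East2 NorthEast;   /* full quadratic surface */
quit;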

Summary

This article created a regression model that does not fit the data. The regression diagnostic panel detects the shortcomings in the regression model. The diagnostic panel also shows you important information about the data, such as outliers and high-leverage points. The diagnostic plot can help you evaluate whether the data and model satisfy the assumptions of linear regression, including normality and independence of errors.

A subsequent article will describe how to use the diagnostic plots to identify influential observations.

The post An overview of regression diagnostic plots in SAS appeared first on The DO Loop.

February 15, 2021
 

I've previously written about how to generate all pairwise interactions for a regression model in SAS. For a model that contains continuous effects, the easiest way is to use the EFFECT statement in PROC GLMSELECT to generate second-degree "polynomial effects." However, a SAS programmer was running a simulation study and wanted to be able to generate all pairwise interaction effects within SAS/IML. One way to solve the programmer's problem is to use the HDIR function in SAS/IML. This article introduces the HDIR function and shows how to use it to generate pairwise (quadratic) interactions for continuous effects.

Generate pairwise interactions by using the EFFECT statement

Before trying to solve a problem with a new tool, I like to use an existing method to solve the problem. Then I can compare the old and new answers and make sure they are the same.

For sample data, let's use the numerical variables in the Sashelp.Class data. So that the techniques are clearer, I will rename the variables to X1, X2, and X3. We want to generate all six pairwise interactions, including the "pure quadratic" terms where a variable interacts with itself. The names of the interaction effects will be X1_2, X1_X2, X1_X3, X2_2, X2_X3, and X3_2. The names with "_2" at the end are pure quadratic effects; the others are interactions.

One way to generate a design matrix that has all pairwise interactions is to use the EFFECT statement in PROC GLMSELECT. The interactions are written to the OUTDESIGN= data set. The columns of the design matrix have interpretable names, which are stored in the _GLSMOD macro variable.

data Have;         /* create the example data */
set sashelp.class(rename=(Age=X1 Height=X2 Weight=X3));
if _N_=3 then X1=.;    /* add a missing value */
keep _NUMERIC_;
run;
 
/* Use PROC GLMSELECT to generate all pairwise interactions.
   Step 1: Add a fake response variable */
%let DSIn  = Have;         /* the name of the input data set */
%let DSOut = Want;         /* the name of the output data set */
data AddFakeY / view=AddFakeY;
   set &DSIn;
   _Y = 0;      /* add a fake response variable */
run;
 
/* Step 2: Use the EFFECT statement to generate the degree-2 effects, as shown in 
      https://blogs.sas.com/content/iml/2017/09/07/polynomial-effects-regression-sas.html */
proc glmselect data=AddFakeY NOPRINT outdesign(addinputvars)=&DSOut(drop=_Y);
   effect poly2 = polynomial( X1 X2 X3 / degree=2);
   model _Y = poly2 /  noint selection=none;  /* equiv to X1*X1 X1*X2 X1*X3 X2*X2 X2*X3 X3*X3 */
run;
%put &=_GLSMOD;    /* look at names of the interaction effects  in the OUTDESIGN= data */
 
proc print data=&DSOut(obs=5);
   var X1_2 X1_X2 X1_X3 X2_2 X2_X3 X3_2; /* names are generated automatically */
run;

The output from PROC GLMSELECT contains the "pure quadratic" interactions (X1_2, X2_2, and X3_2) and the cross-variable interactions (X1_X2, X1_X3, and X2_X3). If you have k variables, there will be k pure quadratic terms and "k choose 2" cross-variable terms. Hence, the total number of quadratic interactions is "(k+1) choose 2," which is (k+1)*k/2. Here, k=3 and there are six quadratic interactions.
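
As a quick sanity check on that count (a trivial sketch, not part of the original post), you can evaluate the formula with the COMB function:

data _null_;
   k = 3;                      /* number of continuous variables */
   nQuad = comb(k+1, 2);       /* k pure quadratic terms + comb(k,2) cross terms */
   put "Number of quadratic interactions: " nQuad;    /* 6 */
run;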

Notice how PROC GLMSELECT handles the missing value in the third observation: because the X1 value is missing, the procedure puts a missing value into all interaction effects.

The horizontal direct product between matrices

The horizontal direct product between matrices A and B is formed by the elementwise multiplication of their columns. The operation is most often used to form interactions between dummy variables for two categorical variables. If A is the design matrix for the categorical variable C1 and B is the design matrix for the categorical variable C2, then HDIR(A,B) is the design matrix for the interaction effect C1*C2.

The following simple example shows how the HDIR function works. The HDIR function returns a matrix whose columns are formed by elementwise multiplication of the columns of A and the matrix B. The first set of columns is the product A[,1]#B, the second set is A[,2]#B, and so forth. If A has k1 columns and B has k2 columns, then HDIR(A,B) has k1*k2 columns.

proc iml;
/* horizontal direct product multiplies each column of A by each column of B */
A = {1 2,
     2 3,
     4 5};
B = {0  1 2,
     1  1 3,
     2 -1 4};
C = hdir(A,B);
print C[c={'A1_B1' 'A1_B2' 'A1_B3' 'A2_B1' 'A2_B2' 'A2_B3'} L='hdir(A,B)'];
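
The example above uses small numeric matrices. As mentioned earlier, the most common use of the HDIR function is to cross dummy variables for two categorical variables. Here is a minimal sketch (continuing the same PROC IML session) that uses the DESIGN function to create the dummy columns and crosses them; the variables and level values are made up for illustration:

C1 = {1, 1, 2};                  /* categorical variable with levels 1 and 2 */
C2 = {1, 2, 1};                  /* categorical variable with levels 1 and 2 */
D1 = design(C1);                 /* 3 x 2 dummy-variable design matrix for C1 */
D2 = design(C2);                 /* 3 x 2 dummy-variable design matrix for C2 */
D12 = hdir(D1, D2);              /* 3 x 4 design matrix for the C1*C2 interaction */
print D12[c={'C1_1_C2_1' 'C1_1_C2_2' 'C1_2_C2_1' 'C1_2_C2_2'}];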

Interactions of continuous variables

The previous section shows that the HDIR function forms the elementwise products of pairs of columns. Consequently, if X is a data matrix of continuous variables, the call HDIR(X,X) generates all pairwise products of the columns of X. Unfortunately, the matrix HDIR(X,X) contains more columns than we need. If X contains k columns, we need only the "(k+1) choose 2" = (k+1)k/2 columns of interactions, whereas HDIR(X,X) contains k*k columns. The problem is that HDIR(X,X) contains columns for both X1*X2 and X2*X1, even though those columns are identical. The same holds for other crossed terms such as X1*X3 and X3*X1, X2*X3 and X3*X2, and so forth.

If you want to use the HDIR function to generate all pairwise interactions, you have a choice: You can generate ALL products of pairs and delete the redundant ones, or you can compute only the unique pairs. I will show the latter because it is more efficient.

varNames = 'X1':'X3';
use Have;
   read all var varNames into X;
close;
 
/* Compute only the columns we need */
A1 = HDIR(X[,1], X[,1:3]);    /* interaction of X1 with X1, X2, X3 */
A2 = HDIR(X[,2], X[,2:3]);    /* interaction of X2 with X2 X3 */
A3 = HDIR(X[,3], X[,3]);      /* interaction of X3 with X3 */
A = A1 || A2 || A3;
 
/* get the HEAD module from 
   https://blogs.sas.com/content/iml/2021/02/10/print-top-rows-of-data-sas.html
*/
load module=(Head);
run Head(A) colname={'X1_2' 'X1_X2' 'X1_X3' 'X2_2' 'X2_X3' 'X3_2'};

The HEAD module displays only the top five rows of the matrix of interactions. This output is almost the same as the output from PROC GLMSELECT. A difference is how missing values propagate. For the third row, X1 is missing. PROC GLMSELECT puts a missing value in all columns of the pairwise interactions. In contrast, the SAS/IML output only has missing values in the first three columns because those are the only columns that involve the X1 variable. If your data contain missing values, you might want to use the CompleteCases function to delete the rows that have missing values before performing additional analyses.
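
For example, a short sketch (continuing the session above) that restricts the interaction matrix to complete cases might look like this; you would subset the original variables and the response by the same row indices:

idx  = completecases(A);       /* row indices of A that contain no missing values */
A_cc = A[idx, ];               /* interaction matrix restricted to complete cases */
X_cc = X[idx, ];               /* subset the original variables the same way */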

A function that computes interactions of continuous variables

The previous section computes all quadratic interactions for a matrix that has three variables. It is straightforward to generalize the idea to k variables. You simply compute the interactions HDIR(X[,i],X[,i:k]) for i=1,2,...,k. This is done in the following program. During the computation, it is convenient to also generate names for the interactions. The following program generates the same names as PROC GLMSELECT. To make the code easier to understand, I encapsulate the logic for names into a helper function called GetNamePairs. The helper function is called as part of the InteractPairs module, which returns both the matrix of interactions and the character vector that names the columns:

/* Helper function: Form names of interaction effects.
   Input: s is a scalar name. t is a character vector of names. 
   Return row vector of pairs of names "s_2" or "s_t[i]" 
   Example: GetNamePairs('X1', 'X1':'X3') returns {'X1_2' 'X1_X2' 'X1_X3'} */
start GetNamePairs(s, t);
   k = nrow(t)*ncol(t);
   b = blankstr(nleng(s)+nleng(t)+1);  /* max length of interaction name */
   pairs = j(1, k, b);
   do i = 1 to k;
      if s=t[i] then pairs[i] = strip(s) + "_2";
      else           pairs[i] = catx("_", strip(s), strip(t[i])); 
   end;
   return pairs;
finish;
 
/* Generate all quadratic interactions for continuous variables.
   Input: design matrix X and a vector of column names.
   Output: OutX:     a matrix whose columns are the pairwise interactions
           OutNames: the names of interactions, such as Age_2 and Height_Age */
start InteractPairs(OutX, OutNames, X, Names);
   k = ncol(X);
   numInteract = comb(k+1,2);  /* = k + comb(k,2) */
   OutX = j(nrow(X), numInteract, .);
   OutNames = j(1, numInteract, BlankStr(2*nleng(Names)+1));
   col = 1;              /* initial column to fill */
   do i = 1 to k;
      m = k-i+1;         /* number of interaction for this loop */
      OutX[,col:col+m-1]     = HDIR(X[,i], X[,i:k]);
      OutNames[,col:col+m-1] = GetNamePairs(Names[i], Names[i:k]);
      col = col + m;
   end;
finish;
 
run InteractPairs(OutX, OutNames, X, varNames);
run Head(OutX) colname=OutNames;

The output is identical to the earlier output. The input arguments to this module can have an arbitrary number of columns. In this example, the design matrix does not include an intercept column (a column that is all ones). Consequently, the output is the set of all quadratic interactions. If your design matrix includes an intercept column, the output will contain an intercept column and all main effects in addition to the quadratic effects.

Technically, you don't need to use the HDIR function in the InteractPairs module. Instead, you can use the familiar '#' operator to perform the elementwise multiplication. However, if you try to generalize the module to handle the interaction between categorical variables and continuous variables, the HDIR function will be useful.
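
To illustrate that claim with the small matrices defined earlier (a sketch that assumes the PROC IML session above is still active), the '#' operator replicates a conforming column vector across the columns of a matrix, so it produces the same products as the HDIR function when one argument is a single column:

P_hash = A[,1] # B;             /* '#' replicates the column across the columns of B */
P_hdir = hdir(A[,1], B);        /* the same products via the HDIR function */
print (P_hash - P_hdir)[L='Difference (all zeros)'];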

Summary

This article introduces the HDIR function in SAS/IML, which computes the horizontal direct product of matrices. You can use the HDIR function to generate interaction effects in regression models. This article shows how to generate all quadratic interactions among a set of continuous variables.

The post Generate all quadratic interactions in a regression model appeared first on The DO Loop.

November 23, 2020
 

I previously showed how to create a decile calibration plot for a logistic regression model in SAS. A decile calibration plot (or "decile plot," for short) is used in some fields to visualize agreement between the data and a regression model. It can be used to diagnose an incorrectly specified model.

In principle, the decile plot can be applied to any regression model that has a continuous regressor. In this article, I show how to create two decile plots for an ordinary least squares regression model. The first overlays a decile plot on a fit plot, as shown to the right. The second decile plot shows the relationship between the observed and predicted responses.

Although this article shows how to create a decile plot in SAS, I do not necessarily recommend them. In many cases, it is better and simpler to use a loess fit, as I discuss at the end of this article.

A misspecified model

A decile plot can be used to identify an incorrectly specified model. Suppose, for example, that a response variable (Y) depends on an explanatory variable (X) in a nonlinear manner. If you model Y as a linear function of X, then you have specified an incorrect model. The following SAS DATA step simulates a response variable (Y) that depends strongly on a linear effect and weakly on quadratic and cubic effects for X. Because the nonlinear dependence is weak, a traditional fit plot does not indicate that the model is misspecified:

/* Y = b0 + b1*x + b2*x**2 + b3*x**3 + error, where b2 and b3 are small */
%let N = 200;                      /* sample size */ 
data Have(keep= Y x);
call streaminit(1234);             /* set the random number seed */
do i = 1 to &N;                    /* for each observation in the sample  */
   x = rand("lognormal", 0, 0.5);  /* x = explanatory variable */
   y = 20 + 5*x - 0.5*(x-3)**3 +   /* include weak cubic effect */
            rand("Normal", 0, 4); 
   output;
end;
run;
 
proc reg data=Have;               /* Optional: plots=residuals(smooth); */
   model y = x;
quit;

The usual p-values and diagnostic plots (not shown) do not reveal that the model is misspecified. In particular, the fit plot, which shows the observed and predicted response versus the explanatory variable, does not indicate that the model is misspecified. At the end of this article, I discuss how to get PROC REG to overlay additional information about the fit on its graphs.

Overlay a decile plot on a fit plot

This section shows how to overlay a decile plot on the fit plot. The fit plot shows the observed values of the response (Y) for each value of the explanatory variable (X). You can create a decile fit plot as follows:

  1. Bin the X values into 10 bins by using the deciles of the X variable as cut points. In SAS, you can use PROC RANK for this step.
  2. For each bin, compute the mean Y value and a confidence interval for the mean. At the same time, compute the mean X value in each bin, which will be the horizontal plotting position for the bin. In SAS, you can use PROC MEANS for this step.
  3. Overlay the fit plot with the 10 confidence intervals and mean values for Y. You can use PROC SGPLOT for this step.

/* 1. Assign each obs to a decile for X */
proc rank data=Have out=Decile groups=10 ties=high;
   var x;           /* variable on which to group */
   ranks Decile;    /* name of variable to contain groups 0,1,...,k-1 */
run;
 
/* 2. Compute the mean and the empirical proportions (and CI) for each decile */
proc means data=Decile noprint;
   class Decile;
   types Decile;
   var y x;
   output out=DecileOut mean=YMean XMean 
          lclm=yLower uclm=yUpper;  /* only compute limits for Y */
run;
 
/* 3. Create a fit plot and overlay the deciles. */
data All;
   set Have DecileOut; 
run; 
 
title "Decile Plot";
title2 "Observed and Predicted Response versus Explanatory";
proc sgplot data=All noautolegend;
   reg x=x y=y / nomarkers CLM alpha=0.1 name="reg";
   scatter x=xMean y=YMean / yerrorlower=yLower yerrorupper=yUpper 
           name="obs" legendlabel="Mean Y and 90% CI" markerattrs=GraphData2;
   fringe x / transparency=0.5;
   xaxis label="x" values=(0 to 4.5 by 0.5) valueshint;
   yaxis label="y" offsetmin=0.05;
   keylegend "reg" "obs" / position=NW location=inside across=1;
run;

The graph is shown at the top of this article. I use the FRINGE statement to display the locations of the X values at the bottom of the plot. The graph indicates that the model might be misspecified. The mean Y values for the first three deciles are all above the predicted value from the model. For the next six deciles, the mean Y values are all below the model's predictions. This kind of systematic deviation from the model can indicate that the model is missing an important component. In this case, we know that the true response is cubic whereas the model is linear.

A decile plot of observed versus predicted response

You can create a fit plot when the model contains a single continuous explanatory variable. If you have multiple explanatory variables, you can assess the model by plotting the observed responses against the predicted responses. You can overlay a decile plot by computing the average observed response for each decile of the predicted response. When you hear someone use the term "decile plot," they are probably referring to this second plot, which is applicable to most regression models.

The SAS program is almost identical to the program in the previous section. The main difference is that you need to create an output data set (FIT) that contains the predicted values (Pred). You then use the Pred variable instead of the X variable throughout the program, as follows:

/* 1. Compute predicted values */
proc reg data=Have noprint;
   model y = x;
   output out=Fit Pred=Pred;
quit;
 
/* 2. Assign each obs to a decile for Pred */
proc rank data=Fit out=Decile groups=10 ties=high;
   var Pred;        /* variable on which to group */
   ranks Decile;    /* name of variable to contain groups 0,1,...,k-1 */
run;
 
/* 3. Compute the mean and the empirical proportions (and CI) for each decile */
proc means data=Decile noprint;
   class Decile;
   types Decile;
   var y Pred;
   output out=DecileOut mean=YMean PredMean 
          lclm=yLower uclm=yUpper;  /* only compute limits for Y */
run;
 
/* 4. Create an Observed vs Predicted plot and overlay the deciles. */
data All;
   set Fit DecileOut; 
run; 
 
title "Decile Plot";
title2 "Observed Response verus Predicted Response";
proc sgplot data=All noautolegend;
   lineparm x=0 y=0 slope=1 / lineattrs=(color=grey pattern=dash) clip; /* add identity line */
   scatter x=PredMean y=YMean / yerrorlower=yLower yerrorupper=yUpper 
           name="obs" legendlabel="Mean Y and 95% CI" markerattrs=GraphData2;
   fringe Pred;
   xaxis label="Predicted Response";
   yaxis label="Observed Response" offsetmin=0.05;
   keylegend "obs" / position=NW location=inside across=1;
run;

The decile plot includes a diagonal line, which is the line of perfect agreement between the model and the data. In a well-specified model, the 10 empirical means of the deciles should be close to the line, and they should vary randomly above and below the line. The decile plot for the misspecified model shows a systematic trend rather than random variation. For the lower 30% of the predicted values, the observed response is higher (on average) than the predictions. When the predicted values are in the range [29, 32] (the fourth through ninth deciles), the observed response is lower (on average) than the predictions. Thus, the decile plot indicates that the linear model might be missing an important nonlinear effect.

The loess alternative

A general principle in statistics is that you should not discretize a continuous variable unless absolutely necessary. You can often get better insight by analyzing the continuous variable. In accordance with that principle, this section shows how to identify misspecified models by using a loess smoother rather than by discretizing a continuous variable into deciles.

In addition to the "general principle," a decile plot requires three steps: compute the deciles, compute the means for the deciles, and merge the decile information with the original data so that you can plot it. As I discuss at the end of my previous article, it is often easier to overlay a loess smoother.

The following statements use the LOESS statement in PROC SGPLOT to add a loess smoother to a fit plot. If the loess fit is not approximately linear, it could indicate that the linear model is not appropriate.

title "Linear Regression Fit";
proc sgplot data=Have;
   reg x=x y=y / clm;              /* fit linear model y = b0 + b1*x */
   loess x=x y=y / nomarkers;      /* use loess fit instead of decile plot */
run;

This plot is the continuous analog to the first decile plot. It tells you the same information: For very small and very large values of X, the predicted values are too low. The loess curve suggests that the linear model is lacking an effect.

The previous plot is useful for single-variable regressions. But even if you have multiple regressors, you can add a loess smoother to the "observed versus predicted" plot. If the FIT data set contains the observed and predicted response, you can create the graph as follows:

title "Observed versus Predicted";
proc sgplot data=Fit;
   loess x=Pred y=y;
   lineparm x=0 y=0 slope=1 / lineattrs=(color=grey pattern=dash) clip legendlabel="Identity Line";
run;

This plot is the continuous analog to the second decile plot. The loess curve indicates that the observed response is higher than predicted for small and large predicted values. When the model predicts values of Y in the range [29.5, 32], the predicted values tend to be larger than the observed values. This is the same information that the decile plot provides, but the loess curve is easier to use and gives more information because it does not discretize a continuous variable.

Finally, you can add a loess smoother to a residual plot by using an option in PROC REG. In a well-specified model, the residual plot should not show any systematic trends. You can use the PLOTS=RESIDUALS(SMOOTH) option on the PROC REG statement to add a loess smoother to the residual plot, as follows:

proc reg data=Have plots=residuals(smooth);
   model y = x;
   ods select ResidualPlot;
quit;

The loess smoother on the residual plot shows that the residuals seem to have a systematic component that is not captured by the linear model. When you see the loess fit, you might be motivated to fit models that are more complex. You can use this option even for models that have multiple explanatory variables.

Summary

In summary, this article shows how to create two decile plots for least-square regression models in SAS. The first decile plot adds 10 empirical means to a fit plot. You can create this plot when you have a single continuous regressor. The second decile plot is applicable to models that have multiple continuous regressors. It adds 10 empirical means to the "observed versus predicted" plot. Lastly, although you can create decile plots in SAS, in many cases it is easier and more informative to use a loess curve to assess the model.

The post Decile plots in SAS appeared first on The DO Loop.