June 21, 2019
 

For every project in SAS®, the first step is almost always making your data available. This blog shows you how to load three of the most common input data types—a data set, a text file, and a Microsoft Excel file—into SAS® Cloud Analytic Services (CAS) tables.

The three methods that I show here are the three easiest ways to load each data type into CAS. Multiple tools can load data into CAS, but I am showing the tools that I consider the easiest to use and that are probably the most familiar to SAS programmers.

You need to place your data in a location that can be accessed by the programming environment that is used to access CAS. The most common programming environment that accesses CAS is SAS® Studio. Input data files that are used in CAS are often very large, so you will probably need to use an SFTP tool to move your data from your PC to a directory that can be accessed by your SAS Studio session. (Check with your system administrator to see what the preferred tool is at your site.)

After your data is in a location that can be accessed by the programming environment, you need to start a CAS session. This is done with the CAS statement; here is the syntax:

cas session-name <option(s)>;

The options that you specify depend on how your system administrator configured your environment. For example, I asked my system administrator to set it up so that the only thing I need to do is issue the following statement:

cas;

That statement then creates a CAS session with the default name of CASAUTO, with an active caslib of CASUSER.
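
As a minimal sketch of the longer form of the statement, the following code names the session and sets two session options. The session name mysess and the option values are illustrative assumptions. Whether you also need connection options depends on how your site is configured.

```sas
/* Start a named session; mysess and the option values are illustrative */
cas mysess sessopts=(caslib=casuser timeout=1800);

/* List the sessions that this SAS client knows about */
cas _all_ list;

/* End the named session when you are finished with it */
cas mysess terminate;
```

The LIST and TERMINATE forms are handy when you are not sure whether a session is still active.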

After you establish your CAS session, you can start loading data.

Load a SAS data set

The easiest way to load SAS data into CAS is to use a DATA step. The basic syntax is the same as it is when you are creating a SAS data set. The key difference is that the libref that is listed in the DATA step must point to a caslib.

The following example accesses SASHELP.CARS and then creates the table CARS in the CASUSER caslib.

cas;                      /* log on to the CAS session */
caslib _all_ assign;      /* create a libref for each active caslib */

data casuser.cars;        /* create the table in the caslib */
   set sashelp.cars;
run;

The one thing to note about this code is that this DATA step is running in SAS and not CAS. A DATA step runs in CAS only if the input and output librefs are both using the CAS engine and only if it uses language elements that CAS supports. In this example, the SASHELP libref was not created with the CAS engine, so the step must run in SAS.
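
To make the distinction concrete, here is a small sketch that contrasts the two cases. The table name CARS_KPL and the derived column are made up for illustration. When a step does run in CAS, the log normally includes a note that says the DATA step is running in Cloud Analytic Services.

```sas
cas;
caslib _all_ assign;

/* Runs in SAS: the input libref SASHELP does not use the CAS engine */
data casuser.cars;
   set sashelp.cars;
run;

/* Input and output are both in a caslib, so this step is eligible to run in CAS */
data casuser.cars_kpl;                 /* hypothetical table name */
   set casuser.cars;
   KPL_City = MPG_City * 0.425144;     /* miles per gallon to km per liter */
run;
```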

There are no calculations in this step, so there is no effect on performance.

When you load a data set into a CAS table, one task that you might want to perform is to promote the table. The following DATA step shows how to promote a table as you create it:

data casuser.cars(promote=yes);
   set sashelp.cars;
run;

Promoting a table gives it a global scope. You can then access the table in multiple sessions, and the table is also available in SAS® Visual Analytics. The PROMOTE= data set option can be used with all three of the examples in this blog.
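
One practical wrinkle: a promotion typically fails if a global-scope table with the same name already exists in the caslib. The following sketch uses PROC CASUTIL to list the tables in CASUSER and to drop an existing CARS table before you rerun the load. Treat it as one possible cleanup step, not a required one.

```sas
proc casutil;
   list tables incaslib="casuser";                      /* show the tables in the caslib */
   droptable casdata="cars" incaslib="casuser" quiet;   /* QUIET suppresses the error if CARS is absent */
quit;
```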

Load a delimited text file

The easiest way to load a delimited text file and store it as a CAS table is to use another familiar SAS step, the IMPORT procedure. The syntax is going to be basically the same as it is in SAS. Again, the key difference is that you need to point to a caslib in the OUT= option of the PROC IMPORT statement.

If you are running SAS® Studio 5.1, a common location for a text file that needs to be loaded into CAS is the SAS Content folder. This folder is a predefined repository for text files that you need to access from SAS Studio.

In order to access the files from SAS Content, you need to use the FILESRVC access method with the FILENAME statement. This method enables you to store and retrieve content using the SAS® Viya® Files service. Here is the basic syntax:

filename fileref filesrvc folderpath='path'
            filename='name';

For more information about this access method, see SAS 9.4 Global Statements Reference.

In the following example, PROC IMPORT and the FILESRVC access method are used to load the class.csv file from the SAS Content folder. The resulting ALLCLASS table is written to the CASUSER caslib.

filename myfile filesrvc folderpath='/Users/saskir'
         filename='class.csv';

proc import datafile=myfile out=casuser.allclass(promote=yes)
     dbms=csv;
run;

Load an Excel file

This section shows you how to load an Excel file into a CAS table by using PROC IMPORT. Loading an Excel file using PROC IMPORT requires that you have SAS/ACCESS® Interface to PC Files. If you are unsure whether you have this product, you can write all of your licensed products to the log by using the SETINIT procedure:

proc setinit;
run;

You should see the following in the log when the product is licensed:

---SAS/ACCESS Interface to PC Files

After you confirm that you have this product, you can adapt the following PROC IMPORT code. This example loads the ReportTest.xlsx file and stores it in a table named ReportTest in the CASUSER caslib.

cas;
caslib _all_ assign;

proc import datafile='/viyashare/ReportTest.xlsx'
     out=casuser.ReportTest dbms=xlsx;
run;
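
If the workbook has more than one worksheet, you might want to say which sheet to read. The following variation is a sketch that assumes a worksheet named Sheet1 (a hypothetical name) whose first row contains the column names.

```sas
proc import datafile='/viyashare/ReportTest.xlsx'
     out=casuser.ReportTest dbms=xlsx;
   sheet='Sheet1';    /* hypothetical worksheet name */
   getnames=yes;      /* read column names from the first row */
run;
```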

There are other methods

The purpose of this blog was to show you the easiest ways to load a SAS data set, text file, and Excel file into CAS. Although there are multiple ways to accomplish these tasks, both programmatically and interactively, these methods are the easiest and most straightforward ways to get data into CAS.

Learn the three easiest ways to load data into CAS tables was published on SAS Users.

June 20, 2019
 

Move over video games and sports. Make room for escape rooms. This burgeoning form of entertainment found its roots in the video gaming movement. Escape rooms tap into a player's drive to reach the next level, solve a puzzle and win. Escape rooms present a physical game that traps you until your brain (and teamwork) help you escape. Does that sound exhilarating or terrifying? Maybe a bit of both.

SAS adapted this concept and built its own Data Science Escape Rooms at SAS Global Forum 2019. More than 800 customers used SAS software to solve a series of problems in one of six rooms, spanning three different themes: soccer, wildlife and cyberattack.

I had the opportunity to support these rooms by assisting with registration and check-in. I also "covered" the experience through videos that shared a behind-the-scenes look, player reactions and backstories on how the rooms were developed.

I even got to chat with one team that completed all three rooms. The Advanced Claims Analysis Team with the United States Office of Personnel Management, aka Team "Auditgators," included Kevin Sikora, Julie Zoeller, Richard Allen and Lauren Goob. They spoke about how cool it was to play with software they're not used to in pseudo real-world scenarios.

What the escape rooms were really like

Let me give you a lay of the land. The rooms included four stations each with a computer screen and a mouse (no keyboards). At each station, players used products like SAS Visual Analytics and SAS Visual Investigator to solve a challenge, working against the clock to escape the room within 20 minutes. Talk about pressure.

A team studies a problem at one of four stations in the Wildlife-themed escape room.

"You definitely felt a sense of urgency in there," recalls Sikora. "We really got into the wildlife room – it's where we received the highest score. We had the chance to do some great team-building too. We're more cohesive having shared this experience."

The Auditgators worked separately to solve problems and then collaborated to piece together the bigger puzzle. When they got stuck, they asked for help from the "gamemasters," who were literally behind the walls of the rooms. Gamemasters doled out clues to nudge teams along.

This may seem like a fun diversion, but how in the world can you apply this experience to your daily work? "Well we certainly realized the value in using more graphs, to click and filter to find answers we need," said Zoeller. "It was enlightening to see how we could show the data, and it gave us a better awareness of the end goal, the big picture."

Allen and Goob have been to SAS Global Forums in the past and said that the Data Science Escape Rooms gave them a better opportunity to interact with one another than anything else at the event. "It breaks up sitting in rooms and listening, breaks up the monotony. It gave us a chance to do something together as a team."

More reactions and backstories

Want a taste of the action at the event? Here, SAS' Lisa Dodson, Data Science Escape Room Gamemaster, couldn't hide her excitement to give clues and cheer on teams.

 
 
Michael Gibbs with the University of Arkansas described the intensity and pressure to get the answer and move on while escaping.

 
 
SAS partnered with SciSports to build the soccer-themed escape room. SciSports Founder and CIO Giels Brouwer and Data Scientist Mick Bosma explained how the idea came about and plans to take the room on the road.

 
 
While most participants enjoyed the challenge, not everyone "escaped" successfully. The most successful teams had solid teamwork and good communication. Sound like a lesson for "real life"? SAS' Alfredo Iglesias Rey reports that 18 percent of the teams completed all four challenges correctly. He also marveled at how players were able to create advanced machine learning techniques with just a mouse and keyboard.

 
 
Dying to check out a Data Science Escape Room yourself? You'll find them at several local SAS Forums this year, and plans are underway to offer them at other venues. But you don't have to wait for an escape room to get a feel for SAS software. Try SAS online right now, for free.

A playful way to get your hands on SAS: the Data Science Escape Rooms was published on SAS Users.

June 19, 2019
 

A previous article describes the DFBETAS statistics for detecting influential observations, where "influential" means that if you delete the observation and refit the model, the estimates for the regression coefficients change substantially. Of course, there are other statistics that you could use to measure influence. Two popular ones are the DFFITS statistic and Cook's distance, which is also known as Cook's D statistic. Both statistics measure the change in predicted values that occurs when you delete an observation and refit the model. This article describes the DFFITS and Cook's D statistics and shows how to compute and graph them in SAS.

DFFITS: How the predicted value changes if an observation is excluded

If you exclude an observation from a model and refit, the predicted values will change. The DFFITS statistic is a measure of how the predicted value at the i_th observation changes when the i_th observation is deleted. High-leverage points tend to pull the regression surface towards the response at that point, so the change in the predicted value at that point is a good indication of how influential the observation is. So that the DFFITS values are independent of the scale of the data, the change in predicted values is scaled by dividing by the standard error of the predicted value at that point. The exact formula is given in the documentation for PROC REG. The book Regression Diagnostics by Belsley, Kuh, and Welsch (1980) suggests that an observation is influential if the magnitude of its DFFITS value exceeds 2*sqrt(p/n), where p is the number of effects in the model and n is the sample size.
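
For readers who like to see the definition written out, here is the usual textbook form of the statistic. The notation is my own shorthand and is not quoted from the PROC REG documentation: the symbol with the subscript (i) denotes a quantity estimated without the i_th observation, and h_ii is the leverage of the i_th observation.

```latex
\[
\mathrm{DFFITS}_i \;=\; \frac{\hat{y}_i - \hat{y}_{(i)}}{s_{(i)}\sqrt{h_{ii}}},
\qquad
\text{flag observation } i \text{ when } \;
\left|\mathrm{DFFITS}_i\right| \;>\; 2\sqrt{p/n}
\]
```

Here \hat{y}_{(i)} is the prediction at the i_th point from the model that is fit without that point, and s_{(i)} is the root mean square error of that reduced fit.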

PROC REG provides three ways to generate the DFFITS statistics for each observation:

  • You can create a graph of the DFFITS statistics by using the PLOTS=DFFITS option.
  • You can also display a table of the DFFITS (and other influence statistics) by using the INFLUENCE option in the MODEL statement.
  • You can write the DFFITS statistics to a data set by using the DFFITS= option in the OUTPUT statement.

The following statements extract a subset of n = 84 vehicles from the Sashelp.Cars data, create a short ID variable for labeling observations, and sort the data by the response variable, MPG_City. The data are sorted because the DFFITS statistic is graphed against the observation number, which is an arbitrary quantity. By sorting the data, you know that small observation numbers correspond to low values of the response and so forth. If you have a short ID variable, you can label the influential observations by using the LABEL suboption, as follows:

/* Create sample data */
data cars;
   set sashelp.cars;
   where Type in ('SUV', 'Truck');
   /* make short ID label from Make and Model values */
   length IDMakeMod $20;
   IDMakeMod = cats(substr(Make,1,4), ":", substr(Model,1,5));
run;
 
/* Optional but helpful: Sort by response variable */
proc sort data=cars;
   by MPG_City;
run;
 
proc reg data=Cars plots(only) = DFFITS(label); 
   model MPG_City = EngineSize HorsePower Weight;
   id IDMakeMod;
run; quit;

The DFFITS graph shows that three observations have a large positive DFFITS value. The observations are the Ford Excursion, the Ford Ranger, and the Mazda B2300. For these observations, the predicted value (at the observation) is higher with the observation included in the model than if it were excluded. Thus, these observations "pull the regression up." There are four observations that have large negative DFFITS, which means that these observations "pull the regression down." They include the Land Rover Discovery and the Volvo XC90.

Cook's D: A distance measure for the change in regression estimates

When you estimate a vector of regression coefficients, there is uncertainty. The confidence region for the parameter estimates is an ellipsoid in k-dimensional space, where k is the number of effects that you are estimating (including the intercept). Cook (1977) defines a distance that the estimates move within the confidence ellipsoid when the i_th point is deleted. Equivalently, Cook shows that the statistic is proportional to the squared studentized residual for the i_th observation. The documentation for PROC REG provides a formula in terms of the studentized residuals.
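
The two descriptions can be written side by side. This is the standard textbook form, in my own notation and not copied from the PROC REG documentation: \hat{\beta}_{(i)} is the estimate without the i_th observation, s^2 is the mean square error of the full model, r_i is the (internally) studentized residual, and h_ii is the leverage.

```latex
\[
D_i \;=\;
\frac{\bigl(\hat{\beta} - \hat{\beta}_{(i)}\bigr)^{\mathsf T} X^{\mathsf T} X \,\bigl(\hat{\beta} - \hat{\beta}_{(i)}\bigr)}{k\, s^{2}}
\;=\;
\frac{r_i^{2}}{k}\cdot\frac{h_{ii}}{1-h_{ii}}
\]
```

The first expression is the distance that the estimates move, measured in the metric of the confidence ellipsoid. The second shows the proportionality to the squared studentized residual, with the leverage ratio as the other factor.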

By default, PROC REG creates a plot of Cook's D statistic as part of the panel of diagnostic plots. (Cook's D is the second row and third column.) You can create a larger stand-alone plot by using the PLOTS=COOKSD option. Optionally, you can label the influential points (those whose Cook's D statistic exceeds 4/n) by using the LABEL suboption, as shown below:

/* create multiple plots and label influential points */
proc reg data=Cars plots(only) = (CooksD(label) DFFits(label));   
   model MPG_City = EngineSize HorsePower Weight;
   id IDMakeMod;
   output out=RegOut pred=Pred rstudent=RStudent dffits=DFFits cookd=CooksD; /* optional: output statistics */
run; quit;

In many ways, the plot of Cook's D looks similar to a plot of the squared DFFITS statistics. Both measure a change in the predicted value at the i_th observation when the i_th observation is excluded from the analysis. The formula for Cook's D statistic squares a residual-like quantity, so it does not show the direction of the change, whereas the DFFITS statistics do show the direction. Otherwise, the observations that are "very influential" are often the same for both statistics, as seen in this example.
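
If you prefer a table to a graph, you can filter the RegOut data set that the OUTPUT statement creates. The following sketch assumes p = 4 (the intercept plus three regressors) and uses 4/n as the cutoff for Cook's D, which is the reference line that PROC REG draws. For n = 84, the cutoffs are approximately 0.44 for DFFITS and 0.048 for Cook's D. The names InflFit, DFFitsCutoff, and CooksDCutoff are my own.

```sas
data InflFit;
   set RegOut nobs=n;                  /* n = number of observations in RegOut */
   p = 4;                              /* assumption: intercept + 3 regressors */
   DFFitsCutoff = 2*sqrt(p/n);
   CooksDCutoff = 4/n;
   ObsNum = _N_;
   /* keep an observation if either statistic exceeds its cutoff */
   if abs(DFFits) > DFFitsCutoff or CooksD > CooksDCutoff;
run;

proc print data=InflFit noobs;
   var ObsNum IDMakeMod MPG_City DFFits DFFitsCutoff CooksD CooksDCutoff;
run;
```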

The post Influential observations in a linear regression model: The DFFITS and Cook's D statistics appeared first on The DO Loop.

June 19, 2019
 

As a data scientist, have you ever reached the point where you needed an evolved analytics platform that brings together the disparate skills of open source and commercial software, a system that enables advanced analytic capabilities? This is now possible and easy to implement. With many deployment possibilities, SAS Viya allows you to choose where data is stored, where compute happens, and how models are deployed.

Let’s say you want to expand your model development process with SAS Viya analytical capabilities and you don’t want to wait to get such an environment up and running. Unfortunately, you have neither the infrastructure nor the experience to install SAS Viya. Going the traditional way, you would face:

  • Protracted hardware procurement and provisioning
  • Deployment planning and coordination with IT
  • Effort and time required for software installation/configuration

This path may be the right one for many organizations, but I think we all recognize that the traditional approach can take days, weeks, and yes, sometimes months.

What if you could get up and running with a full SAS Viya platform in two hours? If you have some affinity for cloud-based solutions, SAS offers you the AWS SAS Viya Cloud Rapid Deployment tool. SAS released this AWS Quick Start as a rapid deployment architecture for SAS Viya on AWS. Deployable products include SAS Visual Data Mining and Machine Learning, SAS Visual Statistics and SAS Visual Analytics.

The goal of this article is to brief you on how I launched an AWS SAS Viya Quick Start. I strongly advise you to watch this related video by my colleague Erwan Granger. Much of what is covered here appears in Erwan's video. The recording predates the SAS Viya 3.4 release, but the main concepts are still the same.

What you will need

The following is a list of items you need to complete this task.

  • AWS Account with appropriate creation privileges
  • A valid SAS Viya License; this means you will need a SAS Software Order Confirmation e-mail
  • Optional: To deploy with your own DNS name and SSL certificate, you need to register a domain managed by Amazon Route 53. For instructions on registering the domain, see the Route 53 documentation. You can request and register a certificate with AWS Certificate Manager.

Furthermore, it’s good to know this Quick Start provides two deployment options. You can deploy SAS Viya into a new Virtual Private Cloud (VPC) or into an existing VPC. The first option builds a new AWS environment consisting of the VPC, private and public subnets, NAT gateways, security groups, Ansible controllers, and other infrastructure components, and then deploys SAS Viya into this new VPC. The second option provisions SAS Viya in your existing AWS infrastructure. I decided to go for the first option.

What you will build

Here's an architectural overview of what we will build:

SAS Viya architecture on the AWS Cloud

You can find exactly the same architecture on the SAS Viya AWS Quick Start landing page.

Configure the build

We’ll be following the build process outlined in the Quick Start guide. On the landing page, next to the "What you’ll build" tab, you can click on "How to deploy". From there, launch the "Deploy into a new VPC" wizard.

Deploy into a new VPC wizard

Prerequisite prep

Make sure you sign in with your AWS account and you have chosen the region where you want to deploy. On that first screen you can leave the Amazon S3 template URL at its default. That template is the basis for the AWS CloudFormation stack we are launching. CloudFormation is a tool from AWS that allows you to spin up resources in the right order. The template is the blueprint document for your CloudFormation stack. By keeping the default template, we will build exactly the architecture displayed above.

Pre-req prep template

Now click "Next" and move to the page where we can specify more details and the required CloudFormation parameters.

CloudFormation parameters

The first parameter is the SAS Viya Software Order file, which is the Amazon S3 location of the Software Order e-mail attachment.

SAS Viya install package location

In the Administration section, you provide parameters to configure your AWS architecture. That way, you control access, the instance types, and whether you use a SAS Viya mirror repository.

CloudFormation administration parameters

Administration parameter definitions:

  • The name of an Amazon EC2 key pair, so you can access the Ansible controller
  • The Amazon Availability zone for the public and private subnet
  • Allowable IP range for HTTP traffic; must be a valid IP CIDR range
  • Allowable IP Range for SSH traffic to the Ansible controller; must be a valid IP CIDR range
  • SAS Administrator password
  • Password for Default (sasuser) user
  • Amazon EC2 Instance type for CAS Compute VM
  • Amazon EC2 Instance type for SAS Viya Services VM
  • (Optional) Location of SAS Viya Deployment Repository data
  • (Optional) Operator Email

If you want to work with custom DNS names and SSL, you will need to provide the next three parameters as well.

DNS and SSL configuration (optional)

DNS and SSL parameters:

You may accept the defaults on the remaining parameters.

Optional parameters

After clicking "Next", another set of optional parameters is available. I mostly accept the defaults provided. The lone exception is the Rollback on failure option.

Optional administration parameters

Based on what I’ve learned from Erwan's video, the safer choice is "No" on the Rollback option. This way, if the deployment process encounters issues, the log will identify the step in which the error occurred. Of course, this means you are responsible for manually deleting AWS-created resources that are no longer necessary. The easiest way to do this is by deleting the CloudFormation stacks afterward.

Kick off the build

To conclude the deployment wizard, click "Next" once more and acknowledge the necessary AWS resources to create. By clicking "Create stack" the deployment process starts.

Start the build process

You can monitor the deployment log using AWS CloudWatch. In his video, Erwan demonstrates this at around minute 23.

After a successful formation, you will find two AWS CloudFormation stacks created. The Outputs section gives you the direct links to SASStudioV and SASDrive.

SAS Studio and SAS Drive stacks

That’s it. You are deployed and ready to begin using your SAS Viya environment!

Additional Reference

Alexander Koller writes about SAS on AWS and takeaways for preparing for the AWS associate solution architect exam.

Your experiences and opinion matter

New forces are shaping the analytics ecosystem. Increased competition, rising customer expectations, and emerging technologies such as AI and machine learning are challenging IT departments to evolve their analytic ecosystems to meet the demands of their business partners.

How is your organization doing this? How does your Analytics Cloud strategy compare to the market? And what do your peers think about migrating Analytics to the cloud? We can give you some insights and an industry benchmark on the topic.

Tell us about your experience in this 5-minute survey, and we will be happy to share a detailed industry insight report with you to answer these questions.

Deploy SAS Viya on AWS - Quick Start was published on SAS Users.

June 18, 2019
 

What is Item Response Theory?

Item Response Theory (IRT) is a way to analyze responses to tests or questionnaires with the goal of improving measurement accuracy and reliability.

A common application is in testing a student’s ability or knowledge. Today, all major psychological and educational tests are built using IRT. The methodology can significantly improve measurement accuracy and reliability while providing potential significant reductions in assessment time and effort, especially via computerized adaptive testing. For example, the SAT and GRE both use Item Response Theory for their tests. IRT takes into account the number of questions answered correctly and the difficulty of the question.

In recent years, IRT models have also become increasingly popular in health behavior, quality of life, and clinical research. There are many different models for IRT. Three of the most popular are:

The Rasch model

Two-parameter model

Graded Response model

Early IRT models (such as the Rasch model and two-parameter model) concentrate mainly on dichotomous responses. These models were later extended to incorporate other formats, such as ordinal responses, rating scales, partial credit scoring, and multiple category scoring.
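
To give a feel for what fitting one of these models looks like, here is a minimal sketch that simulates dichotomous responses under a Rasch model and fits them with PROC IRT (part of SAS/STAT). Everything about the data is an assumption made for illustration: the data set name Responses, the five item difficulties, and the 500 simulated examinees.

```sas
/* Simulate 500 examinees answering 5 dichotomous items under a Rasch model */
data Responses;
   call streaminit(2019);
   array item[5] item1-item5;
   array b[5] _temporary_ (-1 -0.5 0 0.5 1);    /* hypothetical item difficulties */
   do id = 1 to 500;
      theta = rand('normal');                   /* latent ability of the examinee */
      do j = 1 to 5;
         p = 1 / (1 + exp(-(theta - b[j])));    /* probability of a correct answer */
         item[j] = rand('bernoulli', p);
      end;
      output;
   end;
   keep id item1-item5;
run;

/* Fit the Rasch model; other RESFUNC= values request other response functions */
proc irt data=Responses;
   var item1-item5;
   model item1-item5 / resfunc=rasch;
run;
```

Changing RESFUNC=RASCH to RESFUNC=TWOP or RESFUNC=GRADED requests the two-parameter or graded response model, although the graded model is meant for ordinal items and not for the 0/1 items simulated here.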

Item Response Theory Models Using SAS

Ron Cody and Jeffrey K. Smith’s book, Test Scoring and Analysis Using SAS, uses SAS PROC IRT to show how to develop your own multiple-choice tests, score students, produce student rosters (in print form or Excel), and explore item response theory (IRT).

Aimed at non-statisticians working in education or training, the book describes item analysis and test reliability in easy-to-understand terms and teaches SAS programming to score tests, perform item analysis, and estimate reliability.

For those with a more statistical background, Bayesian Analysis of Item Response Theory Models Using SAS describes how to estimate and check IRT models using the SAS MCMC procedure. Written especially for psychometricians, scale developers, and practitioners, the book provides numerous annotated programs that you can easily modify for your applications.

Assessment has played, and continues to play, an integral part in our work and educational settings. IRT models continue to be increasingly popular in many other fields, such as medical research, health sciences, quality-of-life research, and even marketing research. With the use of IRT models, you can not only improve scoring accuracy but also economize test administration by adaptively using only the discriminative items.

Interested in learning more? Check out our chapter previews available for free. Want to learn more about SAS Press? Explore our online bookstore and subscribe to our newsletter to get all the latest discounts, news, and more.

Further resources

SAS Blogs:
New at SAS: Psychometric Testing by Charu Shankar
SAS author’s tip: Bayesian analysis of item response theory models

SAS Communities:
Custom Task Tuesday: SAS Global Forum/PROC IRT Edition!

SAS Global Forum Paper:
Item Response Theory: What It Is and How You Can Use the IRT Procedure to Apply It by Xinming An and Yiu-Fai Yung

SAS Documentation:
The IRT Procedure
SAS/STAT 14.1 User Guide: The IRT Procedure
SAS/STAT 14.2 User Guide: Help Center

Understanding Item Response Theory with SAS was published on SAS Users.

June 17, 2019
 

My article about deletion diagnostics investigated how influential an observation is to a least squares regression model. In other words, if you delete the i_th observation and refit the model, what happens to the statistics for the model? SAS regression procedures provide many tables and graphs that enable you to examine the influence of deleting an observation. For example:

  • The DFBETAS are statistics that indicate the effect that deleting each observation has on the estimates for the regression coefficients.
  • The DFFITS and Cook's D statistics indicate the effect that deleting each observation has on the predicted values of the model.
  • The COVRATIO statistics indicate the effect that deleting each observation has on the variance-covariance matrix of the estimates.

These observation-wise statistics are typically used for smaller data sets (n ≤ 1000) because the influence of any single observation diminishes as the sample size increases. You can get a table of these (and other) deletion diagnostics by using the INFLUENCE option on the MODEL statement of PROC REG in SAS. However, because there is one statistic per observation, these statistics are usually graphed. PROC REG can automatically generate needle plots of these statistics (with heuristic cutoff values) by using the PLOTS= option on the PROC REG statement.

This article describes the DFBETAS statistic and shows how to create graphs of the DFBETAS in PROC REG in SAS. The next article discusses the DFFITS and Cook's D statistics. The COVRATIO statistic is not as popular, so I won't say more about that statistic.

DFBETAS: How the coefficient estimates change if an observation is excluded

The documentation for PROC REG has a section that describes the influence statistics, which is based on the book Regression Diagnostics by Belsley, Kuh, and Welsch (1980, p. 13-14). Among these, the DFBETAS statistics are perhaps the easiest to understand. If you exclude an observation from the data and refit the model, you will get new parameter estimates. How much do the estimates change? Notice that you get one statistic for each observation and also one for each regressor (including the intercept). Thus if you have n observations and k regressors, you get nk statistics.
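
Written out, the scaled change in the j_th coefficient when the i_th observation is deleted has the following standard form. The notation is my own and is not quoted from the PROC REG documentation: b_j is the estimate from the full data, b_{(i)j} is the estimate without the i_th observation, and s_{(i)} is the root mean square error of the reduced fit.

```latex
\[
\mathrm{DFBETAS}_{ij} \;=\;
\frac{b_j - b_{(i)j}}{s_{(i)}\sqrt{\bigl(X^{\mathsf T}X\bigr)^{-1}_{jj}}},
\qquad i = 1,\dots,n,\quad j = 1,\dots,k
\]
```

Dividing by the estimated standard error of b_j puts the statistics for all regressors on a common scale, which is why a single cutoff can be used for every coefficient.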

Typically, these statistics are shown in a panel of k plots, with the DFBETAS for each regressor plotted against the observation number. Because "observation number" is an arbitrary number, I like to sort the data by the response variable. Then I know that the small observation numbers correspond to low values of the response variable and large observation numbers correspond to high values of the response variable. The following DATA step extracts a subset of n = 84 vehicles from the Sashelp.Cars data, creates a short ID variable for labeling observations, and sorts the data by the response variable, MPG_City:

data cars;
   set sashelp.cars;
   where Type in ('SUV', 'Truck');
   /* make short ID label from Make and Model values */
   length IDMakeMod $20;
   IDMakeMod = cats(substr(Make,1,4), ":", substr(Model,1,5));
run;
 
proc sort data=cars;
   by MPG_City;
run;
 
proc print data=cars(obs=5) noobs;
   var Make Model IDMakeMod MPG_City;
run;

The first few observations are shown. Notice that the first observations correspond to small values of the MPG_City variable. Notice also a short label (IDMakeMod) identifies each vehicle.

There are two ways to generate the DFBETAS statistics: You can use the INFLUENCE option on the MODEL statement to generate a table of statistics, or you can use the PLOTS=DFBETAS option in the PROC REG statement to generate a panel of graphs. The following call to PROC REG generates a panel of graphs. The IMAGEMAP=ON option on the ODS GRAPHICS statement enables you to hover the mouse pointer over an observation and obtain a brief description of the observation:

ods graphics on / imagemap=on;              /* enable data tips (tooltips) */
proc reg data=Cars plots(only) = DFBetas; 
   model MPG_City = EngineSize HorsePower Weight;
   id IDMakeMod;
run; quit;
ods graphics / imagemap=off;

The panel shows the influence of each observation on the estimates of the four regression coefficients. The statistics are standardized so that all graphs can use the same vertical scale. Horizontal lines are drawn at ±2/sqrt(n) ≈ 0.22. Observations are called influential if they have a DFBETAS statistic whose magnitude exceeds that value. The graph shows a tooltip for one of the observations in the EngineSize graph, which shows that the influential point is observation 4, the Land Rover Discovery.

Each graph reveals a few influential observations:

  • For the intercept estimate, the most influential observations are numbers 1, 35, 83, and 84.
  • For the EngineSize estimates, the most influential observations are numbers 4, 35, and 38.
  • For the Horsepower estimates, the most influential observations are numbers 1, 4, and 38.
  • For the Weight estimates, the most influential observations are numbers 1, 24, 35, and 38.

Notice that several observations (such as 1, 35, and 38) are influential for more than one estimate. Excluding those observations causes several parameter estimates to change substantially.

Labeling the influential observations

For me, the panel of graphs is too small. I found it difficult to hover the mouse pointer exactly over the tip of a needle in an attempt to discover the observation number and name of the vehicle. Fortunately, if you want details like that, PROC REG supplies options that make the process easier. If you don't have too many observations, you can add labels to the DFBETAS plots by using the LABEL suboption. To plot each graph individually (instead of in a panel), use the UNPACK suboption, as follows:

proc reg data=Cars plots(only) = DFBetas(label unpack); 
   model MPG_City = EngineSize HorsePower Weight;
   id IDMakeMod;
quit;

The REG procedure creates four plots, but only the graph for the Weight variable is shown here. In this graph, the influential observations are labeled by the IDMakeMod variable, which enables you to identify vehicles rather than observation numbers. For example, some of the influential observations for the Weight variable are the Ford Excursion (1), the Toyota Tundra (24), the Mazda B400 (35), and the Volvo XC90 (38).

A table of influential observations

If you want a table that displays the most influential observations, you can use the INFLUENCE option to generate the OutputStatistics table, which contains the DFBETAS for all regressors. You can write that table to a SAS data set and exclude any observations that do not have a large DFBETAS statistic, where "large" means that the magnitude of the statistic exceeds 2/sqrt(n) and n is the sample size. In the following program, the DATA step keeps only the influential observations and PROC PRINT displays them.

ods exclude all;
proc reg data=Cars plots=NONE; 
   model MPG_City = EngineSize HorsePower Weight / influence;
   id IDMakeMod;
   ods output OutputStatistics=OutputStats;      /* save influence statistics */
run; quit;
ods exclude none;
 
data Influential;
   set OutputStats nobs=n;
   array DFB[*] DFB_:;
   cutoff = 2 / sqrt(n);
   ObsNum = _N_;
   influential = 0;
   DFBInd = '0000';                   /* binary string indicator */
   do i = 1 to dim(DFB);
      if abs(DFB[i])>cutoff then do;  /* this obs is influential for i_th regressor */
         substr(DFBInd,i,1) = '1';
         influential = 1;
      end;
   end;
   if influential;                    /* output only influential obs */
run;
 
proc print data=Influential noobs;
   var ObsNum IDMakeMod DFBInd cutoff DFB_:;
run;

The DFBInd variable is a four-character binary string that indicates which parameter estimates are influenced by each observation. Some observations are influential only for one coefficient; others (1, 3, 35, and 38) are influential for many variables. Creating a binary string for each observation is a useful trick.
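To make the binary-string trick concrete, here is a minimal sketch that decodes an indicator string into the names of the affected terms. The three sample strings are made up for illustration; they are not values from the Cars analysis. The sketch also assumes that the four positions correspond to Intercept, EngineSize, Horsepower, and Weight, which is the order in which I expect the DFB_: variables to appear.

```sas
/* Hypothetical sample strings; the term order is an assumption */
data Patterns;
   length DFBInd $4 Terms $48;
   array TermName[4] $10 _temporary_ ('Intercept' 'EngineSize' 'Horsepower' 'Weight');
   input DFBInd $;
   Terms = ' ';
   do i = 1 to 4;                       /* append the name for each '1' */
      if substr(DFBInd, i, 1) = '1' then
         Terms = catx(', ', Terms, TermName[i]);
   end;
   NumTerms = countc(DFBInd, '1');      /* how many estimates are affected */
   drop i;
   datalines;
1001
0110
1111
;

proc print data=Patterns noobs;
run;
```

Because the string is a single character variable, you can also sort or tabulate by it (for example, with PROC FREQ) to see which combinations of estimates tend to be influenced together.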

By the way, did you notice that the name of the statistic ("DFBETAS") has a capital S at the end? Until I researched this article, I assumed it was to make the word plural since there is more than one "DFBETA" statistic. But, no, it turns out that the S stands for "scaled." You can define the DFBETA statistic (without the S) to be the change in the parameter estimates, b - b(i), where b(i) is the vector of estimates when the i_th observation is excluded, but that statistic depends on the scale of the variables. To standardize the statistic, divide by the standard error of the parameter estimates. That scaling is the reason for the S at the end of DFBETAS. The same is true for the DFFITS statistic: S stands for "scaled."
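In symbols, the scaling can be sketched as follows. The notation is my own shorthand rather than something defined earlier in this article: b_j is the j_th estimate from the full data, b_(i)j is the same estimate when observation i is excluded, s_(i) is the root mean square error of the model that excludes observation i, and the square root term is the j_th diagonal element of the inverse crossproducts matrix. This is the usual textbook form of the statistic.

```latex
\mathrm{DFBETA}_{ij} = b_j - b_{(i)j},
\qquad
\mathrm{DFBETAS}_{ij} =
  \frac{b_j - b_{(i)j}}{s_{(i)} \sqrt{\left[ (X^\top X)^{-1} \right]_{jj}}}
```

The denominator is an estimate of the standard error of b_j, so DFBETAS is unitless and the same ±2/sqrt(n) reference lines can be used for every regressor.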

The next article describes how to create similar graphs for the DFFITS and Cook's D statistics.

---------------

DFFITS: How the predicted values change if an observation is excluded

The DFFITS statistic measures, for each observation, how the predicted value at that observation changes if you exclude the observation and refit the model.

Cook's D: How the sum of the predicted values changes if an observation is excluded

Cook's distance (D) statistic measures, for each observation, a scaled sum of the squared differences in the predicted values (summed over all observations) if you exclude the observation and refit the model.
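For readers who prefer formulas, the two statistics are usually written as follows. The notation is an assumption for this sketch: a subscript (i) marks a quantity computed without observation i, h_i is the leverage of observation i, p is the number of parameters, and s is the root mean square error of the full model. Note that the differences in Cook's D enter as squares.

```latex
\mathrm{DFFITS}_i = \frac{\hat{y}_i - \hat{y}_{i(i)}}{s_{(i)} \sqrt{h_i}},
\qquad
D_i = \frac{\sum_{j=1}^{n} \left( \hat{y}_j - \hat{y}_{j(i)} \right)^2}{p \, s^2}
```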

The post Influential observations in a linear regression model: The DFBETAS statistics appeared first on The DO Loop.

June 12, 2019
 

For linear regression models, there is a class of statistics that I call deletion diagnostics or leave-one-out statistics. These observation-wise statistics address the question, "If I delete the i_th observation and refit the model, what happens to the statistics for the model?" For example:

  • The PRESS statistic is similar to the residual sum of squares statistic but is based on fitting n different models, where n is the sample size and the i_th model excludes the i_th observation.
  • Cook's D statistic measures the influence of the i_th observation on the fit.
  • The DFBETAS statistics measure how the regression estimates change if you delete the i_th observation.

Although most references define these statistics in terms of deleting an observation and refitting the model, you can use a mathematical trick to compute the statistics without ever refitting the model! For example, the Wikipedia page on the PRESS statistic states, "each observation in turn is removed and the model is refitted using the remaining observations. The out-of-sample predicted value is calculated for the omitted observation in each case, and the PRESS statistic is calculated as the sum of the squares of all the resulting prediction errors." Although this paragraph is conceptually correct, the SAS/STAT documentation for PROC GLMSELECT states that the PRESS statistic "can be efficiently obtained without refitting the model n times."
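To see that claim in action, here is a minimal SAS/IML sketch that computes PRESS two ways: from a single fit, by using the well-known shortcut that the leave-one-out prediction error equals e_i / (1 - h_i), and by brute-force refitting. The choice of model (Weight regressed on Height and Age in Sashelp.Class) and all matrix names are my own assumptions for illustration.

```sas
proc iml;
use Sashelp.Class;
read all var {"Height" "Age"} into X0;
read all var {"Weight"} into y;
close;
n = nrow(X0);
X = j(n, 1, 1) || X0;              /* design matrix with intercept */

/* one fit: residuals and leverages from the full data */
XpXinv = inv(X`*X);
b = XpXinv * X`*y;
e = y - X*b;                       /* ordinary residuals */
h = vecdiag(X*XpXinv*X`);          /* leverage of each observation */
PRESS_fast = ssq(e / (1 - h));     /* sum of squared e_i/(1-h_i) */

/* n fits: delete each observation, refit, predict the deleted one */
PRESS_refit = 0;
do i = 1 to n;
   keep = setdif(1:n, i);          /* all rows except the i_th */
   Z = X[keep, ];
   bi = solve(Z`*Z, Z`*y[keep, ]);
   PRESS_refit = PRESS_refit + (y[i] - X[i, ]*bi)##2;
end;
print PRESS_fast PRESS_refit;
```

The two numbers should agree up to rounding error. The first computation forms and inverts the crossproducts matrix once; the second does so n times.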

A rank-1 update to the inverse of a matrix

Recall that you can use the "normal equations" to obtain the least squares estimate for the regression problem with design matrix X and observed responses Y. The normal equations are b = (X`X)^{-1}(X`Y), where X`X is known as the sum of squares and crossproducts (SSCP) matrix and b is the least squares estimate of the regression coefficients. For data sets with many observations (very large n), the process of reading the data and forming the SSCP is a relatively expensive part of fitting a regression model. Therefore, if you want the PRESS statistic, it is better to avoid rebuilding the SSCP matrix and computing its inverse n times. Fortunately, there is a beautiful result in linear algebra that relates the inverse of the full SSCP matrix to the inverse when a row of X is deleted. The result is known as the Sherman-Morrison formula for rank-1 updates.

The key insight is that one way to compute the SSCP matrix is as a sum of outer products of the rows of X. Therefore, if x_i is the i_th row of X, the SSCP matrix for data where x_i is excluded is equal to X`X - x_i`x_i. You have to invert this matrix to find the least squares estimates after excluding x_i.
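The outer-product view is easy to check numerically. The following sketch, which uses the numeric variables in Sashelp.Class as an assumed example, accumulates the outer products of the rows and compares the sum with X`X. It then confirms that subtracting the outer product of the first row gives the SSCP matrix of the data without that row. The names S, Z, maxDiff1, and maxDiff2 are hypothetical.

```sas
proc iml;
use Sashelp.Class;
read all var _NUM_ into X;
close;
n = nrow(X);  p = ncol(X);

/* SSCP as a sum of outer products of the rows */
S = j(p, p, 0);
do i = 1 to n;
   S = S + X[i, ]` * X[i, ];       /* p x p outer product of i_th row */
end;
maxDiff1 = max(abs(S - X`*X));

/* deleting row 1 subtracts its outer product from the SSCP */
Z = X[2:n, ];
maxDiff2 = max(abs(Z`*Z - (X`*X - X[1, ]`*X[1, ])));
print maxDiff1 maxDiff2;
```

Both differences should be zero up to rounding error.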

The Sherman-Morrison formula enables you to compute the inverse of X`X - x_i`x_i when you already know the inverse of X`X. For brevity, set A = X`X. The Sherman-Morrison formula for deleting a row vector x_i is
(A - x_i`x_i)^{-1} = A^{-1} + A^{-1} x_i`x_i A^{-1} / (1 - x_i A^{-1} x_i`)
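One way to convince yourself that the formula is correct is to multiply the downdated matrix by the claimed inverse and check that the product is the identity. The following sketch uses shorthand that is not in the article: x denotes the row vector for the deleted observation, x^T is its transpose, and h = x A^{-1} x^T is a scalar that is assumed to be different from 1.

```latex
\begin{aligned}
(A - x^\top x)\left( A^{-1} + \frac{A^{-1} x^\top x A^{-1}}{1-h} \right)
  &= I + \frac{x^\top x A^{-1}}{1-h} - x^\top x A^{-1}
       - \frac{x^\top \left( x A^{-1} x^\top \right) x A^{-1}}{1-h} \\
  &= I + x^\top x A^{-1} \left( \frac{1}{1-h} - 1 - \frac{h}{1-h} \right) \\
  &= I
\end{aligned}
```

The term in parentheses on the second line is (1 - (1-h) - h)/(1-h), which is zero, so only the identity matrix remains.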

Implement the Sherman-Morrison formula in SAS

The formula shows how to compute the inverse of the updated SSCP by using a matrix-vector multiplication and an outer product. Let's use a matrix language to demonstrate the update method. The following SAS/IML program reads in a small data set, forms the SSCP matrix (X`X), and computes its inverse:

proc iml;
use Sashelp.Class;   /* read data into design matrix X */
read all var _NUM_ into X[c=varNames];  
close;
XpX = X`*X;          /* form SSCP */
XpXinv = inv(XpX);   /* compute the inverse */

Suppose you want to compute a leave-one-out statistic such as PRESS. For each observation, you need to estimate the parameters that result if you delete that observation. For simplicity, let's just look at deleting the first row of the X matrix. The following program creates a new design matrix (Z) that excludes the row, forms the new SSCP matrix, and finds its inverse:

/* Inefficient: Manually delete the row from the X matrix 
   and recompute the inverse */
n = nrow(X);
Z = X[2:n, ];       /* delete first row */
ZpZ = Z`*Z;         /* reform the SSCP matrix */
ZpZinv = inv(ZpZ);  /* recompute the inverse */
print ZpZinv[c=varNames r=varNames L="Inverse of SSCP After Deleting First Row"];

The previous statements essentially repeat the entire least squares computation. To compute a leave-one-out statistic, you would perform a total of n similar computations.

In contrast, it is much cheaper to apply the Sherman-Morrison formula to update the inverse of the original SSCP. The following statements apply the Sherman-Morrison formula as it is written:

/* Alternative: Do not change X or recompute the inverse. 
   Use the Sherman-Morrison rank-1 update formula.
   https://en.wikipedia.org/wiki/Sherman–Morrison_formula */
r = X[1, ];          /* first row */
rpr = r`*r;          /* outer product */
/* apply Sherman-Morrison formula */
NewInv = XpXinv + XpXinv*rpr*XpXinv / (1 - r*XpXinv*r`);
print NewInv[c=varNames r=varNames L="Inverse from Sherman-Morrison Formula"];

These statements compute the new inverse by using the old inverse, an outer product, and a few matrix multiplications. Notice that the denominator of the Sherman-Morrison formula includes the expression r*(X`X)^{-1}*r`, which is the leverage statistic for the deleted row (here, the first row).
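The appearance of the leverage is not a coincidence. As a sketch (with notation that I am assuming here: e_i is the ordinary residual for observation i, h_i is its leverage, and b_(i) is the estimate without observation i), applying the updated inverse to the updated right-hand side of the normal equations leads to the standard closed-form expression for the change in the estimates:

```latex
h_i = x_i (X^\top X)^{-1} x_i^\top,
\qquad
b - b_{(i)} = \frac{(X^\top X)^{-1} x_i^\top \, e_i}{1 - h_i},
\qquad
e_i = y_i - x_i b
```

The leverage always lies between 0 and 1. The update breaks down only when h_i = 1, which is the case in which deleting the observation makes the crossproducts matrix singular.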

The INVUPDT function in SAS/IML

Because it is important to be able to update an inverse matrix quickly when an observation is deleted (or added!), the SAS/IML language supports the INVUPDT function, which implements the Sherman-Morrison formula. You merely specify the inverse matrix to update, the vector (as a column vector) to use for the rank-one update, and an optional scalar value, which is usually +1 if you are adding a new observation and -1 if you are deleting an observation. For example, the following statements are the easiest way to implement the Sherman-Morrison formula in SAS for a leave-one-out statistic:

NewInv2 = invupdt(XpXinv, r`, -1);
print NewInv2[c=varNames r=varNames L="Inverse from INVUPDT Function"];

The output is not displayed because the matrix NewInv2 is the same as the matrix NewInv in the previous section. The documentation includes additional examples.
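To show how the function might be used for an actual leave-one-out computation, the following sketch loops over the observations, downdates the inverse with INVUPDT, and solves for the parameter estimates that result from excluding each observation. The regression model (Weight on Height and Age in Sashelp.Class) and the names B_loo, Inv_i, and maxDiff are assumptions for illustration. The estimates for the first observation are compared with a brute-force refit.

```sas
proc iml;
use Sashelp.Class;
read all var {"Height" "Age"} into X0;
read all var {"Weight"} into y;
close;
n = nrow(X0);
X = j(n, 1, 1) || X0;                  /* design matrix with intercept */
XpXinv = inv(X`*X);                    /* invert the SSCP only once */
Xpy = X`*y;

B_loo = j(n, ncol(X), .);              /* row i = estimates without obs i */
do i = 1 to n;
   xi = X[i, ];
   Inv_i = invupdt(XpXinv, xi`, -1);   /* inverse of X`X - xi`xi */
   B_loo[i, ] = (Inv_i * (Xpy - xi`*y[i]))`;
end;

/* brute-force check: refit without the first observation */
Z = X[2:n, ];
b1 = solve(Z`*Z, Z`*y[2:n, ]);
maxDiff = max(abs(B_loo[1, ]` - b1));
print maxDiff;
```

Inside the loop, the only work is a rank-1 update and a matrix-vector product; the data are never re-read and no matrix is re-inverted.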

The general Sherman-Morrison-Woodbury formula

The Sherman-Morrison formula shows how to perform a rank-1 update of an inverse matrix. There is a more general formula, called the Sherman-Morrison-Woodbury formula, which enables you to update an inverse for any rank-k modification of the original matrix. The general formula (Golub and van Loan, p. 51 of 2nd ed. or p. 65 of 4th ed.) shows how to find the inverse of a rank-k modification to a nonsingular matrix, A, in terms of the inverse of A. The general formula is
(A + U V^T)^{-1} = A^{-1} - A^{-1} U (I + V^T A^{-1} U)^{-1} V^T A^{-1}
where U and V are p x k matrices and all inverses are assumed to exist. When k = 1, the matrices U and V become vectors and the k x k identity matrix becomes the scalar value 1. In the previous section, U equals -x_i^T and V equals x_i^T, where x_i^T is the column vector that was written as x_i` earlier.
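As a quick numerical illustration of the general formula, the following sketch deletes the first two rows of the data at once (k = 2) and compares the Woodbury result with a direct inverse. It uses the numeric variables in Sashelp.Class as an assumed example, and the matrix names are hypothetical.

```sas
proc iml;
use Sashelp.Class;
read all var _NUM_ into X;
close;
n = nrow(X);
Ainv = inv(X`*X);               /* inverse of the full SSCP matrix */

k = 2;
R = X[1:k, ];                   /* the k rows to delete (k x p) */
U = -R`;                        /* p x k */
V = R`;                         /* p x k, so A + U*V` = X`X - R`R */

/* Sherman-Morrison-Woodbury: only a k x k matrix is inverted */
M = I(k) + V`*Ainv*U;
NewInv = Ainv - Ainv*U*inv(M)*V`*Ainv;

/* direct computation for comparison */
Z = X[(k+1):n, ];
Direct = inv(Z`*Z);
maxDiff = max(abs(NewInv - Direct));
print maxDiff;
```

The difference should be zero up to rounding error. The appeal of the formula is that the only new inverse is of the small k x k matrix M, regardless of the number of observations.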

The Sherman-Morrison-Woodbury formula is one of my favorite results in linear algebra. It shows that a rank-k modification of a matrix results in a rank-k modification of its inverse. It is not only a beautiful theoretical result, but it has practical applications to leave-one-out statistics because you can use the formula to quickly compute the linear regression model that results by dropping an observation from the data. In this way, you can study the influence of each observation on the model fit (Cook's D, DFBETAS,...) and perform leave-one-out cross-validation techniques, such as the PRESS statistic.

The post Leave-one-out statistics and a formula to update a matrix inverse appeared first on The DO Loop.