June 26, 2019
 

SAS/STAT software contains a number of so-called HP procedures for training and evaluating predictive models. ("HP" stands for "high performance.") A popular HP procedure is HPLOGISTIC, which enables you to fit logistic models on Big Data. A goal of the HP procedures is to fit models quickly. Inferential statistics such as standard errors, hypothesis tests, and p-values are less important for Big Data because, if the sample is big enough, all effects are significant and all hypothesis tests are rejected! Accordingly, many of the HP procedures do not support the same statistical tests as their non-HP cousins (for example, PROC LOGISTIC).

A SAS programmer recently posted an interesting question on the SAS Support Community. She is using PROC HPLOGISTIC for variable selection on a large data set. She obtained a final model, but she also wanted to estimate the covariance matrix for the parameters. PROC HPLOGISTIC does not support that statistic, but PROC LOGISTIC does (the COVB option on the MODEL statement). She asked whether it is possible to get the covariance matrix for the final model from PROC LOGISTIC, given that PROC HPLOGISTIC has already fit the model. Furthermore, PROC LOGISTIC supports computing and graphing odds ratios, so is it possible to get those statistics, too?

It is an intriguing question. The answer is "yes," although PROC LOGISTIC still has to perform some work. The main idea is that you can tell PROC LOGISTIC to use the parameter estimates found by PROC HPLOGISTIC.

What portion of a logistic regression takes the most time?

The main computational burden in logistic regression is threefold:

  • Reading the data and levelizing the CLASS variables in order to form a design matrix. This is done once.
  • Forming the sum of squares and crossproducts matrix (SSCP, also called the X`X matrix). This is done once.
  • Maximizing the likelihood function. The parameter estimates require running a nonlinear optimization. During the optimization, you need to evaluate the likelihood function, its gradient, and its Hessian many times. This is an expensive operation, and it must be done for each iteration of the optimization method.

The amount of time required by each step depends on the number of observations, the number of effects in the model, the number of levels in the classification variables, and the number of computational threads available on your system. You can use the Timing table in the PROC HPLOGISTIC output to determine how much time is spent during each step for your data/model combination. For example, the following results are for a simulated data set that has 500,000 observations, 30 continuous variables, four CLASS variables (each with 3 levels), and a main-effects model. It is running on a desktop PC with four cores. The iteration history is not shown, but this model converged in six iterations, which is relatively fast.
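The downloadable program at the end of this article defines the macro variables &numCont and &numClass and simulates the data. For readers who want a self-contained example, the following DATA step is a minimal sketch (not the author's program) that creates data of the same size and structure; the coefficients and random-number seed are arbitrary assumptions:

%let numCont  = 30;      /* number of continuous variables */
%let numClass = 4;       /* number of CLASS variables      */
%let N        = 500000;  /* number of observations         */

data simLogi;
   call streaminit(12345);
   array x[&numCont];
   array c[&numClass];
   do obsNum = 1 to &N;
      eta = -1;                            /* linear predictor */
      do j = 1 to &numCont;
         x[j] = rand("Normal");
         eta + 0.1*x[j];
      end;
      do j = 1 to &numClass;
         c[j] = ceil(3*rand("Uniform"));   /* three levels per CLASS variable */
         eta + 0.2*(c[j]=1);
      end;
      y = rand("Bernoulli", logistic(eta));
      output;
   end;
   drop obsNum j eta;
run;

With the data in place, the following call to PROC HPLOGISTIC fits the model and requests the Timing table: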

ods select PerformanceInfo IterHistory Timing;
proc hplogistic data=simLogi OUTEST;
   class c1-c&numClass;
   model y(event='1') = x1-x&numCont c1-c&numClass;
   ods output ParameterEstimates=PE;
   performance details;
run;

For these data and model, the Timing table tells you that the HPLOGISTIC procedure fit the model in 3.68 seconds. Of that time, about 17% was spent reading the data, levelizing the CLASS variables, and accumulating the SSCP matrix. The bulk of the time was spent evaluating the loglikelihood function and its derivatives during the six iterations of the optimization process.

Reduce time in an optimization by improving the initial guess

If you want to improve the performance of the model fitting, about the only thing you can control is the number of iterations required to fit the model. Both HPLOGISTIC and PROC LOGISTIC support an INEST= option that enables you to provide an initial guess for the parameter estimates. If the guess is close to the final estimates, the optimization method will require fewer iterations to converge.

Here's the key idea: If you provide the parameter estimates from a previous run, then you don't need to run the optimization algorithm at all! You still need to read the data, form the SSCP matrix, and evaluate the loglikelihood function at the (final) parameter estimates. But you don't need any iterations of the optimization method because you know that the estimates are optimal.

How much time can you save by using this trick? Let's find out. Notice that the previous call used the ODS OUTPUT statement to output the final parameter estimates. (It also used the OUTEST option to add the ParmName variable to the output.) You can run PROC TRANSPOSE to convert the ParameterEstimates table into a data set that can be read by the INEST= option on PROC HPLOGISTIC or PROC LOGISTIC. For either procedure, set the option MAXITER=0 to bypass the optimization process, as follows:

/* create INEST= data set from ParameterEstimates */
proc transpose data=PE out=inest(type=EST) label=_TYPE_;
   label Estimate=PARMS;
   var Estimate;
   id ParmName;
run;
 
ods select PerformanceInfo IterHistory Timing;
proc hplogistic data=simLogi INEST=inest MAXITER=0;
   class c1-c&numClass;
   model y(event='1') = x1-x&numCont c1-c&numClass;
   performance details;
run;

The time required to read the data, levelize, and form the SSCP is unchanged. However, the time spent evaluating the loglikelihood and its derivatives is about 1/(1+NumIters) of the time from the first run, where NumIters is the number of optimization iterations (6, for these data). The second call to the HPLOGISTIC procedure still has to evaluate the loglikelihood function to form statistics like the AIC and BIC. It also needs to evaluate the Hessian to compute standard errors of the estimates. But the total time is greatly decreased by skipping the optimization step.

Use PROC HPLOGISTIC estimates to jump-start PROC LOGISTIC

You can use the same trick with PROC LOGISTIC. You can use the INEST= and MAXITER=0 options to greatly reduce the total time for the procedure. At the same time, you can request statistics such as COVB and ODDSRATIO that are not available in PROC HPLOGISTIC, as shown in the following example:

proc logistic data=simLogi INEST=inest 
   plots(only MAXPOINTS=NONE)=oddsratio(range=clip);
   class c1-c&numClass;
   model y(event='1') = x1-x&numCont c1-c&numClass / MAXITER=0 COVB;
   oddsratio c1;
run;

The PROC LOGISTIC step takes about 4.5 seconds. It produces odds ratios and plots for the model effects and displays the covariance matrix of the betas (COVB). By using the parameter estimates that were obtained by PROC HPLOGISTIC, it was able to avoid the expensive optimization iterations.

You can also use the STORE statement in PROC LOGISTIC to save the model to an item store. You can then use PROC PLM to create additional graphs and to run additional post-fit analyses.
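For example, the following statements are a minimal sketch (not from the original article) that refits the model with MAXITER=0, saves it to an item store, and then uses PROC PLM for post-fit output. The item-store name work.LogiModel is an arbitrary choice:

proc logistic data=simLogi INEST=inest;
   class c1-c&numClass;
   model y(event='1') = x1-x&numCont c1-c&numClass / MAXITER=0;
   store work.LogiModel;          /* save the fitted model to an item store */
run;

proc plm restore=work.LogiModel;
   show parameters;               /* display the stored parameter estimates */
   effectplot fit(x=x1);          /* example post-fit graph */
run;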

In summary, PROC LOGISTIC can compute statistics and hypothesis tests that are not available in PROC HPLOGISTIC. It is possible to fit a model by using PROC HPLOGISTIC and then use the INEST= and MAXITER=0 options to pass the parameter estimates to PROC LOGISTIC. This enables PROC LOGISTIC to skip the optimization iterations, which saves substantial computational time.

You can download the SAS program that creates the tables and graphs in this article. The program contains a DATA step that simulates logistic data of any size.

The post Jump-start PROC LOGISTIC by using parameter estimates from PROC HPLOGISTIC appeared first on The DO Loop.

June 26, 2019
 

In his article How to use CASL to develop and work with user-defined CAS actions, Brian Kinnebrew defines CASL as "a language specification used by the SAS client to interact with and provide easy access to Cloud Analytic Services (CAS). CASL is a statement-based scripting language with many uses and strengths." I can't come up with a better definition, so if you'd like to learn more about the basics of CASL, I encourage you to read Brian's post.

In a SAS Stored Process (or any traditional SAS program), the code typically contains multiple DATA steps and procedures, with a good dose of macros.

Over the last couple of years, the focus of my projects has been the use of CAS actions with no traditional procedures in the mix. Most of these projects involved web applications calling multiple CAS actions. My initial approach was to make multiple HTTP calls - one per action. This could get tedious.

Then I met my action hero: sccasl.runCasl.

My favorite SAS CASL action

This action executes a CASL "script" on the CAS server analogous to executing a SAS Stored Process. Running a CASL program with a mix of CAS actions and CASL statements on the CAS server has these benefits:

  1. It reduces the number of HTTP calls to the server
  2. The client-side code is much easier to reason about
  3. The returned values can be a dictionary that is suited for further consumption by the client, simplifying the client code

My personal name for these scripts is "CAS stored process".

Where art thou Macro?

In many applications, user input is passed to the code running on the server. In a SAS Stored Process, macros pass the parameters. CASL has no macros. My initial approaches to passing parameters to CASL programs were:

  1. Generate the final CASL program in JavaScript with the user input values inserted into the code. Sample code is available here.
    • Drawback: Debugging the code in SAS Studio requires a cut-and-paste of the generated code into SAS Studio.
  2. Load the data into a CAS table using table.upload action for programs needing an input table. Sample code is available here.
    • Drawback: This requires an additional http call to the server.

In developing the GraphQL approach to writing applications - GraphQL and SAS Viya applications - a good match - I addressed the two drawbacks listed above by creating two functions and putting them in my utility belt.

Superfriend functions

  • jsonToDict.js - generates a string containing the CASL dictionary version of a JavaScript object
  • argsToTable - creates a CAS table from a dictionary

The remainder of this article discusses these two functions. To demonstrate the functions' usage, I use code listings from the scoring example, covered in the GraphQL example.

The jsonToDict function

This function produces a string containing a CASL dictionary that is suitable for inclusion in CASL code.

Function definition

jsonToDict(obj, name) ⇒ string

Returns: string - the string containing the CASL dictionary

  • obj (object) - the JavaScript object of interest
  • name (string) - the name to assign to the dictionary

 

Example function code

The code below outlines the jsonToDict function usage.

let obj = {x:1, b:2, c:['a', 'b']};
let r = jsonToDict(obj, '_appEnv_');

The result is:

r = `_appEnv_ = {x=1, b=2, c={"a", "b"}}`;

The following lists the input parameters passed to the CASL code for scoring.

let input = {
        JOB    : 'J1',
        CLAGE  : 100, 
        CLNO   : 20, 
        DEBTINC: 20, 
        DELINQ : 2, 
        DEROG  : 0, 
        MORTDUE: 4000, 
        NINQ   : 1,
        YOJ    : 10,
        LOAN: 1000,
        ASSET: 100000
    };

Below is the JavaScript code and the result.

let _args_ = jsonToDict(input, '_args_');

results in

_args_ = `_args_ = {  JOB= "J1" ,CLAGE=100  ,CLNO=20  ,DEBTINC=20  ,DELINQ=2  ,DEROG=0  ,
MORTDUE=4000  ,NINQ=1  ,YOJ=10  ,LOAN=1000  ,ASSET=100000  };`;

To allow the use of different versions of the model, the name of the scoring model is passed in as a parameter. The JavaScript for the model follows.

let env = {
     astore: {
            caslib: 'Public',
            name  : 'GRADIENT_BOOSTING___BAD_2'
        }
};
let _appEnv_ = jsonToDict(env, '_appEnv_');

resulting in:

_appEnv_ = `_appEnv_ = { astore = { caslib="Public", name="GRADIENT_BOOSTING___BAD_2"}};`;

Next, I prepend the strings to the CASL program in the client code with the following code snippets.

let code = _args_ + _appEnv_ + `the CASL code shown below`;
loadactionset "astore";
 
/* convert arguments to a cas table */
argsToTable(_args_, 'casuser', 'INPUTDATA' );
 
/* score */
action astore.score r=rc/
    table  = { caslib= 'casuser', name = 'INPUTDATA' } 
    rstore = { caslib= _appEnv_.astore.caslib,  name=_appEnv_.astore.name }
    casout = { caslib = 'casuser', name = 'OUTPUTDATA', replace = TRUE};
 
/* fetch results */
action table.fetch r = result /
    table = { caslib = 'casuser', name = 'OUTPUTDATA' };
 
/* extract the score and send it as a dictionary */
score = result.Fetch[1].P_BAD;
send_response({score = score});

Now the CASL program can access the incoming information using the two dictionaries _args_ and _appEnv_. Note: As a personal choice, I use a convention of _args_ for user input and _appEnv_ for application-specific information. I use the restaf application framework to make the HTTP calls as shown below.

let payload = {
        action: 'sccasl.runCasl',
        data  : { code: code}
    }
 let result = await store.runAction(session, payload); 
let score = result.items('results', 'score');

CASL coding - easier than saying Kilp-ill-skim

Notice the absence of string substitutions that look strange. Just simple, straightforward coding. Easier than saying your name backwards, forcing you back to the fifth dimension.

The argsToTable.casl function

The sample code above used the function argsToTable. As you may guess, it converts the _args_ dictionary into the CAS table that is used in the scoring action. argsToTable is the CASL function that handles this task.

Function definition

argsToTable(input, caslib, name) ⇒ loads a dictionary into a CAS table

  • input (dictionary) - the data to load
  • caslib (string) - caslib of the output table
  • name (string) - name of the output table

 

The relevant CASL code from Step 1 is reproduced here:

argsToTable(_args_, 'casuser', 'INPUTDATA' );

The argsToTable function is either stored on a server or prepended to the CASL code sent to the runCasl action. This function removes the need to make an additional HTTP call to load the data into the CAS table.

Returning data from CAS - the send_response function

The ultimate sidekick function

Any good super hero has a sidekick. The send_response function in CASL is very versatile - it allows you to return data in the form the application needs, and it allows more than one result. In many programs I return data in a form easily consumed by the client code.

For example, if you want to return just the rows of a table, you can do the following:

function resultsToDict(r);
   casResults = {};
   i = 1;
   do row over r;
      casResults[i] = row;
      i = i + 1;
   end;
   return casResults;
end;
 
/* and use it as follows: */
 
action table.fetch r = result /
    table = {  caslib = 'casuser',  name = 'mydata' };
/* extract the data and return it as a dictionary */
casResults = resultsToDict(result.Fetch);
send_response({casResults = casResults});

Finally

Using a combination of CASL, a couple of utility functions, and the runCasl action, you can develop some very efficient programs with minimal traffic between your client and the server. If you run multiple actions in sequence, you should consider grouping them into a CASL program and executing them on the CAS server using the runCasl action.

Next

In my next article, which I hope to finish in a flash, I will discuss using the runCasl action to create a browser for CAS tables with support for pagination.

All comments are welcome. Please feel free to clone the code, make it better.

Cheers...
Deva

Let runCasl be your BFF and favorite action hero

"CAS Stored Process" with my Favorite Action Hero runCasl was published on SAS Users.

June 25, 2019
 

Everyone is talking about artificial intelligence. Unfortunately, a lot of what you hear about AI in the movies and on the TV is sensationalized for entertainment. Indeed, AI is overhyped. But AI is also real and powerful. Consider this: engineers worked for years on hand-crafted models for object detection, facial [...]

Meet 5 data pioneers developing AI solutions for the real world was published on SAS Voices by Oliver Schabenberger

June 24, 2019
 

When fitting a least squares regression model to data, it is often useful to create diagnostic plots of the residuals versus the explanatory variables. If the model fits the data well, the plots of the residuals should not display any patterns. Systematic patterns can indicate that you need to include additional explanatory effects to model the data. Sometimes it is difficult to spot patterns in a seemingly random cloud of points, so some analysts like to add a scatter plot smoother to the residual plots. You can use the SMOOTH suboption of the PLOTS=RESIDUALS option in many SAS regression procedures to generate a panel of residual plots that contain loess smoothers. For SAS procedures that do not support the PLOTS=RESIDUALS option, you can use PROC SGPLOT to manually create a residual plot with a smoother.

Residual plots with loess smoothers

Many SAS linear regression procedures such as PROC REG and PROC GLM support the PLOTS=RESIDUALS(SMOOTH) option on the PROC statement. For example, the following call to PROC GLM automatically creates a panel of scatter plots where the residuals are plotted against each regressor. The model is a two-variable regression of the MPG_City variable in the Sashelp.Cars data.

/* residual plots with loess smoother */
ods graphics on;
proc glm data=Sashelp.Cars plots(only) = Residuals(smooth);
   where Type in ('SUV', 'Truck');
   model MPG_City = EngineSize Weight;
run; quit;

The loess smoothers can sometimes reveal patterns in the residuals that would not otherwise be perceived. In this case, it looks like there is a quadratic pattern to the residuals-versus-EngineSize graph (and perhaps for the Weight variable as well). This indicates that you might need to include a quadratic effect in the model. Because the EngineSize and Weight variables are highly correlated (ρ = 0.81), the following statements add only a quadratic effect for EngineSize:

proc glm data=Sashelp.Cars plots(only) = Residuals(smooth);
   where Type in ('SUV', 'Truck');
   model MPG_City = EngineSize Weight
                    EngineSize*EngineSize ;
quit;

After adding the quadratic effect, the residual plots do not reveal any obvious systematic trends. Also, the residual plot for Weight no longer shows any quadratic pattern.

How to use PROC SGPLOT to create a residual plot with a smoother

If you use a SAS procedure that does not support the PLOTS=RESIDUALS(SMOOTH) option, you can output the residual values to a SAS data set and use PROC SGPLOT to create the residual plots. Even when a procedure DOES support the PLOTS=RESIDUALS(SMOOTH) option, you might want to customize the plot by adding legends, by changing attributes of the markers or curve, or by specifying a value for the smoothing parameter.

An example is shown below. If you use the same model for MPG_City, but use all observations in the data set, the residual plot for EngineSize looks very strange. For these data, the smoothing parameter for the loess curve is very small and therefore the loess curve overfits the residuals:

proc glm data=Sashelp.Cars plots(only) = Residuals(smooth);
   model MPG_City = EngineSize Weight;
   output out=RegOut predicted=Pred residual=Residual;
run; quit;

Yuck! The loess curve for the plot on the left clearly overfits the residuals-versus-EngineSize data! Unfortunately, you cannot change the smoothing parameter from the PROC GLM syntax. However, you can change the default smoothing parameter in PROC SGPLOT and you can make other modifications to the plot as well. Notice in the previous call to PROC GLM that the OUTPUT statement creates a data set named RegOut that contains the residual values and the original variables. Therefore, you can create a residual plot and add a loess smoother by using PROC SGPLOT, as follows:

ods graphics / attrpriority=NONE;
title "Residuals for Model";
proc sgplot data=RegOut ;
   scatter x=EngineSize y=Residual / group=Origin;
   loess x=EngineSize y=Residual / nomarkers smooth=0.5;
   refline 0 / axis=y;
   xaxis grid; yaxis grid;
run;

The smoothing parameter was manually set to 0.5, but you can use PROC LOESS if you want to choose a smoothing parameter that optimizes some information criterion such as the AICC statistic. Notice that you can use additional SGPLOT statements to add a reference grid and to change marker attributes. If you prefer, you could add a different kind of smoother such as a penalized B-spline by using the PBSPLINE statement.
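For example, the following call is a sketch that replaces the LOESS statement with a PBSPLINE statement; the other statements are unchanged from the previous example:

title "Residuals with a Penalized B-Spline Smoother";
proc sgplot data=RegOut;
   scatter x=EngineSize y=Residual / group=Origin;
   pbspline x=EngineSize y=Residual / nomarkers;   /* penalized B-spline smoother */
   refline 0 / axis=y;
   xaxis grid; yaxis grid;
run;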

You might wonder why the smoothing parameter in the residual plot for EngineSize is so small. The parameter is chosen to optimize a criterion such as the AICC statistic, so why does it overfit the data? An example in the PROC LOESS documentation provides an explanation. The chosen value for the smoothing parameter is one that corresponds to a local minimum of an objective function that involves the AICC statistic. Unfortunately, a set of data can have multiple local minima, and this is the case for the residuals of the EngineSize variable. When the smoothing parameter is 0.534, the AICC criterion reaches a local minimum. However, there are smaller values of the smoothing parameter for which the AICC criterion is even smaller. The minimum value of the AICC occurs when the smoothing parameter is 0.015, which leads to the "jagged" loess curve that is seen in the panel of residual plots shown earlier in this section. If you want to see this phenomenon yourself, run the following PROC LOESS code and look at the criterion plot.

ods select CriterionPlot SmoothingCriterion FitPlot;
proc loess data=RegOut;
   model Residual = EngineSize / select=AICC(global) ;
run;

Because a data set can be smoothed at multiple scales, the "optimal" smoothing parameter that is chosen automatically by the PLOTS=RESIDUALS(SMOOTH) option might not enable you to see the general trend of the residuals. If you experience this phenomenon, output the residuals and use PROC SGPLOT or PROC LOESS to compute a more useful smoother.

In summary, SAS provides the PLOTS=RESIDUALS(SMOOTH) option to automatically create residual-versus-regressor plots. Although this panel usually provides a useful indication of patterns in the residuals, you can also output the residuals to a data set and use PROC SGPLOT or PROC LOESS to create a customized residual plot.

The post Add loess smoothers to residual plots appeared first on The DO Loop.

June 21, 2019
 

For every project in SAS®, the first step is almost always making your data available. This blog shows you how to load three of the most common input data types—a data set, a text file, and a Microsoft Excel file—into SAS® Cloud Analytic Services (CAS) tables.

The three methods that I show here are the three easiest ways to load each data type into CAS. Multiple tools can load data into CAS, but I am showing the tools that I consider the easiest to use and that are probably the most familiar to SAS programmers.

You need to place your data in a location that can be accessed by the programming environment that is used to access CAS. The most common programming environment that accesses CAS is SAS® Studio. Input data files that are used in CAS are going to be very large. You will need to use an SFTP tool to move your data from your PC to a directory that can be accessed by your SAS Studio session. (Check with your system administrator to see what the preferred tool is at your site.)

After your data is in a location that can be accessed by the programming environment, you need to start a CAS session. This is done with the CAS statement; here is the syntax:

cas session-name <option(s)>;

The options that you specify depend on how your system administrator configured your environment. For example, I asked my system administrator to set it up so that the only thing I need to do is issue the following statement:

cas;

That statement then creates a CAS session with the default name of CASAUTO, with an active caslib of CASUSER.
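If your site requires more than the defaults, you can name the session and supply session options. The following statement is only a sketch; the session name and option values are arbitrary examples, so use the values that your system administrator recommends:

cas mySession sessopts=(caslib=casuser timeout=1800 locale="en_US");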

After you establish your CAS session, you can start loading data.

Load a SAS data set

The easiest way to load SAS data into CAS is to use a DATA step. The basic syntax is the same as it is when you are creating a SAS data set. The key difference is that the libref that is listed in the DATA step must point to a caslib.

The following example accesses SASHELP.CARS and then creates the table CARS in the CASUSER caslib.

cas;                      /* log on to the CAS session */
caslib _all_ assign;      /* create a libref for each active caslib */

data casuser.cars;        /* create the table in the caslib */
   set sashelp.cars;
run;

The one thing to note about this code is that this DATA step is running in SAS and not CAS. A DATA step runs in CAS only if the input and output librefs are both using the CAS engine and only if it uses language elements that CAS supports. In this example, the SASHELP libref was not created with the CAS engine, so the step must run in SAS.

There are no calculations in this step, so there is no effect on performance.
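For contrast, the following sketch shows a DATA step that does run in CAS because both librefs point to caslibs. The output table name BIGCARS is a hypothetical example:

data casuser.bigcars;      /* both librefs use the CAS engine, so this step runs in CAS */
   set casuser.cars;
   where MPG_City > 20;    /* simple subsetting that the CAS DATA step supports */
run;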

When you load a data set into a CAS table, one task that you might want to perform is to promote the table. The following DATA step shows how to promote a table as you create it:

data casuser.cars(promote=yes);
   set sashelp.cars;
run;

Promoting a table gives it a global scope. You can then access the table in multiple sessions, and the table is also available in SAS® Visual Analytics. The PROMOTE= data set option can be used with all three of the examples in this blog.
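If you have already loaded a table with session scope, you do not have to reload it to promote it. The following PROC CASUTIL step is a sketch (not from the original article) that promotes the existing CARS table:

proc casutil;
   promote casdata="cars" incaslib="casuser" outcaslib="casuser";   /* give the table global scope */
quit;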

Load a delimited text file

The easiest way to load a delimited text file and store it as a CAS table is to use another familiar SAS step, the IMPORT procedure. The syntax is going to be basically the same as it is in SAS. Again, the key difference is that you need to point to a caslib in the OUT= option of the PROC IMPORT statement.

If you are running SAS® Studio 5.1, a common location for a text file that needs to be loaded into CAS is the SAS Content folder. This folder is a predefined repository for text files that you need to access from SAS Studio.

In order to access the files from SAS Content, you need to use the FILESRVC access method with the FILENAME statement. This method enables you to store and retrieve content using the SAS® Viya® Files service. Here is the basic syntax:

filename fileref filesrvc folderpath='path'
            filename='name';

For more information about this access method, see SAS 9.4 Global Statements Reference.

In the following example, PROC IMPORT and the FILESRVC access method are used to load the class.csv file from the SAS Content folder. The resulting ALLCLASS table is written to the CASUSER caslib.

filename myfile filesrvc folderpath='/Users/saskir'
            filename='class.csv';
   proc import datafile=myfile out=casuser.allclass(promote=yes)
        dbms=csv;
   run;

Load an Excel file

This section shows you how to load an Excel file into a CAS table by using PROC IMPORT. Loading an Excel file using PROC IMPORT requires that you have SAS/ACCESS® Interface to PC Files. If you are unsure whether you have this product, you can write all of your licensed products to the log by using the SETINIT procedure:

proc setinit;
   run;

You should see the following in the log when the product is licensed:

---SAS/ACCESS Interface to PC Files

After you confirm that you have this product, you can adapt the following PROC IMPORT code. This example loads the ReportTest.xlsx file and stores it in a table named ReportTest in the CASUSER caslib.

cas;
   caslib _all_ assign;
 
   proc import datafile='/viyashare/ReportTest.xlsx'
        out=casuser.ReportTest dbms=xlsx;
   run;

There are other methods

The purpose of this blog was to show you the easiest ways to load a SAS data set, text file, and Excel file into CAS. Although there are multiple ways to accomplish these tasks, both programmatically and interactively, these methods are the easiest and most straightforward ways to get data into CAS.

Learn the three easiest ways to load data into CAS tables was published on SAS Users.

June 20, 2019
 

Move over video games and sports. Make room for escape rooms. This burgeoning form of entertainment found its roots in the video gaming movement. Escape rooms tap into a player's drive to reach the next level, solve a puzzle and win. Escape rooms present a physical game that traps you until your brain (and teamwork) help you escape. Does that sound exhilarating or terrifying? Maybe a bit of both.

SAS adapted this concept and built its own Data Science Escape Rooms at SAS Global Forum 2019. More than 800 customers used SAS software to solve a series of problems in one of six rooms, spanning three different themes: soccer, wildlife and cyberattack.

I had the opportunity to support these rooms by assisting with registration and check-in. I also "covered" the experience through videos that shared a behind-the-scenes look, player reactions and backstories on how the rooms were developed.

I even got to chat with one team that completed all three rooms. The Advanced Claims Analysis Team with the United States Office of Personnel Management, aka Team "Auditgators," included Kevin Sikora, Julie Zoeller, Richard Allen and Lauren Goob. They spoke about how cool it was to play with software they're not used to in pseudo real-world scenarios.

What the escape rooms were really like

Let me give you a lay of the land. The rooms included four stations each with a computer screen and a mouse (no keyboards). At each station, players used products like SAS Visual Analytics and SAS Visual Investigator to solve a challenge, working against the clock to escape the room within 20 minutes. Talk about pressure.

A team studies a problem at one of four stations in the Wildlife-themed escape room.

"You definitely felt a sense of urgency in there," recalls Sikora. "We really got into the wildlife room – it's where we received the highest score. We had the chance to do some great team-building too. We're more cohesive having shared this experience."

The Auditgators worked separately to solve problems and then collaborated to piece together the bigger puzzle. When they got stuck, they asked for help from the "gamemasters," who were literally behind the walls of the rooms. Gamemasters doled out clues to nudge teams along.

This may seem like a fun diversion, but how in the world can you apply this experience to your daily work? "Well we certainly realized the value in using more graphs, to click and filter to find answers we need," said Zoeller. "It was enlightening to see how we could show the data, and it gave us a better awareness of the end goal, the big picture."

Allen and Goob have been to SAS Global Forums in the past and said that the Data Science Escape Rooms gave them the best opportunity to interact with one another than anything else at the event. "It breaks up sitting in rooms and listening, breaks up the monotony. It gave us a chance to do something together as a team."

More reactions and backstories

Want a taste of the action at the event? Here, SAS' Lisa Dodson, Data Science Escape Room Gamemaster, couldn't hide her excitement to give clues and cheer on teams.

 
 
Michael Gibbs with the University of Arkansas described the intensity and pressure to get the answer and move on while escaping.

 
 
SAS partnered with SciSports to build the soccer-themed escape room. SciSports Founder and CIO Giels Brouwer and Data Scientist Mick Bosma explained how the idea came about and plans to take the room on the road.

 
 
While most participants enjoyed the challenge, not everyone "escaped" successfully. The most successful teams had solid teamwork and good communication. Sound like a lesson for "real life"? SAS' Alfredo Iglesias Rey reports that 18 percent of the teams completed all four challenges correctly. He also marveled at how players were able to apply advanced machine learning techniques with just a mouse.

 
 
Dying to check out a Data Science Escape Room yourself? You'll find them at several local SAS Forums this year, and plans are underway to offer them at other venues. But you don't have to wait for an escape room to get a feel for SAS software. Try SAS online right now, for free.

A playful way to get your hands on SAS: the Data Science Escape Rooms was published on SAS Users.

June 19, 2019
 

A previous article describes the DFBETAS statistics for detecting influential observations, where "influential" means that if you delete the observation and refit the model, the estimates for the regression coefficients change substantially. Of course, there are other statistics that you could use to measure influence. Two popular ones are the DFFITS and Cook's distance, which is also known as Cook's D statistic. Both statistics measure the change in predicted values that occurs when you delete an observation and refit the model. This article describes the DFFITS and Cook's D statistics and shows how to compute and graph them in SAS.

DFFITS: How the predicted value changes if an observation is excluded

If you exclude an observation from a model and refit, the predicted values will change. The DFFITS statistic is a measure of how the predicted value at the i_th observation changes when the i_th observation is deleted. High-leverage points tend to pull the regression surface towards the response at that point, so the change in the predicted value at that point is a good indication of how influential the observation is. So that the DFFITS values are independent of the scale of the data, the change in predicted values is scaled by dividing by the standard error of the predicted value at that point. The exact formula is given in the documentation for PROC REG. The book Regression Diagnostics by Belsley, Kuh, and Welsch (1980) suggests that an observation is influential if the magnitude of its DFFITS value exceeds 2*sqrt(p/n), where p is the number of effects in the model and n is the sample size.

PROC REG provides three ways to generate the DFFITS statistics for each observation:

  • You can create a graph of the DFFITS statistics by using the PLOTS=DFFITS option.
  • You can also display a table of the DFFITS (and other influence statistics) by using the INFLUENCE option in the MODEL statement.
  • You can write the DFFITS statistics to a data set by using the DFFITS= option in the OUTPUT statement.

The following DATA step extracts a subset of n = 84 vehicles from the Sashelp.Cars data and creates a short ID variable for labeling observations. A subsequent call to PROC SORT sorts the data by the response variable, MPG_City. The data are sorted because the DFFITS statistic is graphed against the observation number, which is an arbitrary quantity. By sorting the data, you know that small observation numbers correspond to low values of the response and so forth. If you have a short ID variable, you can label the influential observations by using the LABEL suboption, as follows:

/* Create sample data */
data cars;
set sashelp.cars;
where Type in ('SUV', 'Truck');
/* make short ID label from Make and Model values */
length IDMakeMod $20;
IDMakeMod = cats(substr(Make,1,4), ":", substr(Model,1,5));
run;
 
/* Optional but helpful: Sort by response variable */
proc sort data=cars;
   by MPG_City;
run;
 
proc reg data=Cars plots(only) = DFFITS(label); 
   model MPG_City = EngineSize HorsePower Weight;
   id IDMakeMod;
run; quit;

The DFFITS graph shows that three observations have a large positive DFFITS value. The observations are the Ford Excursion, the Ford Ranger, and the Mazda B2300. For these observations, the predicted value (at the observation) is higher with the observation included in the model than if it were excluded. Thus, these observations "pull the regression up." There are four observations that have large negative DFFITS, which means that these observations "pull the regression down." They include the Land Rover Discovery and the Volvo XC90.

Cook's D: A distance measure for the change in regression estimates

When you estimate a vector of regression coefficients, there is uncertainty. The confidence region for the parameter estimates is an ellipsoid in k-dimensional space, where k is the number of effects that you are estimating (including the intercept). Cook (1977) defines a distance by which the estimates move within the confidence ellipsoid when the i_th point is deleted. Equivalently, Cook shows that the statistic is proportional to the squared studentized residual for the i_th observation. The documentation for PROC REG provides a formula in terms of the studentized residuals.

By default, PROC REG creates a plot of Cook's D statistic as part of the panel of diagnostic plots. (Cook's D is in the second row and third column of the panel.) You can create a larger stand-alone plot by using the PLOTS=COOKSD option. Optionally, you can label the influential points (those whose Cook's D statistic exceeds 4/n) by using the LABEL suboption, as shown below:

/* create multiple plots and label influential points */
proc reg data=Cars plots(only) = (CooksD(label) DFFits(label));   
   model MPG_City = EngineSize HorsePower Weight;
   id IDMakeMod;
   output out=RegOut pred=Pred rstudent=RStudent dffits=DFFits cookd=CooksD; /* optional: output statistics */
run; quit;

In many ways, the plot of Cook's D looks similar to a plot of the squared DFFITS statistics. Both measure a change in the predicted value at the i_th observation when the i_th observation is excluded from the analysis. The formula for Cook's D statistic squares a residual-like quantity, so it does not show the direction of the change, whereas the DFFITS statistics do show the direction. Otherwise, the observations that are "very influential" are often the same for both statistics, as seen in this example.
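Because the earlier call to PROC REG wrote the statistics to the RegOut data set, you can also identify the influential observations programmatically. The following DATA step is a rough sketch that applies the cutoffs mentioned in this article; the values p = 4 (the intercept plus three regressors) and n = 84 come from this example:

%let p = 4;    /* number of effects, including the intercept */
%let n = 84;   /* sample size */

data Influential;
   set RegOut;
   if abs(DFFits) > 2*sqrt(&p/&n) or CooksD > 4/&n;   /* keep only the influential observations */
run;

proc print data=Influential;
   var IDMakeMod MPG_City Pred DFFits CooksD;
run;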

The post Influential observations in a linear regression model: The DFFITS and Cook's D statistics appeared first on The DO Loop.

June 19, 2019
 

As a data scientist, did you ever come to the point where you felt the need for an evolved analytics platform bringing together the disparate skills of open source and commercial software? A system that enables advanced analytic capabilities? This is now possible and easy to implement. With many deployment possibilities, SAS Viya allows you to choose the data storage location, where compute happens, and the deployment methods for models.

Let's say you want to expand your model development process with SAS Viya analytical capabilities, but you don't want to wait to get such an environment up and running. Unfortunately, you have no infrastructure, nor the experience to install SAS Viya. Taking the traditional route, you would face:

  • Protracted hardware procurement and provisioning
  • Deployment planning and coordination with IT
  • Effort and time required for software installation/configuration

This solution may be the right path for many organizations, but I think we all recognize this: the traditional approach could take days, weeks and yes sometimes months.

What if you could get up and running with a full SAS Viya platform in two hours? If you have some affinity for cloud-based solutions, SAS offers you the AWS SAS Viya Cloud Rapid Deployment tool. SAS released this AWS Quick Start as a rapid deployment architecture for SAS Viya on AWS. Deployable products include SAS Visual Data Mining and Machine Learning, SAS Visual Statistics and SAS Visual Analytics.

The goal of this article is to brief you on how I launched such an AWS SAS Viya Quick Start. I strongly advise you to watch this related video by my colleague Erwan Granger. Much of what is covered here appears in Erwan's video. The recording predates the SAS Viya 3.4 release, but the main concepts are still the same.

What you will need

The following is a list of items you need to complete this task.

  • AWS Account with appropriate creation privileges
  • A valid SAS Viya License; this means you will need a SAS Software Order Confirmation e-mail
  • Optional: deploy with your own DNS name and SSL certificate. In that case, you need to register a domain managed by Amazon Route 53. For instructions on registering the domain, see the Route 53 documentation. You can request and register a certificate with AWS Certificate Manager.

Furthermore, it’s good to know this Quick Start provides two deployment options. You can deploy SAS Viya into a new Virtual Private Cloud (VPC) or into an existing VPC. The first option builds a new AWS environment consisting of the VPC, private and public subnets, NAT gateways, security groups, Ansible controllers, and other infrastructure components, and then deploys SAS Viya into this new VPC. The second option provisions SAS Viya in your existing AWS infrastructure. I decided to go for the first option.

What you will build

Here's an architectural overview of what we will build:

SAS Viya architecture on the AWS Cloud

You can find exactly the same architecture on the SAS Viya AWS Quick Start landing page.

Configure the build

We'll be following the build process outlined in the Quick Start guide. On the landing page, next to the "What you'll build" tab, you can click on "How to deploy". From there, launch the "Deploy into a new VPC" wizard.

Deploy into a new VPC wizard

Prerequisite prep

Make sure you sign in with your AWS account and choose the region where you want to deploy. On that first screen you can leave the default Amazon S3 template URL. That template is the basis for the AWS CloudFormation stack we are launching. CloudFormation is a tool from AWS that allows you to spin up resources in the right order, and the template is the blueprint document for your stack. By keeping the default template, we will build exactly the architecture displayed above.

Pre-req prep template

Now click "Next" and move to the page where we can specify more details and the required CloudFormation parameters.

CloudFormation parameters

The first parameter is the SAS Viya Software Order file, which is the Amazon S3 location of the Software Order e-mail attachment.

SAS Viya install package location

In the Administration section, you provide parameters to configure your AWS architecture. That way, you control access, instance types, and whether you will use a SAS Viya mirror repository.

CloudFormation administration parameters

Administration parameter definitions:

  • The name of an Amazon EC2 key pair, so you can access the Ansible controller
  • The Amazon Availability zone for the public and private subnet
  • Allowable IP range for HTTP traffic; must be a valid IP CIDR range
  • Allowable IP Range for SSH traffic to the Ansible controller; must be a valid IP CIDR range
  • SAS Administrator password
  • Password for Default (sasuser) user
  • Amazon EC2 Instance type for CAS Compute VM
  • Amazon EC2 Instance type for SAS Viya Services VM
  • (Optional) Location of SAS Viya Deployment Repository data
  • (Optional) Operator Email

If you want to work with custom DNS names and SSL, you will need to provide the next three parameters as well.

DNS and SSL configuration (optional)

DNS and SSL parameters:

You may accept the defaults on the remaining parameters.

Optional parameters

After clicking "Next", another set of optional parameters is available. I mostly accept the default parameters provided. The lone exception is Rollback on failure.

Optional administration parameters

Based on what I've learned from Erwan's video, the safer choice is "No" on the Rollback option. This way, if the deployment process encounters issues, the log will identify in which step the error occurred. Of course, this means you are responsible for manually deleting the AWS resources that are no longer necessary. The easiest way to do this is by deleting the CloudFormation stacks afterward.

Kick off the build

To conclude the deployment wizard, click "Next" once more and acknowledge the necessary AWS resources to create. By clicking "Create stack" the deployment process starts.

Start the build process

You can monitor the deployment log using AWS CloudWatch. In his video, Erwan demonstrates this at around minute 23.

After a successful deployment, you will find two AWS CloudFormation stacks created. The Outputs tab gives you the direct links to SAStudioV and SASDrive.

SAS Studio and SAS Drive stacks

That’s it. You are deployed and ready to begin using your SAS Viya environment!

Additional Reference

Alexander Koller writes about SAS on AWS and takeaways for preparing for the AWS associate solution architect exam.

Your experiences and opinion matter

New forces are shaping the analytics ecosystem. Increased competition, rising customer expectations, and new, emerging technologies such as AI and machine learning are challenging IT departments to evolve their analytics ecosystems to meet the demands of their business partners.

How is your organization doing this? How does your Analytics Cloud strategy compare to the market? And what do your peers think about migrating Analytics to the cloud? We can give you some insights and an industry benchmark on the topic.

Tell us about your experience in this 5-minute survey, and we will be happy to share a detailed industry insight report with you that answers these questions.

Deploy SAS Viya on AWS - Quick Start was published on SAS Users.