8月 032018
 

In many movies, there is often a scene where the star says "We can do this the easy way, or the hard way" (and the hard way usually involves quite a bit of pain). So it is with interrogations ... and so it is with writing SAS code! Today I'm [...]

The post Cumulative values - "We can do this the easy way, or the hard way..." appeared first on SAS Learning Post.

8月 022018
 

SAS Viya has opened an entirely new set of capabilities, allowing SAS to analyze on cloud technology in real-time. One of the best new features of SAS Viya is its ability to pair with open source platforms, allowing developers the freedom of language and implementation to integrate with the power of SAS analytics.

At SAS Global Forum 2018, Sean Ankenbruck and Grace Heyne Lybrand from Zencos Consulting led the talk, SAS Viya: The Beauty of REST in Action. While the paper – and this blog post – outlines the use of Python and SAS Viya, note that SAS Viya integrates with R, Java and Lua as well.

Nonetheless, this Python integration example shows how easy it is to integrate SAS Viya and open source technologies. Here is the basic workflow:

1. A developer creates a web application, in a language of their choice
2. A user enters data in the web application
3. The collected data that is passed to Viya via the defined APIs
4. Analysis is performed in Viya using SAS actions
5. Results are passed back to the web application
6. The web application presents the results to the user

About the process

SAS’ Cloud Analytic Services (CAS) acts as a server to analyze data, and REST API’s are being used to integrate many programming languages into SAS Viya. REST stands for Representational State Transfer, and is a set of constraints that allows scalability and integration of multiple web-based systems. In layman’s terms, it’s a set of software design patterns that provides handy connector points from one web app to another. The REST API is what developers use to interact with and submit requests through the processing system.

CAS actions are what allow “tasks” to be completed on SAS Viya. These “tasks” are under the categories of Statistics, Analytics, System, and Data Mining and Machine Learning.

Integration with Python

To access CAS through Python, the SAS Scripting Wrapper for Analytics Transfer (SWAT) package is used, letting Python conventions dictate CAS actions. To create this interface, data must be captured through a web application in a format that Python can transmit to SAS Viya.
In order to connect Python and CAS, the following is necessary:

• Hostname
• CAS Port
• Username
• Password

Let’s see it in action

As an example, one project about wine preferences used CAS-collected data through a questionnaire stored in Python’s Pandas library. When the information was gathered, the decision tree was uploaded to SAS Viya. A model was created with common terms reviewers use to describe wines, feeding into a decision tree. The CAS server scored the users’ responses in real-time, and then sent the results back to the user providing them with suggested wines to match their inputs.

Process for model

Code to utilize tree:

conn.loadactionset("decisionTree") 
conn.decisionTree.dTreeTrain( 
      casOut = {"name":"tree_model"},
      inputs = [{vars}], 
      modelId = "DT_wine_variety", 
      table = {"caslib":"public", "name": "wines_model_data"}, 
      target = "variety")

"Decision" given to user

Conclusion

SAS Viya has opened SAS to a plethora of opportunities, allowing many different programming languages to be interpreted and quickly integrated, giving analysts and data scientists more flexibility.

Additional resources

At Your Service: Using SAS® Viya™ and Python to Create Worker Programs for Real-Time Analytics, Jon Klopfer, Scott Koval, and Mia List
SAS Viya
sas-viya-programming on github
python-swat on github
SAS Global Forum

Additional SAS Viya talks from SAS Global Forum

A Need for Speed: Loading Data via the Cloud, Henry Christoffels
Come On, Baby, Light my SAS® Viya®: Programming for CAS, David Shannon
Just Enough SAS® Cloud Analytic Services: CAS Actions for SAS® Visual Analytics Report Developers, Michael Drutar
Running SAS Viya on Oracle Cloud without Sacrificing Performance, Dan Grant
Command-Line Administration in SAS® Viya®, Danny Hamrick
Five Approaches for High-Performance Data Loading to the SAS® Cloud Analytic Services Server, Rob Collum

How SAS Viya uses REST APIs to integrate with Python was published on SAS Users.

8月 022018
 

Recently, Scott Jackson, Director Business Intelligence at the University of North Carolina at Chapel Hill shared their data quality, reporting and analytics journey. They're using SAS in a multitude of ways – from operations, institutional research, athletics – and are now looking to scale to the enterprise. They've been so successful [...]

Scaling data and analytics across the University of North Carolina was published on SAS Voices by Georgia Mariani

8月 012018
 

I started my computer career as a Unix system administrator, and when I began using SAS almost 30 years ago it was only natural that I used it on Unix computers (they were the most powerful/capable computers I had access to at the time). I've picked up a lot of [...]

The post SAS graphs and Excel data, from Unix! appeared first on SAS Learning Post.

8月 012018
 

Recently I have found myself part of the trend of ordering ‘easily prepared meals’ via the web. On the surface it seems simple. You click the meal you want on their website and a couple days later all the ingredients arrive for you to do the rest. As simple as 1,2,3 right? [...]

The post Data Preparation without Analytics – Is Just Data Preparation appeared first on The Data Roundtable.

8月 012018
 

When you use a regression procedure in SAS that supports variable selection (GLMSELECT or QUANTSELECT), did you know that the procedures automatically produce a macro variable that contains the names of the selected variables? This article provides examples and details. A previous article provides an overview of the 'SELECT' procedures in SAS for building statistical models.

The final model in PROC GLMSELECT

PROC GLMSELECT uses variable selection techniques such as LAR and LASSO to fit a parsimonious linear model from a large number of potential regressors. The following call to PROC GLMSELECT is adapted from the "Getting Started" example from the documentation, which models the log-transformed salaries of baseball players by using on-the-field statistics, player characteristics, and team attributes. When PROC GLMSELECT runs, it creates the _GLSIND macro variable, which contains the names of the effects in the final model.

/* GLMSELECT creates &_GLSInd for selected model */
proc glmselect data=Sashelp.Baseball;
   class league division;
   model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                  yrMajor crAtBat crHits crHome crRuns crRbi
                  crBB league division nOuts nAssts nError / selection=lasso;
quit;
 
%put &_GLSInd;      /* display names of effects for selected model */
 
/* Use _GLSInd to call PROC GLM on final selected model */
proc glmselect data=Sashelp.Baseball;
   class league division;
   model logSalary = &_GLSInd / solution;
quit;
--- SAS Log ---
%put &_GLSInd;
   nHits nRBI nBB YrMajor CrHits CrRuns CrRbi

The MODEL statement in PROC GLMSELECT includes 18 independent variables, but the final LASSO model contains only seven variables. The _GLSInd macro contains the name of the selected variables. As shown in the example, the macro can be used in subsequent analyses. The example uses the macro on the MODEL statement of PROC GLM. The _GLSIND variable contains the original variable names, which means that if you use the CLASS statement (or EFFECT statement) in the GLMSELECT model, you should specify the same statement in subsequent procedures.

If you want to use the selected variables in a procedure that does not support the CLASS statement (or EFFECT statement), you should use the OUTDESIGN= option in PROC GLMSELECT to generate a design matrix and use the _GLSMOD macro variable to specify the names of the dummy variables in the final model. An example is given at the end of this article.

The final model in PROC QUANTSELECT

Similarly, PROC QUANTSELECT creates a macro variable (named _QSLInd) that names the independent variables in the final selected model. For example, the following example models the conditional 90th percentile of the LogSalary variable for the same data set:

/* QUANTSELECT creates _QRSInd for selected model */
proc quantselect data=sashelp.baseball;
class league division;
model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                  yrMajor crAtBat crHits crHome crRuns crRbi
                  crBB league division nOuts nAssts nError / 
                  selection=lasso quantile=0.9;
run;
 
%put &_QRSInd;
--- SAS Log ---
%put &_QRSInd;
   nRBI CrHome nBB nRuns

When modeling the top 10% of salaries, the selected model includes two factors (career home runs and the number of runs scored in the previous year) that did not appear when predicting the conditional mean of the salaries. You can use the _QRSInd macro to build a predictive model in PROC QUANTREG.

PROC HPQUANTSELECT also creates a macro variable that contains the final selected independent variables. The macro is named _HPQRSIND.

Macros created by other selection procedures in SAS

The HPGENSELECT and LOGISTIC procedures, which can perform variable selection for generalized linear models, do not create a macro variable that contains the selected variables. However, the STEPDISC procedure creates a macro variable named _STDVar that contains the names of the quantitative variables that best discriminate among the classes in a discriminant analysis.

What happens if you use splines or split classification variables?

You might wonder what happens if you use the EFFECT statement to generate spline effects and/or use the SPLIT option on the CLASS statement to enable individual levels of a classification variable to enter/leave the model independently of other levels. (For the LASSO, LAR, and ELASTICNET methods, all spline effects and classification effects are split.) In this case, the final model is likely to contain only certain levels of a categorical variable or only certain basis functions for a spline effect. Consequently, probably you will need to use the design matrix and _GLSMOD macro variable in subsequent procedures.

For example, the following example creates a spline effect with five internal knots and splits the levels of two categorical variables as part of selecting a LASSO model. The final model contains the "Type='Sedan'" level, the "Origin='Asia'" level, and the first and fourth spline basis for the Weight variable:

proc glmselect data=Sashelp.Cars(where=(Type^="Hybrid"))
               outdesign=GLSSplitDesign;      /* create design matrix */
  class origin type / split;                  /* LASSO will force the SPLIT option */
  effect splWt  = spline(weight / details naturalcubic basis=tpf(noint) knotmethod=percentiles(5));
  model mpg_city = origin | type | splWt @ 2  /  selection=Lasso;
quit;
 
%put &_GLSInd;    /* model refers to effects that are not part of the input data */
%put &_GLSMod;    /* model refers to dummy variables that are in the OUTDESIGN= data set */
 
/* for split variables, use the _GLSMOD macro and the OUTDESIGN= data set */
proc glm data=GLSSplitDesign;
   model mpg_city = &_GLSMod / solution;
quit;
--- SAS Log ---
%put &_GLSInd;
   Type_Sedan Type_Sports Origin_Asia*Type_Sedan splWt:1 splWt:4 splWt:4*Origin_USA
%put &_GLSMod;
   Type_Sedan Type_Sports Origin_Asia_Type_Sedan splWt_1 splWt_4 splWt_4_Origin_USA

Notice that the _GLSIND macro contains names such as "splWt:1" that are not in the input data set. However, the names in the _GLSMOD macro are all valid columns in the GLSSplitDesign data set, which was created by using the OUTDESIGN= option. So if you are using the LASSO, LAR, or ELASTICNET methods of variable selection or if you manually specify the SPLIT option, you will want to use the _GLSMOD macro and the design data set in subsequent analyses.

The post Which variables are in the final selected model? appeared first on The DO Loop.

8月 012018
 

When you use a regression procedure in SAS that supports variable selection (GLMSELECT or QUANTSELECT), did you know that the procedures automatically produce a macro variable that contains the names of the selected variables? This article provides examples and details. A previous article provides an overview of the 'SELECT' procedures in SAS for building statistical models.

The final model in PROC GLMSELECT

PROC GLMSELECT uses variable selection techniques such as LAR and LASSO to fit a parsimonious linear model from a large number of potential regressors. The following call to PROC GLMSELECT is adapted from the "Getting Started" example from the documentation, which models the log-transformed salaries of baseball players by using on-the-field statistics, player characteristics, and team attributes. When PROC GLMSELECT runs, it creates the _GLSIND macro variable, which contains the names of the effects in the final model.

/* GLMSELECT creates &_GLSInd for selected model */
proc glmselect data=Sashelp.Baseball;
   class league division;
   model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                  yrMajor crAtBat crHits crHome crRuns crRbi
                  crBB league division nOuts nAssts nError / selection=lasso;
quit;
 
%put &_GLSInd;      /* display names of effects for selected model */
 
/* Use _GLSInd to call PROC GLM on final selected model */
proc glmselect data=Sashelp.Baseball;
   class league division;
   model logSalary = &_GLSInd / solution;
quit;
--- SAS Log ---
%put &_GLSInd;
   nHits nRBI nBB YrMajor CrHits CrRuns CrRbi

The MODEL statement in PROC GLMSELECT includes 18 independent variables, but the final LASSO model contains only seven variables. The _GLSInd macro contains the name of the selected variables. As shown in the example, the macro can be used in subsequent analyses. The example uses the macro on the MODEL statement of PROC GLM. The _GLSIND variable contains the original variable names, which means that if you use the CLASS statement (or EFFECT statement) in the GLMSELECT model, you should specify the same statement in subsequent procedures.

If you want to use the selected variables in a procedure that does not support the CLASS statement (or EFFECT statement), you should use the OUTDESIGN= option in PROC GLMSELECT to generate a design matrix and use the _GLSMOD macro variable to specify the names of the dummy variables in the final model. An example is given at the end of this article.

The final model in PROC QUANTSELECT

Similarly, PROC QUANTSELECT creates a macro variable (named _QSLInd) that names the independent variables in the final selected model. For example, the following example models the conditional 90th percentile of the LogSalary variable for the same data set:

/* QUANTSELECT creates _QRSInd for selected model */
proc quantselect data=sashelp.baseball;
class league division;
model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                  yrMajor crAtBat crHits crHome crRuns crRbi
                  crBB league division nOuts nAssts nError / 
                  selection=lasso quantile=0.9;
run;
 
%put &_QRSInd;
--- SAS Log ---
%put &_QRSInd;
   nRBI CrHome nBB nRuns

When modeling the top 10% of salaries, the selected model includes two factors (career home runs and the number of runs scored in the previous year) that did not appear when predicting the conditional mean of the salaries. You can use the _QRSInd macro to build a predictive model in PROC QUANTREG.

PROC HPQUANTSELECT also creates a macro variable that contains the final selected independent variables. The macro is named _HPQRSIND.

Macros created by other selection procedures in SAS

The HPGENSELECT and LOGISTIC procedures, which can perform variable selection for generalized linear models, do not create a macro variable that contains the selected variables. However, the STEPDISC procedure creates a macro variable named _STDVar that contains the names of the quantitative variables that best discriminate among the classes in a discriminant analysis.

What happens if you use splines or split classification variables?

You might wonder what happens if you use the EFFECT statement to generate spline effects and/or use the SPLIT option on the CLASS statement to enable individual levels of a classification variable to enter/leave the model independently of other levels. (For the LASSO, LAR, and ELASTICNET methods, all spline effects and classification effects are split.) In this case, the final model is likely to contain only certain levels of a categorical variable or only certain basis functions for a spline effect. Consequently, probably you will need to use the design matrix and _GLSMOD macro variable in subsequent procedures.

For example, the following example creates a spline effect with five internal knots and splits the levels of two categorical variables as part of selecting a LASSO model. The final model contains the "Type='Sedan'" level, the "Origin='Asia'" level, and the first and fourth spline basis for the Weight variable:

proc glmselect data=Sashelp.Cars(where=(Type^="Hybrid"))
               outdesign=GLSSplitDesign;      /* create design matrix */
  class origin type / split;                  /* LASSO will force the SPLIT option */
  effect splWt  = spline(weight / details naturalcubic basis=tpf(noint) knotmethod=percentiles(5));
  model mpg_city = origin | type | splWt @ 2  /  selection=Lasso;
quit;
 
%put &_GLSInd;    /* model refers to effects that are not part of the input data */
%put &_GLSMod;    /* model refers to dummy variables that are in the OUTDESIGN= data set */
 
/* for split variables, use the _GLSMOD macro and the OUTDESIGN= data set */
proc glm data=GLSSplitDesign;
   model mpg_city = &_GLSMod / solution;
quit;
--- SAS Log ---
%put &_GLSInd;
   Type_Sedan Type_Sports Origin_Asia*Type_Sedan splWt:1 splWt:4 splWt:4*Origin_USA
%put &_GLSMod;
   Type_Sedan Type_Sports Origin_Asia_Type_Sedan splWt_1 splWt_4 splWt_4_Origin_USA

Notice that the _GLSIND macro contains names such as "splWt:1" that are not in the input data set. However, the names in the _GLSMOD macro are all valid columns in the GLSSplitDesign data set, which was created by using the OUTDESIGN= option. So if you are using the LASSO, LAR, or ELASTICNET methods of variable selection or if you manually specify the SPLIT option, you will want to use the _GLSMOD macro and the design data set in subsequent analyses.

The post Which variables are in the final selected model? appeared first on The DO Loop.

7月 302018
 

I can recognize several languages when I hear people speaking them (mostly because I lived in the Alexander International Dorm at NC State University). Therefore when I found a map of the most common languages spoken in each US state, it caught my attention, and I decided to try creating my [...]

The post What's the most popular language in each US state? appeared first on SAS Learning Post.

7月 302018
 

I can recognize several languages when I hear people speaking them (mostly because I lived in the Alexander International Dorm at NC State University). Therefore when I found a map of the most common languages spoken in each US state, it caught my attention, and I decided to try creating my [...]

The post What's the most popular language in each US state? appeared first on SAS Learning Post.

7月 302018
 

A programmer recently asked a question on a SAS discussion forum about design matrices for categorical variables. He had generated a design matrix by using PROC GLMMOD and wanted to use the design columns in a subsequent procedure. However, the columns were named COL1, COL2, COL3,..., so he couldn't tell which dummy variables correspond to each categorical variable. The following example illustrates his situation for the Weight_Status and Smoking_Status variables in the Sashelp.Heart data set:

proc glmmod data=Sashelp.Heart outdesign=GLMDesign;
   class Weight_Status Smoking_Status;
   model Cholesterol = Weight_Status Smoking_Status;
   ods select Parameters;
run;
 
proc contents data=GLMDesign varnum short; run;
The association betwen columns in a design matrix and levels of the original categorical variables

The "Parameters" table shows the association between columns of the design matrix and levels of the categorical variables in the model. The output from PROC CONTENTS shows that the columns of the design matrix (as stored in the GLMDesign data set) are named COL1, COL2, and so forth. On the discussion forum, I showed how to save the "Parameters" table to a SAS data and use a DATA _NULL_ step to form macro variables that you can use to associate the design columns to the original variables.

In retrospect, I missed a golden opportunity to mention that the GLMMOD procedure (which always produces a singular design matrix) is not as friendly or powerful as other procedures in SAS that can generate design matrices. In particular, the GLMSELECT and TRANSREG procedures can create design matrices for many different parameterizations of the classification variables. Furthermore, these procedures automatically create macro variables that tell you the names of the columns in the design matrix.

As a reminder, a major reason to create a design matrix is to perform an analysis with a SAS procedure that does not support a CLASS statement. For example, the documentation of the MCMC procedure shows how to use PROC TRANSREG to create a design matrix as preparation for performing a Bayesian regression analysis. For details of the many ways to generate design matrices in SAS, see my previous article, "Four ways to create a design matrix in SAS."

GLMSELECT: An easier way to associate columns of a design matrix

The GLMSELECT procedure supports the OUTDESIGN= option, which enables you to output a design matrix for the variables in a regression model. The GLMSELECT procedure has the following advantages of the GLMMOD procedure:

  • The procedure supports the EFFECT statement, which you can use to define spline effects, collection effects, and more. The OUTDESIGN= option creates columns for any effect that you can define.
  • The procedure supports nonsingular parameterizations such as the 'effect' or 'reference' parameterizations (PARAM=EFFECT or PARAM=REF, respectively).
  • The dummy variables have meaningful names of the form VarName_Level, where VarName is the name of a categorical variable and Level is one of its values.
  • The procedure automatically creates a macro variable (called _GLSMod) that contains the names of the columns of the design matrix.

The following statements create a design matrix for the Weight_Status and Smoking_Status variables. The design matrix uses the 'effect coding' parameterization and is written to the GLSDesign data set. The value of the _GLSMod macro is displayed in the SAS log and can be used in a subsequent procedure call:

proc glmselect data=Sashelp.Heart outdesign(fullmodel)=GLSDesign noprint;
   class Weight_Status(ref='Normal')
         Smoking_Status(ref='Non-smoker') / param=effect;
   model Cholesterol = Weight_Status Smoking_Status / selection=none; /* include ALL dummy variables */
run;
 
ods select Position;
proc contents data=GLSDesign varnum; run;   /* display all variables in OUTDESIGN= data set */
 
%put &_GLSMod;                              /* names of columns in design matrix */
--- SAS Log ---
Weight_Status_Overweight Weight_Status_Underweight
Smoking_Status_Heavy__16_25_ Smoking_Status_Light__1_5_
Smoking_Status_Moderate__6_15_ Smoking_Status_Very_Heavy____25_

The output from PROC CONTENTS show the variables in the GLSDesign data set, which are INTERCEPT, CHOLESTEROL (the response variable in the model), and the dummy variables listed in the _GLSMod macro variable. By default, the dummy variables have the pattern VarName_Level, although you can use the OUTDESIGN(PREFIX=prefix) option to generate column names of the form prefix1, prefix2, and so forth, where prefix is a user-specified prefix. You can see that values such as "Heavy (16-25)" are transformed into valid names for SAS variables.

In this example, the _GLSMOD macro contains the names of ALL dummy variables because the SELECTION=NONE option is used. If you use an option (such as SELECTION=LASSO) to perform variable selection, the _GLSMOD variable will contain the names of the dummy variables that are selected in the final model. This means that you can use the _GLSMOD variable to perform additional analyses. For example, if you want to use POC REG to output collinearity diagnostics for the variables in the final model, you could execute the following:

/* use the variables in the final model in a procedure that does not support a CLASS statement */
proc reg data=GLSDesign plots=none; 
   model Cholesterol = &_GLSMod / collin;
run;

TRANSREG: Dummy variables and transformed variable

In a similar way, the TRANSREG procedure supports the DESIGN option, which adds dummy variables to the output data set, as shown in the following call. The syntax is different, but the dummy variables in the TRGDesign data set are formed by the same 'effect coding' as in the previous example. The procedure automatically creates a macro variable (called _TRGInd) that contains the names of the columns of the design matrix.

proc transreg data=Sashelp.Heart design;
   model class(Weight_Status Smoking_Status / EFFECTS 
          zero="Normal"      "Non-smoker");
   output out=TRGDesign;
run;
 
%put &_TRGInd;
Weight_StatusOverweight Weight_StatusUnderweight Smoking_StatusHeavy__16_25_ Smoking_StatusLight__1_5_
Smoking_StatusModerate__6_15_ Smoking_StatusVery_Heavy____25_

The dummy variables contain the same values as in the previous example, but the two procedures construct slightly different names. The GLMSELECT names contain an extra underscore. For example, a dummy variable in the GLSDesign data set is Weight_Status_Overweight whereas the corresponding variable in the TRGDesign data set is Weight_StatusOverweight.

It is worth noting the PROC TRANSREG also supports macro variables that specify the names of dependent variables and transformed variables. You can even use the MACRO option to specify the name of the macro variables that are created. For details, see the documentation for the PROC TRANSREG statement.

In summary, when you generate a design matrix by using the GLMSELECT or TRANSREG procedures, the procedures create dummy variables that have meaningful names of the form VarName_Level or VarNameLevel. The procedures each create a macro variable that contains the names of the dummy variables. The _GLSMOD macro that is created by PROC GLMSELECT contains the names of dummy variables in the final selected model, so use SELECTION=NONE if you want all names. These macros make it easy for a programmer to refer to columns of the design matrix.

The post Meaningful names for columns of a design matrix appeared first on The DO Loop.