sas press

8月 272020
 

Decision trees are a fundamental machine learning technique that every data scientist should know. Luckily, the construction and implementation of decision trees in SAS is straightforward and easy to produce.

There are simply three sections to review for the development of decision trees:

  1. Data
  2. Tree development
  3. Model evaluation

Data

The data that we will use for this example is found in the fantastic UCI Machine Learning Repository. The data set is titled “Bank Marketing Dataset,” and it can be found at: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing#

This data set represents a direct marketing campaign (phone calls) conducted by a Portuguese banking institution. The goal of the direct marketing campaign was to have customers subscribe to a term deposit product. The data set consists of 15 independent variables that represent customer attributes (age, job, marital status, education, etc.) and marketing campaign attributes (month, day of week, number of marketing campaigns, etc.).

The target variable in the data set is represented as “y.” This variable is a binary indicator of whether the phone solicitation resulted in a sale of a term deposit product (“yes”) or did not result in a sale (“no”). For our purposes, we will recode this variable and label it as “TARGET,” and the binary outcomes will be 1 for “yes” and 0 for “no.”

The data set is randomly split into two data sets at a 70/30 ratio. The larger data set will be labeled “bank_train” and the smaller data set will be labeled “bank_test”. The decision tree will be developed on the bank_train data set. Once the decision tree has been developed, we will apply the model to the holdout bank_test data set.

Tree development

The code below specifies how to build a decision tree in SAS. The data set mydata.bank_train is used to develop the decision tree. The output code file will enable us to apply the model to our unseen bank_test data set.

ODS GRAPHICS ON;
 
PROC HPSPLIT DATA=mydata.bank_train;
 
    CLASS TARGET _CHARACTER_;
 
    MODEL TARGET(EVENT='1') = _NUMERIC_ _CHARACTER_;
 
    PRUNE costcomplexity;
 
    PARTITION FRACTION(VALIDATE=<strong>0.3</strong> SEED=<strong>42</strong>);
 
    CODE FILE='C:/Users/James Gearheart/Desktop/SAS Book Stuff/Data/bank_tree.sas';
 
    OUTPUT OUT = SCORED;
 
run;

The output of the decision tree algorithm is a new column labeled “P_TARGET1”. This column shows the probability of a positive outcome for each observation. The output also contains the standard tree diagram that demonstrates the model split points.

Model evaluation

Once you have developed your model, you will need to evaluate it to see whether it meets the needs of the project. In this example, we want to make sure that the model adequately predicts which observation will lead to a sale.

The first step is to apply the model to the holdout bank_test data set.

DATA test_scored;
 
    SET MYDATA.bank_test;
 
    %INCLUDE 'C:/Users/James Gearheart/Desktop/SAS Book Stuff/Data/bank_tree.sas';
 
RUN;

The %INCLUDE statement applied the decision tree algorithm to the bank_test data set and created the P_TARGET1 column for the bank_test data set.

Now that the model has been applied to the bank_test data set, we will need to evaluate the performance of the model by creating a lift table. Lift tables provide additional information that has been summarized in the ROC chart. Remember that every point along the ROC chart is a probability threshold. The lift table provides detailed information for every point along the ROC curve.

The model evaluation macro that we will use was developed by Wensui Liu. This easy-to-use macro is labeled “separation” and can be applied to any binary classification model output to evaluate the model results.

You can find this macro in my GitHub repository for my new book, End-to-End Data Science with SAS®. This GitHub repository contains all of the code demonstrated in the book along with all of the macros that were used in the book.

This macro on my C drive, and we call it with a %INCLUDE statement.

%INCLUDE 'C:/Users/James Gearheart/Desktop/SAS Book Stuff/Projects/separation.sas';
 
%<em>separation</em>(data = test_scored, score = P_TARGET1, y = target);

The score script that was generated from the CODE FILE statement in the PROC HPSPLIT procedure is applied to the holdout bank_test data set through the use of the %INCLUDE statement.

The table below is generated from the lift table macro.

This table shows that that model adequately separated the positive and negative observations. If we examine the top two rows of data in the table, we can see that the cumulative bad percent for the top 20% of observations is 47.03%. This can be interpreted as we can identify 47.03% of positive cases by selecting the top 20% of the population. This selection is made by selecting observations with a P_TARGET1 score greater than or equal to 0.8276 as defined by the MAX SCORE column.

Additional information about decision trees along with several other model designs are reviewed in detail in my new book End-to-End Data Science with SAS® available at Amazon and SAS.com.

Build a decision tree in SAS was published on SAS Users.

8月 252020
 

Analytics is playing an increasingly strategic role in the ongoing digital transformation of organizations today. However, to succeed and scale your digital transformation efforts, it is critical to enable analytics skills at all tiers of your organization. In a recent blog post covering 4 principles of analytics you cannot ignore, SAS COO Oliver Schabenberger articulated the importance of democratizing analytics. By scaling your analytics efforts beyond traditional data science teams and involving more people with strong business domain knowledge, you can gain more valuable insights and make more significant impacts.

SAS Viya was built from the ground up to fulfill this vision of democratizing analytics. At SAS, we believe analytics should be accessible to everyone. While SAS Viya offers tremendous support and will continue to be the tool of choice for many advanced users and programmers, it is also highly accessible for business analysts and insights team who prefer a more visual approach to analytics and insights discovery.

Self-service data management

First of all, SAS Viya makes it easy for anyone to ingest and prepare data without a single line of code. The integrated data preparation components within SAS Viya support ad-hoc, agile-oriented data management tasks where you can profile, cleanse, and join data easily and rapidly.

Automatically Generated Data Profiling Report

You can execute complex joins, create custom columns, and cleanse your data via a completely drag-and-drop interface. The automation built into SAS Viya eases the often tedious task of data profiling and data cleansing via automated data type identification and transform suggestions. In an area that can be both complex and intimidating, SAS Viya makes data management tasks easy and approachable, helping you to analyze more data and uncover more insights.

Data Join Using a Visual Interface

A visual approach supporting low-code and no-code programming

Speaking of no-code, SAS Viya’s visual approach and support extend deep into data exploration and advanced modeling. Not only can you quickly build charts such as histograms and box plots using a drag and drop interface, but you can also build complex machine learning models using algorithms such as decision trees and logistic regression on the same visual canvas.

Building a Decision Tree Model Using SAS Viya

By putting the appropriate guard rails and providing relevant and context-rich help for the user, SAS Viya empowers users to undertake data analysis using other advanced analytics techniques such as forecasting and correlation analysis. These techniques empower users to ask more complex questions and can potentially help uncover more actionable and valuable insights.

Correlation Analysis Using the Correlation Matrix within SAS Viya

Augmented analytics

Augmented analytics is an emerging area of analytics that leverages machine learning to streamline and automate the process of doing analytics and building machine learning models. SAS Viya leverages augmented analytics throughout the platform to automate various tasks. My favorite use of augmented analytics in SAS Viya, though, is the hyperparameters autotuning feature.

In machine learning, hyperparameters are parameters that you need to set before the learning processing can begin. They are only used during the training process and contribute significantly to the model training process. It can often be challenging to set the optimal hyperparameter settings, especially if you are not an experienced modeler. This is where SAS Viya can help by making building machine learning models easier for everyone one hyperparameter at a time.

Here is an example of using the SAS Viya autotuning feature to improve my decision tree model. Using the autotuning window, all I needed to do was tell SAS Viya how long I want the autotuning process to run for. It will then work its magic and determine the best hyperparameters to use, which, in this case, include the Maximum tree level and the number of Predictor bins. In most cases, you get a better model after coming back from getting a glass of water!

Hyperparameters Autotuning in SAS Viya

Under the hood, SAS Viya uses complex optimization techniques to try to find the best hyperparameter combinations to use all without you having to understand how it manages this impressive feat. I should add that hyperparameters autotuning is supported with many other algorithms in SAS Viya, and you have even more autotuning options when using it via the programmatic interface!

By leveraging a visually oriented framework and augmented analytics capabilities, SAS Viya is making analytics easier and machine learning models more accessible for everyone within an organization. For more on how SAS Viya enables everyone to ask more complex questions and uncover more valuable insights, check out my book Smart Data Discovery Using SAS® Viya®.

Analytics for everyone with SAS Viya was published on SAS Users.

8月 102020
 

The most fundamental concept that students learning introductory SAS programming must master is how SAS handles data. This might seem like an obvious statement, but it is often overlooked by students in their rush to produce code that works. I often tell my class to step back for a moment and "try to think like SAS" before they even touch the keyboard. There are many key topics that students must understand in order to be successful SAS programmers. How does SAS compile and execute a program? What is the built-in loop that SAS uses to process data observation by observation? What are the coding differences when working with numeric and character data? How does SAS handle missing observations?

One concept that is a common source of confusion for students is how to tell SAS to treat rows versus columns. An example that we use in class is how to write a program to calculate a basic descriptive statistic, such as the mean. The approach that we discuss is to identify our goal, rows or columns, and then decide what SAS programming statements are appropriate by thinking like SAS. First, we decide if we want to calculate the mean of an observation (a row) or the mean of a variable (a column). We also pause to consider other issues such as the type of variable, in this case numeric, and how SAS evaluates missing data. Once these concepts are understood we can proceed with an appropriate method: using DATA step programming, a procedure such as MEANS, TABULATE, REPORT or SQL, and so on. For more detailed information about this example there is an excellent user group paper on this topic called "Many Means to a Mean" written by Shannon Pileggi for the Western Users of SAS Software conference in 2017. In addition, The Little SAS® Book and its companion book, Exercises and Projects for the Little SAS® Book, Sixth Edition address these types of topics in easy-to-understand examples followed up with thought-provoking exercises.

Here is an example of the type of question that our book of exercises and projects uses to address this type of concept.

Short answer question

  1. Is there a difference between calculating the mean of three variables X1, X2, and X3 using the three methods as shown in the following examples of code? Explain your answer.
    Avg1 = MEAN(X1,X2,X3);
    Avg2 = (X1 + X2 + X3) / 3;
    PROC MEANS; VAR X1 X2 X3; RUN;

Solution

In the book, we provide solutions for odd-numbered multiple choice and short answer questions, and hints for the programming exercises. Here is the solution for this question:

  1. The variable Avg1 that uses the MEAN function returns the mean of nonmissing arguments and will provide a mean value of X1, X2, and X3 for each observation (row) in the data set. The variable Avg2 that uses an arithmetic equation will also calculate the mean for each observation (row), but will return a missing value if any of the variables for that observation have a missing value. Using PROC MEANS will calculate the mean of nonmissing data for each variable (column) X1, X2, and X3 vertically.

For more information about The Little SAS Book and its companion book of exercises and projects, check out these blogs:

Learning to think like SAS was published on SAS Users.

7月 142020
 

In my new book, End-to-End Data Science with SAS: A Hands-On Programming Guide, I use the 1.5 IQR rule to adjust multiple variables.  This program utilizes a macro that loops through a list of variables to make the necessary adjustments and creates an output data set.

One of the most popular ways to adjust for outliers is to use the 1.5 IQR rule. This rule is very straightforward and easy to understand. For any continuous variable, you can simply multiply the interquartile range by the number 1.5. You then add that number to the third quartile. Any values above that threshold are suspected as being an outlier. You can also perform the same calculation on the low end. You can subtract the value of IQR x 1.5 from the first quartile to find low-end outliers.

The process of adjusting for outliers can be tedious if you have several continuous variables that are suspected as having outliers. You will need to run PROC UNIVARIATE on each variable to identify its median, 25th percentile, 75th percentile, and interquartile range. You would then need to develop a program that identifies values above and below the 1.5 IQR rule thresholds and overwrite those values with new values at the threshold.

The following program is a bit complicated, but it automates the process of adjusting a list of continuous variables according to the 1.5 IQR rule. This program consists of three distinct parts:

    1. Create a BASE data set that excludes the variables contained in the &outliers global macro. Then create an OUTLIER data set that contains only the unique identifier ROW_NUM and the outlier variables.
    2. Create an algorithm that loops through each of the outlier variables contained in the global variable &outliers and apply the 1.5 IQR rule to cap each variable’s range according to its unique 1.5 IQR value.
    3. Merge the newly restricted outlier variable with the BASE data set.
/*Step 1: Create BASE and OUTLIER data sets*/
 
%let outliers = /*list of variables*/;
 
DATA MYDATA.BASE;
    SET MYDATA.LOAN_ADJUST (DROP=&amp;outliers.);
    ROW_NUM = _N_;
RUN;
 
DATA outliers;
    SET MYDATA.LOAN_ADJUST (KEEP=&amp;outliers. ROW_NUM);
    ROW_NUM = _N_;
RUN;
 
 /*Step 2: Create loop and apply the 1.5 IQR rule*/
 
%MACRO loopit(mylist);
    %LET n = %SYSFUNC(countw(&amp;mylist));
 
    %DO I=1 %TO &amp;n;
        %LET val = %SCAN(&amp;mylist,&amp;I);
 
        PROC UNIVARIATE DATA = outliers ;
            VAR &amp;val.;
            OUTPUT OUT=boxStats MEDIAN=median QRANGE=iqr;
        run;
 
        data _NULL_;
           SET boxStats;
           CALL symput ('median',median);
           CALL symput ('iqr', iqr);
        run;
 
        %PUT &amp;median;
        %PUT &amp;iqr;
 
        DATA out_&amp;val.(KEEP=ROW_NUM &amp;val.);
        SET outliers;
 
       IF &amp;val. ge &amp;median + 1.5 * &amp;iqr THEN
           &amp;val. = &amp;median + 1.5 * &amp;iqr;
       RUN;
 
/*Step 3: Merge restricted value to BASE data set*/
 
       PROC SQL;
           CREATE TABLE MYDATA.BASE AS
               SELECT *
               FROM MYDATA.BASE AS a
               LEFT JOIN out_&amp;val. as b
                   on a.ROW_NUM = b.ROW_NUM;
       QUIT;
 
    %END;
%MEND;
 
%LET list = &amp;outliers;
%loopit(&amp;list);

Notes on the outlier adjustment program:

  • A macro variable is created that contains all of the continuous variables that are suspected of having outliers.
  • Separate data sets were created: one that contains all of the outlier variables and one that excludes the outlier variables.
  • A macro program is developed to contain the process of looping through the list of variables.
  • A macro variable (n) is created that counts the number of variables contained in the macro variable.
  • A DO loop is created that starts at the first variable and runs the following program on each variable contained in the macro variable.
  • PROC UNIVARIATE identifies the variable’s median and interquartile range.
  • A macro variable is created to contain the values of the median and interquartile range.
  • A DATA step is created to adjust any values that exceed the 1.5 IQR rule on the high end and the low end.
  • PROC SQL adds the adjusted variables to the BASE data set.

This program might seem like overkill to you. It could be easier to simply adjust outlier variables one at a time. This is often the case; however, when you have a large number of outlier variables, it is often beneficial to create an algorithm to transform them efficiently and consistently

Adjusting outliers with the 1.5 IQR rule was published on SAS Users.

6月 112020
 

Whether you enjoy debugging or hate it, for programmers, debugging is a fact of life. It’s easy to misspell a keyword, scramble your array subscripts, or (heaven forbid!) forget a semicolon. That’s why we include a chapter on debugging in The Little SAS® Book and its companion book, Exercises and Projects for the Little SAS® Book. We believe that learning to debug makes you a better programmer. Once you understand a bug, you will be better prepared to avoid it in the future.

To help hone your debugging skills, here is an example of the type of problems you can find in our book of exercises and projects. See if you can find the bugs.

Programming exercise

  1. A friend tells you that she is learning SAS and wrote the following program. Unfortunately, the program won’t run. Help her improve her programming skills by finding the mistakes.

TITLE Height, Weight, and BMI;
TITLE2 by Sex and Age Group;
PROC CONTENT DATA = SASHELP.class; RUN;
DATA; SET SASHELP.class;
Height_m = Heigth * 0.0254;
Weight_kg = Weight * 0.4536;
BMI = Weight_kg / Height_m**2;
PROC FORMAT; VALUE
$sex 'M' = 'Boys' 'F' = 'Girls';
VALUE agegp 11-12 = 'Preteens
13-16 = 'Teens';
PROC TABULATE;
CLASS Sex Age; VAR Height_m Weight_kg;
TABLES (Height_m Weight_kg BMI)*
MEAN, Sex Age ALL;
FORMAT Sex $sex. Age agegp.;
RUN;
QUIT;

 

  • a. Examine the SAS data set SASHELP.CLASS including variable attributes.
  • b. Clean up the formatting of the program by adding appropriate indention and line spacing to show the structure of the DATA and PROC steps. Make changes as needed to make the program conform to standard best practices.
  • c. Fix any errors in the code so that the program will run correctly.
  • d. Add comments to the revised program for each bug that you fix so that your friend can understand her mistakes.

Solution

In the book, we provide solutions for odd-numbered multiple choice and short answer questions, and hints for the programming exercises. So here is a hint for this exercise:

  1. Hint: This program contains four bugs. It also contains “red herrings” that are unusual for SAS code, but nonetheless do run properly and so are not actual bugs. Be sure you know how SAS handles data set names by default. SAS Enterprise Guide can format code for you; right-click the Program window and select Format Code from the pop-up menu. To format code in SAS Studio, click the Format Code icon at the top of Program window.

For more about The Little SAS Book and its companion book of exercises and projects, check out these blogs:

What's wrong with this code? was published on SAS Users.

3月 102020
 

If you have been using SAS for long, you have probably noticed that there is generally more than one way to do anything. (For an example, see my co-author Lora Delwiche’s blog about PROC SQL.) The Little SAS Book has long covered reading and writing Microsoft Excel files with the IMPORT and EXPORT procedures, but for the Sixth Edition, we decided it was time to add two more ways: The ODS EXCEL destination makes it easy to convert procedure results into Excel files, while the XLSX LIBNAME engine allows you to access Excel files as if they were SAS data sets.

With the XLSX LIBNAME engine, you can convert an Excel file to a SAS data set (or vice versa) if you want to, but you can also access an Excel file directly without the need for a SAS data set. This engine works for files created using any version of Microsoft Excel 2007 or later in the Windows or UNIX operating environments. You must have SAS 9.4M2 or higher and SAS/ACCESS Interface to PC Files software. A nice thing about this engine is that it works with any combination of 32-bit and 64-bit systems.

The XLSX LIBNAME engine uses the first line in your file for the variable names, scans each full column to determine the variable type (character or numeric), assigns lengths to character variables, and recognizes dates, and numeric values containing commas or dollar signs. While the XLSX LIBNAME engine does not offer many options, because you are using an Excel file like a SAS data set, you can use many standard data set options. For example, you can use the RENAME= data set option to change the names of variables, and FIRSTOBS= and OBS= to select a subset of rows.

Reading an Excel file as is 

Suppose you have the following Excel file containing data about magnolia trees:

With the XLSX LIBNAME engine, SAS can read the file, without first converting it to a SAS data set. Here is a PROC PRINT that prints the data directly from the Excel file.

* Read an Excel spreadsheet using XLSX LIBNAME;
LIBNAME exfiles XLSX 'c:\MyExcel\Trees.xlsx';

PROC PRINT DATA = exfiles.sheet1;
   TITLE 'PROC PRINT of Excel File';
RUN;

Here are the results of the PROC PRINT. Notice that the variable names were taken from the first row in the file.

PROC PRINT of Excel File

Converting an Excel file to a SAS data set 

If you want to convert an Excel file to a SAS data set, you can do that too. Here is a DATA step that reads the Excel file. The RENAME= data set option changes the variable name MaxHeight to MaxHeightFeet. Then a new variable is computed which is equal to the height in meters.

* Import Excel into a SAS data set and compute height in meters;
DATA magnolia;
   SET exfiles.sheet1 (RENAME = (MaxHeight = MaxHeightFeet));
   MaxHeightMeters = ROUND(MaxHeightFeet * 0.3048);
RUN;

Here is the SAS data set with the renamed and new variables:


Writing to an Excel file 

It is just as easy to write to an Excel file as it is to read from it.

* Write a new sheet to the Excel file;
DATA exfiles.trees;
   SET magnolia;
RUN;
LIBNAME exfiles CLEAR;

Here is what the Excel file looks like with the new sheet. Notice that the new tab is labeled with the name of the SAS data set TREES.

The XLSX LIBNAME engine is so flexible and easy to use that we think it’s a great addition to any SAS programmer’s skill set.

To learn more about the content in The Little SAS Book, check out the free book excerpt.  To see up-and-coming titles and get exclusive discounts, make sure to subscribe to the SAS Books newsletter.

Accessing Excel files using LIBNAME XLSX was published on SAS Users.

3月 052020
 

Have you heard that SAS offers a collection of new, high-performance CAS procedures that are compatible with a multi-threaded approach? The free e-book Exploring SAS® Viya®: Data Mining and Machine Learning is a great resource to learn more about these procedures and the features of SAS® Visual Data Mining and Machine Learning. Download it today and keep reading for an excerpt from this free e-book!

In SAS Studio, you can access tasks that help automate your programming so that you do not have to manually write your code. However, there are three options for manually writing your programs in SAS® Viya®:

  1. SAS Studio provides a SAS programming environment for developing and submitting programs to the server.
  2. Batch submission is also still an option.
  3. Open-source languages such as Python, Lua, and Java can submit code to the CAS server.

In this blog post, you will learn the syntax for two of the new, advanced data mining and machine learning procedures: PROC TEXTMINE and PROCTMSCORE.

Overview

The TEXTMINE and TMSCORE procedures integrate the functionalities from both natural language processing and statistical analysis to provide essential functionalities for text mining. The procedures support essential natural language processing (NLP) features such as tokenizing, stemming, part-of-speech tagging, entity recognition, customized stop list, and so on. They also support dimensionality reduction and topic discovery through Singular Value Decomposition.

In this example, you will learn about some of the essential functionalities of PROC TEXTMINE and PROC TMSCORE by using a text data set containing 1,830 Amazon reviews of electronic gaming systems. The data set is named Amazon. You can find similar data sets of Amazon reviews at http://jmcauley.ucsd.edu/data/amazon/.

PROC TEXTMINE

The Amazon data set has already been loaded into CAS. The review content is stored in the variable ReviewBody, and we generate a unique review ID for each review. In the proc call shown in Program 1 we ask PROC TEXTMINE to do three tasks:

  1. parse the documents in table reviews and generate the term by document matrix
  2. perform dimensionality reduction via Singular Value Decomposition
  3. perform topic discovery based on Singular Value Decomposition results

Program 1: PROC TEXTMINE

data mycaslib.amazon;
    set mylib.amazon;
run;

data mycaslib.engstop;
    set mylib.engstop;
run;

proc textmine data=mycaslib.amazon;
    doc_id id;
    var reviewbody;

 /*(1)*/  parse reducef=2 entities=std stoplist=mycaslib.engstop 
          outterms=mycaslib.terms outparent=mycaslib.parent
          outconfig=mycaslib.config;

 /*(2)*/  svd k=10 svdu=mycaslib.svdu outdocpro=mycaslib.docpro
          outtopics=mycaslib.topics;

run;

(1) The first task (parsing) is specified in the PARSE statement. Parameter “reducef” specifies the minimum number of times a term needs to appear in the text to be included in the analysis. Parameter “stop” specifies a list of terms to be excluded from the analysis, such as “the”, “this”, and “that”. Outparent is the output table that stores the term by document matrix, and Outterms is the output table that stores the information of terms that are included in the term by document matrix. Outconfig is the output table that stores configuration information for future scoring.

(2) Tasks 2 and 3 (dimensionality reduction and topic discovery) are specified in the SVD statement. Parameter K specifies the desired number of dimensions and number of topics. Parameter SVDU is the output table that stores the U matrix from SVD calculations, which is needed in future scoring. Parameter OutDocPro is the output table that stores the new matrix with reduced dimensions. Parameter OutTopics specifies the output table that stores the topics discovered.

Click the Run shortcut button or press F3 to run Program 1. The terms table shown in Output 1 stores the tagging, stemming, and entity recognition results. It also stores the number of times each term appears in the text data.

Output 1: Results from Program 1

PROC TMSCORE

PROC TEXTMINE is used with large training data sets. When you have new documents coming in, you do not need to re-run all the parsing and SVD computations with PROC TEXTMINE. Instead, you can use PROC TMSCORE to score new text data. The scoring procedure parses the new document(s) and projects the text data into the same dimensions using the SVD weights derived from the original training data.

In order to use PROC TMSCORE to generate results consistent with PROC TEXTMINE, you need to provide the following tables generated by PROC TEXTMINE:

  • SVDU table – provides the required information for projection into the same dimensions.
  • Config table – provides parameter values for parsing.
  • Terms table – provides the terms that should be included in the analysis.

Program 2 shows an example of TMSCORE. It uses the same input data layout used for PROC TEXTMINE code, so it will generate the same docpro and parent output tables, as shown in Output 2.

Program 2: PROC TMSCORE

Proc tmscore data=mycaslib.amazon svdu=mycaslib.svdu
        config=mycaslib.config terms=mycaslib.terms
        svddocpro=mycaslib.score_docpro outparent=mycaslib.score_parent;
    var reviewbody;
    doc_id id;
run;

 

Output 2: Results from Program 2

To learn more about advanced data mining and machine learning procedures available in SAS Viya, including PROC FACTMAC, PROC TEXTMINE, and PROC NETWORK, you can download the free e-book, Exploring SAS® Viya®: Data Mining and Machine Learning. Exploring SAS® Viya® is a series of e-books that are based on content from SAS® Viya® Enablement, a free course available from SAS Education. You can follow along with examples in real time by watching the videos.

 

Learn about new data mining and machine learning procedures in SAS Viya was published on SAS Users.

2月 262020
 

Do you wish you could predict the likelihood that one of your customers will open your marketing email? Or what if you could tell whether a new medical treatment for a patient will have a better outcome than the standard treatment? If you are familiar with propensity modeling, then you know such predictions about future behavior are possible! Propensity models generate a propensity score, which is the probability that a future behavior will occur. Propensity models are used often in machine learning and predictive data analytics, particularly in the fields of marketing, economics, business, and healthcare. These models can detect and remove bias in analysis of real-world, observational data where there is no control group.

SAS provides several approaches for calculating propensity scores. This excerpt from the new book, Real World Health Care Data Analysis: Causal Methods and Implementation Using SAS®, discusses one approach for estimating propensity scores and provides associated SAS code. The example code and data used in the examples is available to download here.

A priori logistic regression model

One approach to estimating a propensity score is to fit a logistic regression model a priori, that is, identify the covariates in the model and fix the model before estimating the propensity score. The main advantage of an a priori model is that it allows researchers to incorporate knowledge external to the data into the model building. For example, if there is evidence that a covariate is correlated to the treatment assignment, then this covariate should be included in the model even if the association between this covariate and the treatment is not strong in the current data. In addition, the a priori model is easy to interpret. The directed acyclic graph approach could be very informative in building a logistic propensity score model a priori, as it clearly points out the relationship between covariates and interventions. The correlation structure between each covariate and the intervention selection is pre-specified and in a fixed form. However, one main challenge of the a priori modeling approach is that it might not provide the optimal balance between treatment and control groups.

Building an a priori model

To build an a priori model for propensity score estimation in SAS, we can use either PROC PSMATCH or PROC LOGISTIC as shown in Program 1. In both cases, the input data set is a one observation per patient data set containing the treatment and baseline covariates from the simulated REFLECTIONS study. Also, in both cases the code will produce an output data set containing the original data set with the additional estimated propensity score for each patient (_ps_).

Program 1: Propensity score estimation: a priori logistic regression

PROC PSMATCH DATA=REFL2 REGION=ALLOBS;
  CLASS COHORT GENDER RACE DR_RHEUM DR_PRIMCARE;
  PSMODEL COHORT(TREATED='OPIOID')= GENDER RACE AGE BMI_B BPIINTERF_B BPIPAIN_B
             CPFQ_B FIQ_B GAD7_B ISIX_B PHQ8_B PHYSICALSYMP_B SDS_B DR_RHEUM
             DR_PRIMCARE;
  OUTPUT OUT=PS PS=_PS_;
RUN;

PROC LOGISTIC DATA=REFL2;
  CLASS COHORT GENDER RACE DR_RHEUM DR_PRIMCARE;
  MODEL COHORT = GENDER RACE AGE BMI_B BPIINTERF_B BPIPAIN_B CPFQ_B FIQ_B GAD7_B
           ISIX_B PHQ8_B PHYSICALSYMP_B SDS_B DR_RHEUM DR_PRIMCARE;
  OUTPUT OUT=PS PREDICTED=PS;
RUN;

Before building a logistic model in SAS, we suggest examining the distribution of the intervention indicator at each level of the categorical variable to rule out the possibility of “complete separation” (or “perfect prediction”), which means that for subjects at some level of some categorical variable, they would all receive one intervention but not the other. Complete separation can occur for several reasons and one common example is when using several categorical variables whose categories are coded by indicators. When the logistic regression model is fit, the estimate of the regression coefficients βs is based on the maximum likelihood estimation, and MLEs under logistic regression modeling do not have a closed form. In other words, the MLE β̂ cannot be written as a function of Xi and Ti. Thus, the MLE of βs are obtained using some numerical analysis algorithms such as the Newton-Raphson method. However, if there is a covariate X that can completely separate the interventions, then the procedure will not converge in SAS. If PROC LOGISTIC was used, the following warning message will be issued.

WARNING: There is a complete separation of data points. The maximum likelihood estimate does not exist.

WARNING: The LOGISTIC procedure continues in spite of the above warning. Results shown are based on the last maximum likelihood iteration. Validity of the model fit is questionable.

Notice that SAS will continue to finish the computation despite issuing warning messages. However, the estimate of such βs are incorrect, and so are the estimated propensity scores. If after examining the intervention distribution at each level of the categorical variables complete separation is found, then efforts should be made to address this issue. One possible solution is to collapse the categorical variable causing the problem. That is, combine the different outcome categories such that the complete separation no longer exists.

Firth logistic regression

Another possible solution is to use Firth logistic regression. It uses a penalized likelihood estimation method. Firth bias-correction is considered an ideal solution to the separation issue for logistic regression (Heinze and Schemper, 2002). In PROC LOGISTIC, we can add an option to run the Firth logistic regression as shown in Program 2.

Program 2: Firth logistic regression

PROC LOGISTIC DATA=REFL2;
  CLASS COHORT GENDER RACE DR_RHEUM DR_PRIMCARE;
  MODEL COHORT = GENDER RACE DR_RHEUM DR_PRIMCARE BPIInterf_B BPIPain_B 
        CPFQ_B FIQ_B GAD7_B ISIX_B PHQ8_B PhysicalSymp_B SDS_B / FIRTH;
  OUTPUT OUT=PS PREDICTED=PS;
RUN;

 

References

Heinze G, Schemper M (2002). A solution to the problem of separation in logistic regression. Statistics in Medicine 21.16: 2409-2419.

Propensity Score Estimation with PROC PSMATCH and PROC LOGISTIC was published on SAS Users.

2月 182020
 

In case you missed the news, there is a new edition of The Little SAS Book! Last fall, we completed the sixth edition of our book, and even though it is actually a few pages shorter than the fifth edition, we managed to add many more topics to the book. See if you can answer this question.

The answer is D – all of the above! We also added new sections on subsetting, summarizing, and creating macro variables using PROC SQL, new sections on the XLSX LIBNAME engine and ODS EXCEL, more on iterative DO statements, a new section on %DO, and more. For a summary of all the changes, see our blog post “The Little SAS Book 6.0: The best-selling SAS book gets even better."

Updating The Little SAS Book meant updating its companion book, Exercises and Projects for The Little SAS Book, as well. The exercises and projects book contains multiple choice and short answer questions as well as programming exercises that cover the same topics that are in The Little SAS Book. The exercises and projects book can be used in a classroom setting, or for anyone wanting to test their SAS knowledge and practice what they have learned.

Here are examples of the types of questions you might find in the exercises and projects book.

Multiple Choice

Short Answer

Programming Exercise

Solutions

In the book, we provide solutions for odd-numbered multiple choice and short answer questions and hints for the programming exercises.

  1. B
  2. Hint: New variables (columns) can be specified in the SELECT clause. Also, see our blog post “Expand your SAS Knowledge by Learning PROC SQL.”

While we don’t provide solutions for even-numbered questions, we can tell you that the iterative DO statement is covered in Section 3.12 of The Little SAS Book, Sixth Edition, “Using Iterative DO, DO WHILE, and DO UNTIL Statements.” The %DO statement is covered in Section 7.7, “Using %DO Loops in Macros.”

For more information about these books, explore the following links to the SAS website:

The Little SAS Book, Sixth Edition

Exercises and Projects for The Little SAS Book, Sixth Edition

Test your SAS skills with the newest edition of Exercises and Projects for The Little SAS Book was published on SAS Users.

2月 142020
 

In honor of Valentine’s day, we thought it would be fitting to present an excerpt from a paper about the LIKE operator because when you like something a lot, it may lead to love! If you want more, you can read the full paper “Like, Learn to Love SAS® Like” by Louise Hadden, which won best paper at WUSS 2019.

Introduction

SAS provides numerous time- and angst-saving techniques to make the SAS programmer’s life easier. Among those techniques are the ability to search and select data using SAS functions and operators in the data step and PROC SQL, as well as the ability to join data sets based on matches at various levels. This paper explores how LIKE is featured in each one of these techniques and is suitable for all SAS practitioners. I hope that LIKE will become part of your SAS toolbox, too.

Smooth Operators

SAS operators are used to perform a number of functions: arithmetic calculations, comparing or selecting variable values, or logical operations. Operators are loosely grouped as “prefix” (for example a sign before a variable) or “infix” which generally perform an operation BETWEEN two variables. Arithmetic operations using SAS operators may include exponentiation (**), multiplication (*), and addition (+), among others. Comparison operators may include greater than (>, GT) and equals (=, EQ), among others. Logical, or Boolean, operators include such operands as || or !!, AND, and OR, and serve the purpose of grouping SAS operations. Some operations that are performed by SAS operators have been formalized in functions. A good example of this is the concatenation operators (|| and !!) and the more powerful CAT functions which perform similar, but not identical, operations. LIKE operators are most frequently utilized in the DATA step and PROC SQL via a DATA step.

There is a category of SAS operators that act as comparison operators under special circumstances, generally in where statements in PROC SQL and the data step (and DS2) and subsetting if statements in the data step. These operators include the LIKE operator and the SOUNDS LIKE operator, as well as the CONTAINS and the SAME-AND operators. It is beyond the scope of this short paper to discuss all the smooth operators, but they are definitely worth a look.

LIKE Operator

Character operators are frequently used for “pattern matching,” that is, evaluating whether a variable value equals, does not equal, or sounds like a specified value or pattern. The LIKE operator is a case-sensitive character operator that employs two special “wildcard” characters to specify a pattern: the percent sign (%) indicates any number of characters in a pattern, while the underscore (_) indicates the presence of a single character per underscore in a pattern. The LIKE operator is akin to the GREP utility available on Unix/Linux systems in terms of its ability to search strings.

The LIKE operator also includes an escape routine in case you need to use a string that includes a comparison operator such as the carat, the underscore, or the percent sign, etc. An example of the escape routine syntax, when looking for a string containing a percent sign, is:

where yourvar like ‘100%’ escape ‘%’;

Additionally, SAS practitioners can use the NOT LIKE operator to select variables WITHOUT a given pattern. Please note that the LIKE statement is case-sensitive. You can use the UPCASE, LOWCASE, or PROPCASE functions to adjust input strings prior to using the LIKE statement. You may string multiple LIKE statements together with the AND or OR operators.

SOUNDS LIKE Operator

The LIKE operator, described above, searches the actual spelling of operands to make a comparison. The SOUNDS LIKE operator uses phonetic values to determine whether character strings match a given pattern. As with the LIKE operator, the SOUNDS LIKE operator is useful for when there are misspellings and similar sounding names in strings to be compared. The SOUNDS LIKE operator is denoted with a short cut ‘-*’. SOUNDS LIKE is based on SAS’s SOUNDEX algorithm. Strings are encoded by retaining the original first column, stripping all letters that are or act as vowels (A, E, H, I, O, U, W, Y), and then assigning numbers to groups: 1 includes B, F, P, and V; 2 includes C, G, J, K, Q, S, X, Z; 3 includes D and T; 4 includes L; 5 includes M and N; and 6 includes R. “Tristn” therefore becomes T6235, as does Tristan, Tristen, Tristian, and Tristin.

For more on the SOUNDS LIKE operator, please read the documentation.

Joins with the LIKE Operator

It is possible to select records with the LIKE operator in PROC SQL with a WHERE statement, including with joins. For example, the code below selects records from the SASHELP.ZIPCODE file that are in the state of Massachusetts and are for a city that begins with “SPR”.

proc sql;
    CREATE TABLE TEMP1 AS
    select
        a.City ,
        a.countynm  , a.city2 ,
         a.statename , a.statename2
    from sashelp.zipcode as a
    where upcase(a.city) like 'SPR%' and 
upcase(a.statename)='MASSACHUSETTS' ; 
quit;

The test print of table TEMP1 shows only cases for Springfield, Massachusetts.

The code below joins SASHELP.ZIPCODE and a copy of the same file with a renamed key column (city --> geocity), again selecting records for the join that are in the state of Massachusetts and are for a city that begins with “SPR”.

proc sql;
    CREATE TABLE TEMP2 AS
    select
        a.City , b.geocity, 
        a.countynm  ,
        a.statename , b.statecode, 
        a.x, a.y
    from sashelp.zipcode as a, zipcode2 as b
    where a.city = b.geocity and upcase(a.city) like 'SPR%' and b.statecode
= 'MA' ;
quit;

The test print of table TEMP2 shows only cases for Springfield, Massachusetts with additional variables from the joined file.

The LIKE “Condition”

The LIKE operator is sometimes referred to as a “condition,” generally in reference to character comparisons where the prefix of a string is specified in a search. LIKE “conditions” are restricted to the DATA step because the colon modifier is not supported in PROC SQL. The syntax for the LIKE “condition” is:

where firstname=: ‘Tr’;

This statement would select all first names in Table 2 above. To accomplish the same goal in PROC SQL, the LIKE operator can be used with a trailing % in a where statement.

Conclusion

SAS provides practitioners with several useful techniques using LIKE statements including the smooth LIKE operator/condition in both the DATA step and PROC SQL. There’s definitely reason to like LIKE in SAS programming.

To learn more about SAS Press, check out our up-and-coming titles, and to receive exclusive discounts make sure to subscribe to our newsletter.

References

    Gilsen, Bruce. September 2001. “SAS® Program Efficiency for Beginners.” Proceedings of the Northeast SAS Users Group Conference, Baltimore, MD.

    Roesch, Amanda. September 2011. “Matching Data Using Sounds-Like Operators and SAS® Compare Functions.” Proceedings of the Northeast SAS Users Group Conference, Portland, ME.

    Shankar, Charu. June 2019. “The Shape of SAS® Code.” Proceedings of PharmaSUG 2019 Conference, Philadelphia, PA.

Learn to Love SAS LIKE was published on SAS Users.