Nelson Grajales

12月 122019
 

Parts 1 and 2 of this blog post discussed exploring and preparing your data using SASPy. To recap, Part 1 discussed how to explore data using the SASPy interface with Python. Part 2 continued with an explanation of how to prepare your data to use it with a machine-learning model. This final installment continues the discussion about preparation by explaining techniques for normalizing your numerical data and one-hot encoding categorical variables.

Normalizing your data

Some of the numerical features in the examples from parts 1 and 2 have different ranges. The variable Age spans from 0-100 and Hours_per_week from 0-80. These ranges affect the calculation of each feature when you apply them to a supervised learner. To ensure the equal treatement of each feature, you need to scale the numerical features.

The following example uses the SAS STDIZE procedure to scale the numerical features. PROC STDIZE is not a standard procedure available in the SASPy library.  However, the good news is you can add any SAS procedure to SASPy! This feature enables Python users to become a part of the SAS community by contributing to the SASPy library and giving them the chance to use the vast number of powerful SAS procedures. To add PROC STDIZE to SASPy, see the instructions in the blog post Adding SAS procedures to the SASPy interface to Python.

After you add the STDIZE to SASPy, run the following code to scale the numerical features. The resulting values will be between 0 and 1.

# Creating a SASPy stat objcect
stat = sas.sasstat()
# Use the stdize function that was added to SASPy to scale our features
stat_result = stat.stdize(data=cen_data_logTransform,
                         procopts = 'method=range out=Sasuser.afternorm',
			 var = 'age education_num capital_gain capital_loss hours_per_week')

To use the STDIZE procedure in SASPy we need to specify the method and the output data set in the statement options. For this we use the "procopts" option and we specify range as our method and our "out" option to a new SAS data set, afternorm.

After running the STDIZE procedure we assign the new data set into a SAS data object.

norm_data = sas.sasdata('afternorm', libref='SASuser')

Now let's verify if we were successful in transforming our numerical features

norm_data.head(obs=5)


 

 

 

 

 
The output looks great! You have normalized the numerical features. So, it's time to tackle the last data-preparation step.

One-Hot Encoding

Now that you have adjusted the numerical features, what do you do with the categorical features? The categories Relationship, Race, Sex, and so on are in string format. Statistical models cannot interpret these values, so you need to transform the values from strings into a numerical representation. The one-hot encoding process provides the transformation you need for your data.

To use one-hot encoding, use the LOGISTIC procedure from the SASPy Stat class. SASPy natively includes the LOGISTIC procedure, so you can go straight to coding. To generate the syntax for the code below, I followed the instructions from Usage Note 23217: Saving the coded design matrix of a model to a data set.

stat_proc_log = stat.logistic(data=norm_data, procopts='outdesign=SASuser.test1 outdesignonly',
            cls = "workclass education_level marital_status occupation relationship race sex native_country / param=glm",
	    model = "age = workclass education_level marital_status occupation relationship race sex native_country / noint")

To view the results from this code, create a SAS data object from the newly created data set, as shown in this example:

one_hot_data = sas.sasdata('test1', libref='SASuser')
display(one_hot_data.head(obs=5))

The output:

 

 

 

 

Our data was successfully one-hot encoded! For future reference, due to SAS’ analytical power, this step is not required. When including a categorical feature in a class statement the procedure automatically generates a design matrix with the one-hot encoded feature. For more information, I recommend reading this post about different ways to create a design matrix in SAS.

Finally

You made it to the end of the journey! I hope everyone who reads these blogs can see the value that SASPy brings to the machine-learning community. Give SASPy  a try, and you'll see the power it can bring to your SAS solutions.

Stay curious, keep learning, and (most of all) continue innovating.

Machine Learning with SASPy: Exploring and Preparing your data - Part 3 was published on SAS Users.

12月 122019
 

Bringing the power of SAS to your Python scripts can be a game changer. An easy way to do that is by using SASPy, a Python interface to SAS allowing Python developers to use SAS® procedures within Python. However, not all SAS procedures are included in the SASPy library. So, what do you do if you want to use those excluded procedures? Easy! The SASPy library contains functionality enabling you to add SAS procedures to the SASPy library. In this post, I'll explain the process.

The basics for adding procedures are covered in the Contributing new methods section in the SASPy documentation. To further assist you, this post expands upon the steps, providing step-by-step details for adding the STDIZE procedure to SASPy. For a hands-on application of the use case refer the blog post Machine Learning with SASPy: Exploring and Preparing your data - Part 3.

This is your chance to contribute to the project! Whereas, you can choose to follow the steps below as a one-off solution, you also have the choice to share your work and incorporate it in the SASPy repository.

Prerequisites

Before you add a procedure to SASPy, you need to perform these prerequisite steps:

  1. Identify the SAS product associated with the procedure you want to add, e.g. SAS/STAT, SAS/ETS, SAS Enterprise Miner, etc.
  2. Locate the SASPy file (for example, sasstat.py, sasets.py, and so on) corresponding to the product from step 1.
  3. Ensure you have a current license for the SAS product in question.

Adding a SAS procedure to SASPy

SASPy utilizes Python Decorators to generate the code for adding SAS procedures. Roughly, the process is:

  1. define the procedure
  2. generate the code to add
  3. add the code to the proper SASPy file
  4. (optional)create a pull request to add the procedure to the SASPy repository

Below we'll walk through each step in detail.

Create a set of valid statements

Start a new python session with Jupyter and create a list of valid arguments for the chosen procedure. You determine the arguments for the procedure by searching for your procedure in the appropriate SAS documentation. For example, the PROC STDIZE arguments are documented in the SAS/STAT® 15.1 User's Guide, in the The STDIZE Procedure section, with the contents:

The STDIZE procedure

 
 
 
 
 
 
 
 
 
 

For example, I submitted the following command to create a set of valid arguments for PROC STDIZE:

lset = {'STDIZE', 'BY', 'FREQ', 'LOCATION', 'SCALE', 'VAR', 'WEIGHT'}

Call the doc_convert method

The doc_convert method takes two arguments: a list of valid statements (method_stmt) and the procedure name (stdize).

import saspy
 
print(saspy.sasdecorator.procDecorator.doc_convert(lset, 'STDIZE')['method_stmt'])
print(saspy.sasdecorator.procDecorator.doc_convert(lset, 'STDIZE')['markup_stmt'])

The command generates the method call and the docstring markup like the following:

def STDIZE(self, data: [SASdata', str] = None,
   by: [str, list] = None,
   location: str = None,
   scale: str = None,
   stdize: str = None,
   var: str = None,
   weight: str = None,
   procopts: str = None,
   stmtpassthrough: str = None,
   **kwargs: dict) -> 'SASresults':
   Python method to call the STDIZE procedure.
 
   Documentation link:
 
   :param data: SASdata object or string. This parameter is required.
   :parm by: The by variable can be a string or list type.
   :parm freq: The freq variable can only be a string type.
   :parm location: The location variable can only be a string type.
   :parm scale: The scale variable can only be a string type.
   :parm stdize: The stdize variable can be a string type.
   :parm var: The var variable can only be a string type.
   :parm weight: The weight variable can be a string type.
   :parm procopts: The procopts variable is a generic option avaiable for advanced use It can only be a string type.
   :parm stmtpassthrough: The stmtpassthrough variable is a generic option available for advanced use. It can only be a string type.
   :return: SAS Result Object

Update SASPy product file

We'll take the output and add it to the appropriate product file (sasstat.py in this case). When you open this file, be sure to open it with administrative privileges so you can save the changes. Prior to adding the code to the product file, perform the following tasks:

  1. add @procDecorator.proc_decorator({}) before the function definition
  2. add the proper documentation link from the SAS Programming Documentation site
  3. add triple quotes ("""") to comment out the second section of code
  4. include any additional details others might find helpful

The following output shows the final code to add to the sasstat.py file:

@procDecorator.proc_decorator({})
def STDIZE(self, data: [SASdata', str] = None,
   by: [str, list] = None,
   location: str = None,
   scale: str = None,
   stdize: str = None,
   var: str = None,
   weight: str = None,
   procopts: str = None,
   stmtpassthrough: str = None,
   **kwargs: dict) -> 'SASresults':
   """
   Python method to call the STDIZE procedure.
 
   Documentation link:
   https://go.documentation.sas.com/?cdcId=pgmsascdc&cdcVersion=9.4_3.4&docsetId=statug&docsetTarget=statug_stdize_toc.htm&locale=en
   :param data: SASdata object or string. This parameter is required.
   :parm by: The by variable can be a string or list type.
   :parm freq: The freq variable can only be a string type.
   :parm location: The location variable can only be a string type.
   :parm scale: The scale variable can only be a string type.
   :parm stdize: The stdize variable can be a string type.
   :parm var: The var variable can only be a string type.
   :parm weight: The weight variable can be a string type.
   :parm procopts: The procopts variable is a generic option avaiable for advanced use It can only be a string type.
   :parm stmtpassthrough: The stmtpassthrough variable is a generic option available for advanced use. It can only be a string type.
   :return: SAS Result Object
   """

Update sasdecorator file with the new method

Alter the sasdecorator.py file by adding stdize in the code on line 29, as shown below.

if proc in ['hplogistic', 'hpreg', 'stdize']:

Important: The update to the sasdecorator file is only a requirement when you add a procedure with no plot options. The sasstat.py library assumes all procedures produce plots. However, PROC STDIZE does not include them. So, you should perform this step ONLY when your procedure does not include plot options. This will more than likely change in a future release, so please follow the Github page for any updates.

Document a test for your function

Make sure you write at least one test for the procedure. Then, add the test to the appropriate testing file.

Finally

Congratulations! All done. You now have the knowledge to add even more procedures in the future.

After you add your procedure, I highly recommend you contribute your procedure to the SASPy GitHub library. To contribute, follow the outlined instructions on the Contributing Rules GitHub page.

Adding SAS® procedures to the SASPy interface to Python was published on SAS Users.

9月 282019
 

This article continues a series that began with Machine learning with SASPy: Exploring and preparing your data (part 1). Part 1 showed you how to explore data using SASPy with Python. Here, in part 2, you will learn how to begin to prepare your data to use it within a machine-learning model.

Review part 1 if needed and ensure you still have the ADULT data set ready to use. (The data set is available from the UCI Machine Learning Repository.) If not, take some time to download and explore the data again, as described in part 1.

Preparing your data

Preparing data is a necessary step to perform before applying the data toward a model. There are string values, skewed data, and missing data points to consider. In the data set, be sure to clear missing values, so you can jump into other methods.

For this exercise, you will explore how to transform skewed features using SASPy and Pandas.

First, you must separate the income data from the data set, because the income feature will later become your target variable to model.

Drop the income data and turn the pandas data frame back into a SAS data object, with the following code:

Now, let's take a second look at the numerical features. You will use SASPy to create a histogram of all numerical features. Typically, the Matplotlib library is used, but SASPy provides great opportunities to visualize the data.

The following graphs represent the expected output.

Taking a look at the numerical features, two values stick out. CAPITAL_GAIN and CAPITAL_LOSS are highly skewed. Highly skewed features can affect your model, as most models try to maintain a normally distributed curve. To fix this, you will apply a logarithmic transformation using pandas and then visualize the change using SASPy.

Transforming skewed features

First, you need to change the SAS data object back into a pandas data frame and assign the skewed features to a list variable:

Then, use pandas to apply the logarithmic transformation and convert the pandas data frame back into a SAS data object:

Display transformed data

Now, you are ready to visualize these changes using SASPy. In the previous section, you used histograms to display the data. To display this transformation, you will use the SASPy SASUTIL class. Specifically, you will use a procedure typically used in SAS, the UNIVARIATE procedure.

To use the SASUTIL class with SASPy, you first need to create a Python object that uses the SASUTIL class:

 

Now, use the univariate function from SASPy:

 

Using the UNIVARIATE procedure, you can set axis limits to the output histograms so that you can see the data in a clearer format. After running the selected code, you can use the dir() function to verify successful submission:

 

Here is the output:

 

 

 

The function calculates various descriptive statistics and plots. However, for this example, the focus is on the histogram.

 

Here are the results:

Wrapping up

You have now transformed the skewed data. Pandas applied the logarithmic transformation and SASPy displayed the histograms.

Up next

In the next and final article of this series, you will continue preparing your data by normalizing numerical features and one-hot encoding categorical features.

Machine learning with SASPy: Exploring and preparing your data (part 2) was published on SAS Users.

9月 102019
 

SASPy is a powerful Python library that interfaces with SAS and can help with your machine-learning solutions. SASPy was created for Python programmers to leverage the power of SAS within their Python scripts. If you are not familiar with SASPy, see the following resources:

This blog post shows you how powerful SASPy can be. SASPy helps you with providing visuals and descriptive statistics quickly and accurately. To demonstrate this capability, let’s explore and prepare your data using SASPy.

Prerequisites

To get started, here is what you need:

  • The Census Income data set from the University of California Irvine’s Machine Learning Repository
    • Download the adult.data data set from the data folder.
    • Remove the missing values prior to exploring and preparing.
  • SAS®9.4 or SAS® Viya® 3.1 or any later variations of these
  • Jupyter Notebook
  • SASPy (To install SASPy, refer to the installation and configuration documentation.)

After verifying you have completed the above requirements, you can start your Jupyter Notebook and begin coding using SASPy.

Let's start by importing libraries we will use in this example

  1. Import the libraries:
  2. Start your SAS session. Use the command below to establish a connection.

A "SAS Connection established" message returns once connected. This example uses a local connection to SAS. However, you can use an STDIO connection or an IOM connection to SAS if you prefer. For more information, see SAS Configuration.

  1. Read in your data set. You have two options: You can either read in the data set using pandas and then read the data into a SAS data object or you can read it directly into a SAS data object. This example shows reading the data directly into a SAS data object.

To access existing data in a SAS session, use the SAS data object. A SAS data object can be used to do the following:

  • Create various graphs such as histograms, scatter plots, heatmaps, and so on.
  • Display descriptive statistics.
  • Transfer data in between a pandas data frame and a SAS data object.

The SAS data object is versatile. To view all of its capabilities, refer to the SAS Data Object documentation.

  1. Verify whether you successfully read in your data set:

Similar to pandas, SASPy has a head function to display data points. The only difference is when you are specifying how many data points you would like to see. You need to include “obs=n” if you are using a SAS data object.

Exploring your Data

SASPy provides many options to explore your data. This example uses a combination of SASPy functions and pandas to explore the data.

  1. Determine the number of records in your data:
  2. Determine how many individuals earn more or less than $50,000. For this step, this example uses pandas to demonstrate how you can switch between using SASPy and pandas seamlessly.
    1. Change your SAS data object into a pandas data frame:
    2. Use the value_counts function to determine how many individuals earn more or less than $50,000:
    3. View the percent of individuals whose income is greater than $50,000:                                               
    4. Display all your values to gain an understanding of your data:

As you can see from the output above, there are 30,162 records. About 7,508 individuals earn more than $50,000, and about 22,654 individuals make up to $50,000. From all the data, you can see about 25% percent of individuals earn more than $50,000.

  1. It is also important to look at your numerical features. Use SASPy to get a quick description of your data:

As you can see above, the table lists calculated values for the mean, median, and other valuable statistical values.

Exploring your data is just the first step in generating your machine-learning solutions. This blog post described how to generate basic statistical values and display output using SASPy, pandas, and Python. Part 2 and 3 of this blog post cover how to prepare your data using SASPy and to then apply it to a machine learning model.

For more information about the data set, see the UC Irvine Machine Learning Repository.

Machine learning with SASPy: Exploring and preparing your data (part 1) was published on SAS Users.

9月 102019
 

SASPy is a powerful Python library that interfaces with SAS and can help with your machine-learning solutions. SASPy was created for Python programmers to leverage the power of SAS within their Python scripts. If you are not familiar with SASPy, see the following resources:

This blog post shows you how powerful SASPy can be. SASPy helps you with providing visuals and descriptive statistics quickly and accurately. To demonstrate this capability, let’s explore and prepare your data using SASPy.

Prerequisites

To get started, here is what you need:

  • The Census Income data set from the University of California Irvine’s Machine Learning Repository
    • Download the adult.data data set from the data folder.
    • Remove the missing values prior to exploring and preparing.
  • SAS®9.4 or SAS® Viya® 3.1 or any later variations of these
  • Jupyter Notebook
  • SASPy (To install SASPy, refer to the installation and configuration documentation.)

After verifying you have completed the above requirements, you can start your Jupyter Notebook and begin coding using SASPy.

Let's start by importing libraries we will use in this example

  1. Import the libraries:
  2. Start your SAS session. Use the command below to establish a connection.

A "SAS Connection established" message returns once connected. This example uses a local connection to SAS. However, you can use an STDIO connection or an IOM connection to SAS if you prefer. For more information, see SAS Configuration.

  1. Read in your data set. You have two options: You can either read in the data set using pandas and then read the data into a SAS data object or you can read it directly into a SAS data object. This example shows reading the data directly into a SAS data object.

To access existing data in a SAS session, use the SAS data object. A SAS data object can be used to do the following:

  • Create various graphs such as histograms, scatter plots, heatmaps, and so on.
  • Display descriptive statistics.
  • Transfer data in between a pandas data frame and a SAS data object.

The SAS data object is versatile. To view all of its capabilities, refer to the SAS Data Object documentation.

  1. Verify whether you successfully read in your data set:

Similar to pandas, SASPy has a head function to display data points. The only difference is when you are specifying how many data points you would like to see. You need to include “obs=n” if you are using a SAS data object.

Exploring your Data

SASPy provides many options to explore your data. This example uses a combination of SASPy functions and pandas to explore the data.

  1. Determine the number of records in your data:
  2. Determine how many individuals earn more or less than $50,000. For this step, this example uses pandas to demonstrate how you can switch between using SASPy and pandas seamlessly.
    1. Change your SAS data object into a pandas data frame:
    2. Use the value_counts function to determine how many individuals earn more or less than $50,000:
    3. View the percent of individuals whose income is greater than $50,000:                                               
    4. Display all your values to gain an understanding of your data:

As you can see from the output above, there are 30,162 records. About 7,508 individuals earn more than $50,000, and about 22,654 individuals make up to $50,000. From all the data, you can see about 25% percent of individuals earn more than $50,000.

  1. It is also important to look at your numerical features. Use SASPy to get a quick description of your data:

As you can see above, the table lists calculated values for the mean, median, and other valuable statistical values.

Exploring your data is just the first step in generating your machine-learning solutions. This blog post described how to generate basic statistical values and display output using SASPy, pandas, and Python. Part 2 and 3 of this blog post cover how to prepare your data using SASPy and to then apply it to a machine learning model.

For more information about the data set, see the UC Irvine Machine Learning Repository.

Machine learning with SASPy: Exploring and preparing your data (part 1) was published on SAS Users.