
June 04, 2020
 

Learning never stops. When SAS had to change this year’s SAS Global Forum (SGF) to a virtual event, everyone was disappointed. I am, however, super excited about all of the papers and the stream of video releases over the last month (and I encourage you to register for the upcoming live event in June). For now, I made a pact with myself to read or watch one piece of SGF-related material per day. While I haven’t hit my goal 100%, I sure have learned a lot from all the reading and viewing. One particular paper, Using Jupyter to Boost Your Data Science Workflow, and its accompanying video by Hunter Glanz caught my eye this week. This post elaborates on one piece of his material: how to save Jupyter notebooks in other file formats.

Hunter’s story

Hunter is a professor who teaches multiple classes using SAS® University Edition, which comes equipped with an integrated Jupyter notebook. His focus is on SAS programming, and he requires his students to create notebooks to complete assignments; however, he wants to see the results of their work, not run their raw code. The notebooks include text, code, images, reports, etc. Let's explore how the students can transform their native notebooks into other, more consumable formats. We'll also discuss other use cases in which SAS users may want to create a copy of their work from a notebook to, say, a .pdf, .html, or .py file, just to name a few.

What you’ll find here and what you won’t

This post will not cover how to use Jupyter notebooks with SAS or other languages. There is a multitude of other resources, starting with Hunter’s work, to explore those topics. This post will cover how to produce other file formats from notebooks in SAS, Python, and R. I’ll outline multiple methods: a point-and-click option, inline code written directly in the notebook, and finally the command line.

Many of the processes discussed below are language agnostic. When there are distinct differences, I’ll make a note.

A LITTLE about Jupyter notebooks

A Jupyter notebook is a web application that lets users run commands, view responses, include images, and write inline text, all in one document. The all-encompassing notebook enables users to tell a complete story without having to use multiple apps. Jupyter notebooks were originally created for the Python language and are now available for many other programming languages. JupyterLab, the notebook's cousin, is a later, more sophisticated version, but for this writing, we’ll focus on the notebook; the functionality for this use case is similar.

Where do we start? First, we need to install the notebook, unless you're working in SAS University Edition, which includes it.

Install Anaconda

The easiest way to get started with the Jupyter Notebook App is by installing Anaconda (this will also install JupyterLab). Anaconda is an open source distribution of Python for scientific computing, with tools for package management and deployment. Out of the box, the notebook from the Anaconda install includes the Python kernel. For use with other languages, you need to install additional kernels.

Install additional language kernels

In this post, we’ll focus on Python, R, and SAS. The Python kernel is readily available after the Anaconda install. For the R language, follow the instructions on the GitHub R kernel repository. I also found the instructions in How to Install R in Jupyter with IRKernel in 3 Steps quite straightforward and useful. Further, here are the official install instructions for the SAS kernel and a supporting SAS Community Library article.

With the additional kernels in place, you should see all available languages when creating a new notebook, as pictured below.

Available kernels list

File conversion methods

Now we’re ready to dive into the export process. Let’s look at three approaches in detail.

Download (Export) option

Once you’ve opened your notebook and run the code, select File -> Download As (appears as Export Notebook As… in JupyterLab).

"Download As"  option in Jupyter notebook

"Export Notebook As" option in JupyterLab

HTML format output

Notice the list of options, some more familiar than others. Select the HTML option and Jupyter converts your entire notebook (text, commands, figures, images, etc.) into a file with a .html extension. The resulting file opens in a browser and displays as expected. See the images below for a comparison of the .ipynb and .html files.

SAS code in a Jupyter notebook

Corresponding SAS code notebook in html form

SAS (aka script) format output

Using the Download As -> SAS option produces a .sas file, depicted in Enterprise Guide below. Note: when using a different kernel, say Python or R, you have the option to save in that language's script format.

SAS code saved from a notebook displayed in Enterprise Guide

One thing to note here is that only the code appears in the output file. The markdown text, figures, etc., from the original notebook cannot be displayed in EG, so they are omitted.

PDF format output

There is one special case (two, actually) I need to mention. If you want to create PDF (or LaTeX, which is used to create PDF files) output of your notebook, you need additional software. For converting to PDF, Jupyter uses the TeX document preparation ecosystem. If you attempt to download without TeX, the conversion fails and you get a message to download TeX. Depending on your OS, the TeX software will have a different name, but it will include TeX in the name. You may also, in certain instances, need Pandoc for certain formats. I suggest installing both to be safe: install TeX from its download site, and do the same for Pandoc.

Once I’ve completed creating the files, the new files appear in my File Explorer.

New SAS file in Windows File Explorer

Cheaters may never win, but they can create a PDF quickly

Well, now that we’ve covered how to properly convert and download a .pdf file, there may be an easier way. While in the notebook, press the Ctrl + P keys. In the Print window, select the Save to PDF option, choose a file destination and save. It works, but I felt less accomplished afterward. Your choice.

Inline code option

Point-and-click is a perfectly valid option, but let’s say you want to introduce automation into your world. The jupyter nbconvert command provides the capability to transform the current notebook into any format mentioned earlier. All you need to do is run the command with a couple of parameters from within the notebook.

In Python, you can shell out to the nbconvert command with the os library's system function. The following lines show the general structure.

import os

# Shell out to nbconvert; the call returns 0 on success
os.system("jupyter nbconvert myNotebook.ipynb --to html")

An example with Python

The example below is from a Python notebook. The "0" return code represents success.

Code to create a PDF file from a Python notebook
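
The pictured cell follows the same pattern with the --to pdf option. As a minimal sketch (the notebook name is a placeholder):

import os

# Convert the notebook to PDF; os.system returns the shell's exit status
rc = os.system("jupyter nbconvert myNotebook.ipynb --to pdf")
print(rc)  # 0 indicates success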

An example with SAS

As you see with the Python example, the code is just that: Python. Generally, you cannot run Python code in a Jupyter notebook running the SAS kernel. Luckily, we have Jupyter magics, which allow us to write and run Python code inside a SAS kernel. The magics are a two-way street: you can also run SAS code inside a Python shell. See the SASPy documentation for more information.

The code below is from a SAS notebook, but is running Python code (triggered by the %%python magic).

Code to create a PDF file from a SAS notebook
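
The pictured cell looks roughly like the following sketch (the notebook file name is assumed from the PDF it produces):

%%python
# The %%python cell magic runs Python inside the SAS kernel
import os
os.system("jupyter nbconvert EmployeeChurnSASCode.ipynb --to pdf")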

The EmployeeChurnSASCode.pdf file is created in the same directory as the original notebook file:

Jupyter file system display in a web browser

An example with R

Things are fairly straightforward in an R notebook. However, you must install and load the nbconvert package.

Code to create an HTML file from an R notebook

The first line installs the package, the second line loads the package, and the third actually does the conversion. Double-check your paths if you run into trouble.

The command line

The last method we look at is the command line. This option is the same regardless of the language with which you’re working. The possibilities are endless for this option. You could include it in a script, use it in code to run and display in a web app, or create the file and email it to a colleague. The examples below were all run on a Windows OS machine using the Anaconda command prompt.

An example with a SAS notebook

Convert sasNotebook.ipynb to a SAS file.

>> ls -la |grep sasNotebook
-rw-r--r-- 1 jofurb 1049089  448185 May 29 14:34 sasNotebook.ipynb
 
>> jupyter nbconvert --to script sasNotebook.ipynb
[NbConvertApp] Converting notebook sasNotebook.ipynb to script
[NbConvertApp] Writing 351 bytes to sasNotebook.sas
 
>> ls -la |grep sasNotebook
-rw-r--r-- 1 jofurb 1049089  448185 May 29 14:34 sasNotebook.ipynb
-rw-r--r-- 1 jofurb 1049089     369 May 29 14:57 sasNotebook.sas

An example with a Python notebook

Convert 1_load_data.ipynb to a PDF file

>> ls -la |grep 1_load
-rw-r--r-- 1 jofurb 1049089   6004 May 29 07:37 1_load_data.ipynb
 
>> jupyter nbconvert 1_load_data.ipynb --to pdf
[NbConvertApp] Converting notebook 1_load_data.ipynb to pdf
[NbConvertApp] Writing 27341 bytes to .\notebook.tex
[NbConvertApp] Building PDF
[NbConvertApp] Running xelatex 3 times: ['xelatex', '.\\notebook.tex', '-quiet']
[NbConvertApp] Running bibtex 1 time: ['bibtex', '.\\notebook']
[NbConvertApp] WARNING | b had problems, most likely because there were no citations
[NbConvertApp] PDF successfully created
[NbConvertApp] Writing 32957 bytes to 1_load_data.pdf
 
>> ls -la |grep 1_load
-rw-r--r-- 1 jofurb 1049089   6004 May 29 07:37 1_load_data.ipynb
-rw-r--r-- 1 jofurb 1049089  32957 May 29 15:23 1_load_data.pdf

An example with an R notebook

Convert HR_R.ipynb to an R file.

>> ls -la | grep HR
-rw-r--r-- 1 jofurb 1049089   5253 Nov 19  2019 HR_R.ipynb
 
>> jupyter nbconvert HR_R.ipynb --to script
[NbConvertApp] Converting notebook HR_R.ipynb to script
[NbConvertApp] Writing 981 bytes to HR_R.r
 
>> ls -la | grep HR
-rw-r--r-- 1 jofurb 1049089   5253 Nov 19  2019 HR_R.ipynb
-rw-r--r-- 1 jofurb 1049089   1021 May 29 15:44 HR_R.r

Wrapping things up

Whether you’re a student of Hunter’s, an analyst creating a report, or a data scientist monitoring streaming data models, you may need to transform your work from a Jupyter notebook into a more consumable asset. Regardless of the language of your notebook, you have multiple choices for saving your work, including menu options, inline code, and the command line. This is a great way to show off your creation in a very consumable form.

How to save Jupyter notebooks in assorted formats was published on SAS Users.

April 28, 2020
 

With increasing interest in Continuous Integration/Continuous Delivery (CI/CD), many SAS Users want to know what can be done for Visual Analytics reports. In this article, I will explain how to use Python and SAS Viya REST APIs to extract a report from a SAS Viya environment and import it into another environment. For those trying to understand the secret behind CI/CD and DevOps, here it is:

What you do tomorrow will be better than what you did yesterday because you gained more experience today!

About Continuous Integration/Continuous Delivery

If you apply this principle to code development, it means that you may update your code every day, test it, deploy it, and start the process all over again the following day. You might feel like Sisyphus rolling his boulder around for eternity. This is where CI/CD can help. In the code deployment process, you have many recurrent tasks that can be automated to reduce repetitiveness and boredom. CI/CD is a paradigm where improvements to code are pushed, tested, validated, and deployed to production in a continuous automated manner.

About ModelOps and AnalyticOps

I hope you now have a better understanding of what CI/CD is. You might now wonder how CI/CD relates to Visual Analytics reports, models, etc. With the success of DevOps which describes the Development Operations for software development, companies have moved to the CI/CD paradigms for operations not related to software development. This is why you hear about ModelOps, AnalyticOps... Wait a second, what is the difference between writing or generating code for a model or report versus writing code for software? You create a model, you test it, you validate it, and finally deploy it. You create a report, you test it, you validate it, and then you deploy it. Essentially, the processes are the same. This is why we apply CI/CD techniques to models, reports, and many other business-related tasks.

About tools

As with many methodologies like CI/CD, tools are developed to help users through the process. There are many tools available and some of them are more widely used. SAS offers SAS Workflow Manager for building workflows to help with ModelOps. Additionally, you have surely heard about Git and maybe even Jenkins.

  • Git is a version control system that is used by many companies with some popular implementations: GitHub, GitLab, BitBucket.
  • Jenkins is an automation program that is designed to build action flows to ease the CI/CD process.

With these tools, you have the needed architecture to start your CI/CD journey.

The steps

With a basic understanding of the CI/CD world, you might ask yourself: How does this apply to reports?

When designing a report in an environment managed by DevOps principles, here are the steps to deploy the report from a development environment to production:

  1. Design the report in the development environment.
  2. Validate the report with business stakeholders.
  3. Export the report from the development environment.
  4. Save a version of the report.
  5. Import the report into the test environment.
  6. Test the report in the test environment.
  7. Import the report into the production environment.
  8. Monitor the report usage and performance.

Note: In some companies, the development and test environments are the same. In this case, steps 4 to 6 are not required.

Walking through the steps, we identify steps 1 and 2 are manual. The other steps can be automated as part of the CI/CD process. I will not explain how to build a pipeline in Jenkins or other tools in this post. I will nevertheless provide you with the Python code to extract, import, and test a report.

The code

To perform the steps described in the previous section, you can use different techniques and programming languages. I’ve chosen to use Python and REST APIs. You might wonder why I've chosen these and not the sas-admin CLI or another language. The reason is quite simple: my Chinese zodiac sign is a snake!

Jokes aside, I opted for Python because:

  • It is easy to read and understand.
  • There is no need to compile.
  • It can run on different operating systems without adaptation.
  • The developer.sas.com site provides examples.

I’ve created four Python files:

  1. getReport.py: to extract the report from an environment.
  2. postReport.py: to create the report in an environment.
  3. testReport.py: to test the report in an environment.
  4. functions.py: contains the functions that are used in the other Python files.

All the files are available on GitHub.

The usage

Before you use the above code, you should configure your SAS Viya environment to enable access to REST APIs. Please follow the first step from this article to register your client.

You should also make sure the environment executing the Python code has Python 3 with the requests package installed. If you're missing the requests package, you will get an error message when executing the Python files.

Get the report

Now that your environment is set up, you can execute the getReport code to extract the report content from your source environment. Below is the command line, with arguments, to execute the code:

python3 getReport.py -a myAdmin -p myAdminPW -sn http://va85.gel.sas.com -an app -as appsecret -rl "/Users/sbxxab/My Folder" -rn CarsReport -o /tmp/CICD/

The parameters are:

  • a - the user that is used to connect to the SAS Viya environment.
  • p - the password of the user.
  • sn - the URL of the SAS Viya environment.
  • an - the name of the application that was defined to enable REST APIs.
  • as - the secret to access the REST APIs.
  • rl - the report location should be between quotes if it contains white spaces.
  • rn - the report name should be between quotes if it contains white spaces.
  • o - the output location that will be used.

The output location should ideally be a Git repository. This allows a commit as the next step in the CI/CD process to keep a history log of any report changes.

The output generated by getReport is a JSON file which has the following structure:

{
  "name": "CarsReport",
  "location": "/Users/sbxxab/My Folder",
  "content": {
    "@element": "SASReport",
    "xmlns": "http://www.sas.com/sasreportmodel/bird-4.2.4",
    "label": "CarsReport",
    "dateCreated": "2020-01-14T08:10:31Z",
    "createdApplicationName": "SAS Visual Analytics 8.5",
    "dateModified": "2020-02-17T15:33:13Z",
    "lastModifiedApplicationName": "SAS Visual Analytics 8.5",
    "createdVersion": "4.2.4",
    "createdLocale": "en",
    "nextUniqueNameIndex": 94,

    ...

  }
}

From the response:

  • The name value is the report name.
  • The location value is the folder location in the SAS Content.
  • The content value is the BIRD representation of the report.

The generated file is not a direct extraction of the report from the SAS environment. It combines multiple elements into a file containing the information required to import the report into the target environment.
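
For illustration, here is a rough sketch of the kind of calls getReport makes, using the requests package. The endpoint paths follow the SAS Viya Reports REST API, but the host, token, and report name are placeholders, and error handling is omitted:

import requests

# Placeholders -- substitute your environment's URL and OAuth access token
viya_url = "http://va85.gel.sas.com"
headers = {"Authorization": "Bearer <token>"}

# Find the report by name, then fetch its BIRD content
reports = requests.get(viya_url + "/reports/reports?filter=eq(name,'CarsReport')",
                       headers=headers).json()
report_id = reports["items"][0]["id"]
content = requests.get(viya_url + "/reports/reports/" + report_id + "/content",
                       headers=headers).json()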

Version the report

The next step in the process is to commit the file within the Git repository. You can use the git commit command followed by a git push to upload the content to a remote repository. Here are some examples:

# Saving the report in the local repository
git commit -m "Save CarsReport"
 
# Pushing the report to a remote repository after a local commit
git push https://gitlab.sas.com/myRepository.git

Promote the report

As soon as you have saved a version of the report, you can import the report into the target environment (production). This is where postReport comes into play. Here is a sample command line:

python3 postReport.py -a myAdmin -p myAdminPW -sn http://va85.gel.sas.com -an app -as appsecret -i /tmp/CICD/CarsReport.json

The parameters are:

  • a - the user that is used to connect to the SAS Viya environment.
  • p - the password of the user.
  • sn - the URL of the SAS Viya environment.
  • an - the name of the application that was defined to enable REST APIs.
  • as - the secret to access the REST APIs.
  • i - the input JSON file which contains the output of the getReport.py.

The execution of the code returns nothing except in the case of an error.

Testing the report

You now have access to the report in the production environment. A good practice is to test/validate access to the report. While testing manually in an interface is possible, it's best to automate. Sure, you could validate one report, but what if you had twenty? Use the testReport script to verify the report. Below are the command line arguments to execute the code:

python3 testReport.py -a myAdmin -p myAdminPW -sn http://va85.gel.sas.com -an app -as appsecret -i /tmp/CICD/CarsReport.json

The parameters are:

  • a - the user that is used to connect to the SAS Viya environment.
  • p - the password of the user.
  • sn - the URL of the SAS Viya environment.
  • an - the name of the application that was defined to enable REST APIs.
  • as - the secret to access the REST APIs.
  • i - the input JSON file which contains the output of the getReport.py

The testReport script connects to SAS Visual Analytics using REST APIs and generates an image of the first section of the report. The image generation process produces an SVG image of little interest in itself. The most compelling part of the test is the duration of the execution, which gives us an indication of the report's performance.
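
As a sketch of the timing idea only, you can wrap the rendering request in a timer. The endpoint below is a placeholder, since the exact image-rendering call depends on your Viya release:

import time
import requests

headers = {"Authorization": "Bearer <token>"}               # placeholder token
render_url = "http://va85.gel.sas.com/<rendering-endpoint>" # placeholder endpoint

start = time.perf_counter()
response = requests.get(render_url, headers=headers)
duration = round(time.perf_counter() - start, 3)
print(duration)  # seconds elapsed -- the figure recorded in the .perf file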

Throughout the CI/CD process, validation is important, and it is also interesting to get a benchmark for the report. This is why testReport populates a .perf file for the report. The response file has the following structure:

{
  "name": "CarsReport",
  "location": "/Users/sbxxab/My Folder",
  "performance": [
    {
      "testDate": "2020-03-26T08:55:37.296Z",
      "duration": 7.571
    },
    {
      "testDate": "2020-03-26T08:55:56.449Z",
      "duration": 8.288
    }
  ]
}

From the response:

  • The name value is the report name.
  • The location value is the folder location in the SAS Content.
  • The performance array contains the timestamp of each test and the time needed to generate the image.

The file updates for each execution of the testReport code.

Conclusion

CI/CD is important to SAS. Many SAS users need solutions to automate their deployment processes for code, reports, models, etc. This is not something to fear because SAS Viya has the tools required to integrate into CI/CD pipelines. As you have seen in this post, we write code to ease the integration. Even if you don’t have CI/CD tools like Jenkins to orchestrate the deployment, you can execute the different Python files to promote content from one environment to another and test the deployed content.

If you want more information about ModelOps, I recommend having a look at this series.

Continuous Integration/Continuous Delivery – Using Python and REST APIs for SAS Visual Analytics reports was published on SAS Users.

April 15, 2020
 

Welcome to the first post for the Getting Started with Python Integration to SAS Viya series! With the popularity of the Python programming language for data analysis and SAS Viya's ability to integrate with Python, I thought, why not create tutorials for users integrating the two?

To begin the series I want to talk about the most important step, making a connection to SAS Viya through your favorite Python client. In the examples I will use a Jupyter notebook, but the method would be the same on any Python client interface. Before I begin diving into code and connections, I want to provide a brief, high level overview of SAS Viya.

What is SAS Viya?

SAS Viya extends the SAS Platform, operates in the cloud (as well as in hybrid and on-prem solutions) and is open source-friendly. For better performance, SAS Viya operates on in-memory data, removing the read/write data transfer overhead. Data processing and analytic procedures use the SAS Cloud Analytic Services (CAS), the engine behind SAS Viya. Further, it enables everyone in an organization to collaborate and work with data by providing a variety of products & solutions running in CAS.

What exactly is CAS? Let's consider the image below.

CAS distributes heavy workloads among multiple computing instances for fast and efficient processing. The environment consists of a controller and a set of worker nodes allowing data storage and processing. Let's take a simple example. A table with 300GB of data is uploaded to the CAS environment. The controller distributes 100GB chunks of the data to each of the worker nodes. The data is loaded into memory on each of the worker nodes, and each node processes its 100GB of data.

Additionally, CAS uses modern, dynamic algorithms to rapidly perform analytical processing on data of any size. Throughout the series I will refer to the CAS distributed environment as the CAS server, or simply CAS.

For more information about the Cloud Analytic Services architecture visit SAS® Cloud Analytic Services 3.5: Fundamentals.

Furthermore, SAS Viya is open. Business analysts and data scientists can explore, prepare and manage data to provide insights, create visualizations or analytical models using the SAS programming language or a variety of open source languages like Python, R, Lua, or Java. Because of this, programmers can easily process data in CAS, using a language of their choice.

Now that the high level overview of SAS Viya is out of the way, let's discuss the why.

Why do I want to integrate Python to SAS Viya?

When working with data on your local computer you are typically constrained to your computer's resources. For example, when you are working with smaller data (generally think around 1GB) you most likely will not have any resource issues. Alternatively, what if your data is 100GB? A terabyte? What do you do then?

The solution is simple: integrate Python with SAS Viya! At the highest level, the SAS Viya architecture is meant to work with large data your client machine cannot handle. You can load your large data into CAS, which distributes chunks of the data to each of the worker nodes. The data is loaded into memory on the worker nodes and processed using Python. Sounds great, right? I haven't even told you the best part.

Many of the familiar Pandas methods are available through the Scripting Wrapper for Analytics Transfer (SWAT) package. The SWAT package provides functionality and syntax with the feel of open source code, but it simply wraps CAS actions to send to the server. CAS actions are small units of work the CAS server understands. They load and transform data, compute statistics, perform analytics and create output.

Compare the simple code samples below. You see the commands written in SAS, Python, or R native code. When sent from a client to CAS, SWAT translates the command, performs the action on the CAS server, and returns a response to the client.
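
The original post shows this comparison as an image. As a rough Python-side sketch of the idea, assuming a CAS table named cars is already loaded:

# pandas-like client syntax -- SWAT translates these into CAS actions
tbl = conn.CASTable('cars')
tbl.head()       # becomes a table.fetch action on the server
tbl.describe()   # becomes summary-type actions on the server

# ...or call a CAS action directly for the same server-side work
conn.simple.summary(table='cars')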

Another great feature is the ability to transfer summarized data from the CAS server back to the client machine. Having data on the client machine allows you to use any familiar Python packages like Pandas, Matplotlib, Seaborn, scikit-learn and many more!

So how do you get started?

Connecting to the CAS server

To connect to the CAS server, complete the following two steps:

  1. Install the SWAT package
  2. Make a Connection to the CAS Server

Install the SWAT package

First, we need to install the SWAT package. The SWAT package provides a means for Python users to submit code to the CAS server. The SWAT package translates the Python code to CAS actions the server understands. Install the SWAT package using the pip command as follows:

pip install swat

Or, if you are using Anaconda:

conda install -c sas-institute swat

For more information on installing the SWAT package for a specific platform and version, visit the sassoftware/python-swat GitHub page or the documentation.

Make a connection to the CAS server

Next, it's time to make a connection from your local client to the CAS server. This step is required and there are some variations in how this is implemented. I have prepared two methods to connect to CAS. The first method employs a username and password. The second uses token authentication with the use of environment variables. Explore other authentication methods from links provided in the Additional Resources section at the end of this post.

Connecting to CAS using a username and password

First we'll import the SWAT package.

import swat

Next we'll use the swat.CAS constructor to create a connection object to the CAS server. I name this new connection object conn. Feel free to name the object whatever you would like. I've seen it named s in some documentation. I prefer conn since it's my connection to CAS.

The CAS constructor requires the host name and the listening port of the CAS controller. We also need to authenticate -- that is, we need to tell CAS who we are. In this example, we use a username and password. If you do not know any of this connection information, speak with your administrator.

conn = swat.CAS(hostname="https//:server.demo.sas.com/cas-shared-default-http/",
              port=8777, username="student", password="MetaData0")

Let's quickly investigate the output of our conn object.

display(conn)
CAS('server.demo.sas.com', 8777, 'student', protocol='http', name='py-session-1', session='c8091979--483f-8cde-3c97a1f372a3')

We are now connected! Notice the conn object holds all of our connection information. We will use this connection object going forward.

Finally, let's look at the new object's type.

type(conn)
swat.cas.connection.CAS

Alert! I hope you noticed something. You should be thinking, "did we just type our authentication information in plain text?". The answer is yes. However, when you are using password authentication, you should NEVER type your information in plain text. In this example I am using a training machine and I want to demonstrate the simplest connection procedure. For real world scenarios, there are options like setting Python environment variables, creating an authinfo file, or creating a hashed version. My recommendation is to follow your company policy for authentication.
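
As a minimal sketch of the environment-variable approach (the variable names here are hypothetical; use whatever your site defines):

import os
import swat

# Read connection details from the environment instead of hard-coding them
conn = swat.CAS(hostname=os.environ.get('CAS_HOST'),
                port=int(os.environ.get('CAS_PORT', '8777')),
                username=os.environ.get('CAS_USER'),
                password=os.environ.get('CAS_PASSWORD'))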

Great! Everything looks ready and we are connected to CAS! Let's now look at a second method of connecting.

Connecting to CAS using token authentication and environment variables

In this example, I am using SAS Viya for Learners. SAS Viya for Learners is a SAS Viya implementation created for educators and their students. To access a Jupyter notebook when using SAS Viya for Learners, log in using your username and password and launch the application. Once in the application, look at the bottom of the screen in the orange box and select the Jupyter notebook icon, as seen below.

After Jupyter opens, select the Python kernel and begin.

Now it's time to make the connection to CAS. First, I'm going to import the SWAT and os packages. The os package allows me to obtain environment variable values necessary to connect to CAS.

import swat
import os

Next, I want to obtain the values of a few environment variables. SAS Viya for Learners comes configured with the needed environment variables. To start, I'm going to use the os.environ.get method to obtain the CASHOST, CASPORT and SAS_VIYA_TOKEN values and set them equal to new variables.

hostValue = os.environ.get('CASHOST')
portValue = os.environ.get('CASPORT')
passwordToken = os.environ.get('SAS_VIYA_TOKEN')

The variables hostValue, portValue and passwordToken all contain the necessary values to connect to CAS. Let's use the swat.CAS constructor again with the newly created variables.

conn = swat.CAS(hostname=hostValue, port=portValue, password=passwordToken)

Let's view the output of the conn object.

display(conn)
CAS('svflhost.demo.sas.com', 5570, email='joe.test@sas.com', protocol='cas', name='py-session-1', session='efff4323-a862-bd6e-beea737b4249')

That's it. We're connected!

Summary

In summary, the most important thing to know is that there are multiple ways to connect to the CAS server. For additional approaches, see links below and also Joe Furbee's blog post, Authentication to SAS Viya: a couple of approaches. If you're not sure of the proper method, I recommend you discuss how to connect to CAS with the administrator of your environment.

That was just the start of the series, but it's the most important step! You can't do anything else unless you first authenticate and connect. Once you are connected to the CAS server, the next question should be: what do I do now? In the next post, we will talk about sending commands to the CAS server in Working with CAS Actions and CASResults Objects.

Additional information and resources

Getting Started with Python Integration to SAS® Viya® - Part 1 - Making a Connection was published on SAS Users.

April 03, 2020
 

Whether you like it or not, Microsoft Excel is still a big hit in the data analysis world. From small to big customers, we still see it used for daily routines such as filtering, generating plots, calculating items in ad-hoc analysis, or even running statistical models. Whenever I talk to customers, there is always someone who asks: Can this be exported to Excel? Can we import data from Excel? Recently, other questions started to come up more often: Can we run Python within SAS? How do I allow my team to choose their language of preference? How do I provide an interface that looks like Microsoft Excel, but has SAS functionality?

Well… the good news is: we can answer YES to all of these questions. With the increase in the number of users performing analytics and the number of analytical tools available, it was clear to me that we would end up with lots of disparate processes. For a while this was a problem, but naturally, companies started developing ways to integrate these siloed teams.

At the beginning of the last decade, SAS developed SAS Add-in for Microsoft Office. The tool allows customers to run and embed SAS analytic capabilities inside Microsoft Office applications. More recently, SAS released a new version of PROC FCMP allowing users to write Python code and call it, as a function, inside SAS programs.

These advancements provide users the ability to run Python inside Excel. When I say inside, I really mean from within Excel's interface.

Before we jump to how we can do it, you may ask yourself: Why is this relevant to me? If I know SAS, I import the dataset and work with the data in SAS; If I know Python, I open a Jupyter notebook, import the data set and do my thing. Well… you are kind of right, but let me tell you a story.

The use case

Recently I worked with a customer and his business process was like this: I have a team of data scientists that is highly technical and knowledgeable in Python and SAS. Additionally, I have a team of analysts with little Python knowledge, but are always working with Excel to summarize data, create filters, graphs, etc. My teams need to communicate and collaborate. The normal chain of events follows:

  1. the Python team works on the data and exports the results to Excel
  2. the analytics team picks up the data set and runs SAS scripts and Excel formulas

This is a problem of inefficiency for the customer. Why can't the data scientist pass his or her code to the analyst to execute it on the same project without having to wait on the Python specialist to run the code?

I know this sounds overly complicated, but as my SAS colleague Mike Zizzi concludes in his post SAS or Python? Why not use both? Using Python functions inside SAS programs, at the end of the day what matters is that you get your work done. No matter which language, software or IDE you are using. I highly recommend Mike's article if you want a deep dive on what PROC FCMP has to offer.

The process

Let's walk through a data scoring scenario similar to my customer's story. Imagine I am a SAS programmer using Excel to explore data. I am also part of a team that uses Python, creating data scoring code based on analytical models developed in Python. My job is to score and analyze the data in Excel and pass the results to the service representative, so they can forward the response to the customer.

Importing data

The data set we'll work with in this example will help me analyze which customers are more likely to default on a loan. The data and all code used in this article are in the associated GitHub repository. The data dictionary for the data set is located here. First, I open the data set as seen on Sheet1 below in Excel.

Upload data to SAS

Before we jump to the coding part with SAS and Python, I need to send the data to SAS. We'll use the SAS add-in in Excel to send data to the local server. I cover the steps in detail below.

I start by selecting the cells I want to upload to the library.

Next, I move to the SAS tab and select the Copy to SAS Server task.

A popup shows up where I confirm the selected cells.

After I click OK, I configure column, table, naming and location options.

SAS uploads the table to the requested library. Additionally, a new worksheet with the library.table name displays the results. As you can see in the image below, the sheet created follows the name WORK.IMPORTED_DATA we set up in the previous step. This represents the table in the SAS library memory. Notice, however, we are still working in Excel.

The next step is to incorporate the code sent from my teammate.

The Python code

The code our colleague sent is pure Python. I don't necessarily have to understand the code details, just what it does. The Python code below imports a model, scores the input values, and returns a score. Note: if you're attempting this in your own environment, make sure to update the hmeq_model.sav file location in the # Import model pickle file section.

def score_predictions(CLAGE, CLNO, DEBTINC,DELINQ, DEROG, LOAN, MORTDUE, NINQ,VALUE, YOJ):
	"Output: scored"
	# Importing libraries
	import pandas as pd
	from sklearn.preprocessing import OneHotEncoder
	from sklearn.compose import ColumnTransformer
	from sklearn.externals import joblib
 
	# Create pandas dataframe with input vars
	dataset = pd.DataFrame({'CLAGE':CLAGE, 'CLNO':CLNO, 'DEBTINC':DEBTINC, 'DELINQ':DELINQ, 'DEROG':DEROG, 'LOAN':LOAN, 'MORTDUE':MORTDUE, 'NINQ':NINQ, 'VALUE':VALUE, 'YOJ':YOJ}, index=[0])
 
	X = dataset.values
 
	# Import model pickle file
	loaded_model = joblib.load("C://assets/hmeq_model.sav")
 
	# Score the input dataframe and get 0 or 1 
	scored = int(loaded_model.predict_proba(X)[0,1])
 
	# Return scored dataframe
	return scored

My SAS code calls this Python code from a SAS function defined in the next section.

The SAS code

Turning back to Excel, in the SAS Add-in side of the screen, I click on Programs. This displays a code editor which, as explained in this video, works like any other SAS code editor.

We will use this code editor to write, run and view results from our code.

The code below defines an FCMP function called Score_Python that imports the Python script from my colleague and calls it from a SAS DATA step. The output table, HMEQ_SCORED, is saved in the WORK library in SAS. Note: if you're attempting this in your own environment, make sure to update the script.py file location in the /* Getting Python file */ section.

/* Defining FCMP function */
proc fcmp outlib=work.fcmp.pyfuncs;
	/* Defining name and arguments of the Python function to be called */
 
	function Score_Python(CLAGE, CLNO, DEBTINC, DELINQ, DEROG, LOAN, MORTDUE, NINQ, VALUE, YOJ);
		/* Python object */
		declare object py(python);
 
		/* Getting Python file  */
		rc = py.infile("C:\assets\script.py");
 
		/* Send code to Python interpreter */
		rc = py.publish();
 
		/* Call python function with arguments */
		rc = py.call("score_predictions",CLAGE, CLNO, DEBTINC, DELINQ, DEROG, LOAN, MORTDUE, NINQ, VALUE, YOJ);
 
		/* Pass Python results to SAS variable */
		MyFCMPResult = py.results["scored"];
 
		return(MyFCMPResult);
	endsub;
run;
 
options cmplib=work.fcmp;
 
/* Calling FCMP function from data step */
data work.hmeq_scored;
	set work._excelexport;
	scored_bad = Score_Python(CLAGE, CLNO, DEBTINC, DELINQ, DEROG, LOAN, MORTDUE, NINQ, VALUE, YOJ);
	put scored_bad=;
run;

 

I place my code in the editor.

We're now ready to run the code. Select the 'Running man' icon in the editor.

The output

The result represents the outcome of the Python model scoring code. Once the code completes, the WORK.HMEQ_SCORED worksheet updates with a new column, scored_bad.

The binary value represents if the customer is likely (a '1') or unlikely (a '0') to default on his or her loan. I could now use any built in Excel features to filter or further analyze the data. For instance, I could filter all the customers likely to default on their loans and pass a report on to the customer management team.

Final thoughts

In this article we've explored how collaboration between teams with different skills can streamline processes for efficient data analysis. Each team focuses on what they're good at and all of the work is organized and completed in one place. It is a win-win for everyone.

Related resources

Extending Excel with Python and SAS Viya was published on SAS Users.

February 20, 2020
 

Image by 동철 이 from Pixabay

SAS® Scripting Wrapper for Analytics Transfer (SWAT), a powerful Python interface, enables you to integrate your Python code with SAS® Cloud Analytic Services (CAS). Using SWAT, you can execute CAS analytic actions, including feature engineering, machine learning modeling, and model testing, and then analyze the results locally.

This article demonstrates how you can predict the survival rates of Titanic passengers with a combination of both Python and CAS using SWAT. You can then see how well the models performed with some visual statistics.

 

Prerequisites


To get started, you will need the following:

  1. 64-bit Python 2.7 or Python 3.4+
  2. SAS® Viya®
  3. Jupyter Notebook
  4. SWAT (if needed, see the Installation page in the documentation)

After you install and configure these resources, start a Jupyter Notebook session to get started!

 

Step 1. Initialize the Python packages

Before you can build models and test how they perform, you need to initialize the different Python libraries that you will use throughout this demonstration.

Submit the following code and insert the specific values for your environment where needed:

# Import SAS SWAT Library
import swat

# Import OS for Local File Paths
import os
for dirname, _, filenames in os.walk('Desktop/'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Import Numpy Library for Linear Algebra
import numpy as np
from numpy import trapz

# Import Pandas Library for Panda Dataframe
import pandas as pd

# Import Seaborn & Matplotlib for Data Visualization
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import style

Step 2. Create the connection between Python and CAS

After you import the libraries, now you want to connect to CAS by using the SWAT package. In this demonstration, SSL is enabled in the SAS Viya environment and the SSL certificate is stored locally. If you have any connection errors, see Encryption (SSL).

Use the command below to create a connection to CAS:

# Create a Connection to CAS
conn = swat.CAS("ServerName.sas.com", PORT "UserID", "Password")

 

This command confirms the connection status:

# Verify the Connection to CAS
connection_status = conn.serverstatus()
connection_status

 

If the connection to CAS is working (you will see status information similar to the above), you can begin to import and explore the data.

Step 3. Import data into CAS

To gather the data needed for this analysis, run the following code in SAS and save the data locally.

This example saves the data locally in the Documents folder. With a CAS action, you can import this data onto the Viya CAS server.

# Import Titanic Data onto the Viya CAS Server
titanic_cas = conn.read_csv(r"C:\Users\krstob\Documents\Titanic\titanic.csv",
                            casout = dict(name="titanic", replace=True))

Step 4. Explore the data loaded in CAS

Now that the data is loaded into CAS memory, use the SWAT interface to interact with the data set. Using CAS actions, you can look at the shape, column information, records, and a description of the data. A machine learning engineer should review the data this way before loading it locally, in order to dive deeper into certain features.

If any of the SWAT syntax looks familiar to you, it is because SWAT is integrated with pandas. Here is a high-level look at the data:

  • The shape (rows, columns):
  • The column information:
  • The first three records:
  • An in-depth feature description:
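
Here is a sketch of the corresponding SWAT calls, assuming the titanic_cas table object from step 3 (the original post shows their output as images):

# pandas-style attributes and methods on the CAS table, executed in CAS
titanic_cas.shape         # (number of rows, number of columns)
titanic_cas.columninfo()  # column names, types, and lengths
titanic_cas.head(3)       # first three records
titanic_cas.describe()    # summary statistics for each feature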

Take a moment to think about this data set. If you were just using a combination of pandas and scikit-learn, you would need a good amount of data preprocessing. There are missing entries, character features that need to be converted to numeric, and data that needs to be distributed correctly for efficient and accurate processing.

Luckily, when SWAT is integrated with CAS, CAS does a lot of this work for you. CAS machine learning modeling can easily import character variables, impute missing values, normalize the data, partition the data, and much more.

The next step is to take a closer look at some of the data features.

 

Step 5. Explore the data locally

There is great information here from CAS about your 14 features. You know the data types, means, unique values, standard deviations, and more. Now, bring the data back locally into what is essentially a pandas data frame and create some graphs of the variables you believe might be predictive.

# Use the CAS To_Frame Action to Bring the CAS Table Locally into a Data Frame
titanic_pandas_df = titanic_cas.to_frame()

With data loaded locally, examine the numerical distributions:
# How Is the Data Distributed? (Numerical)
distribution_plot = titanic_pandas_df.drop('survived', axis=1).hist(bins = 15, figsize = (12,12), alpha = 0.75)

The pclass variable represents the passenger class (first class, second class, third class). Does the passenger class have any effect on the survival rate? To look more into this, you can plot histograms that compare pclass, age, and the number who survived.

# For Seaborne Facet Grids, Create an Empty 3 by 2 Graph to Place Data On
pclass_survived = sns.FacetGrid(titanic_pandas_df, col='survived', row = 'pclass', height = 2.5, aspect = 3)

# Overlay a Histogram of Y(Age) = Survived
pclass_survived.map(plt.hist, 'age', alpha = 0.75, bins = 25)

# Add a Legend for Readability
pclass_survived.add_legend()

Note: 1 = survived, 0 = did not survive

 

As this graph suggests, the higher the class, the better the chance that someone survived. There is also a low survival rate for the approximately 18–35 age range for the lower class. This information is great, because you can build new features later that focus on pclass.

Another common predictor feature is someone’s sex. Were women and children saved first in the case of the Titanic crash? The following code creates graphs to help answer this question:

# Create a Graph Canvas - One for Female Survival Rate - One for Male
Survived = 'Survived'
Not_Survived = 'Not Survived'
fig,axes = plt.subplots(nrows=1, ncols=2, figsize=(10,4))

# Initialize Women and Male Variables to the Data Set Value
Women = titanic_pandas_df[titanic_pandas_df['sex'] == 'female']
Male = titanic_pandas_df[titanic_pandas_df['sex'] == 'male']

# For the First Graph, Plot the Amount of Women Who Survived Dependent on Their Age
Female_vs_Male = sns.distplot(Women[Women['survived']==1].age.dropna(), 
               bins=25, label = Survived, ax = axes[0], kde = False)

# For the First Graph, Layer the Amount of Women Who Did Not Survive Dependent on Their Age
Female_vs_Male = sns.distplot(Women[Women['survived']==0].age.dropna(), 
               bins=25, label = Not_Survived, ax = axes[0], kde = False)

# Display a Legend for the First Graph
Female_vs_Male.legend()
Female_vs_Male.set_title('Females')

# For the Second Graph, Plot the Amount of Men Who Survived Dependent on Their Age
Female_vs_Male = sns.distplot(Male[Male['survived']==1].age.dropna(), 
               bins=25, label = Survived, ax = axes[1], kde = False)

# For the Second Graph, Layer the Amount of Men Who Did Not Survive Dependent on Their Age
Female_vs_Male = sns.distplot(Male[Male['survived']==0].age.dropna(), 
                bins=25, label = Not_Survived, ax = axes[1], kde = False)

# Display a Legend for the Second Graph
Female_vs_Male.legend()
Female_vs_Male.set_title('Men')

This graph confirms that both women and children had a better chance at survival.

These feature graphs are nice visuals, but a correlation matrix is another great way to compare numerical features and the correlation to survival rate:

# Heatmap Correlation Matrix That Compares Numerical Values and Survived
Heatmap_Matrix = sns.heatmap(
              titanic_pandas_df[["survived","sibsp","parch","age","fare"]].corr(),
              annot = True,
              fmt = ".3f",
              cmap = "coolwarm",
              center = 0,
              linewidths = 0.1
)

The heat map shows that the fare, surprisingly, has a significant correlation with survival rate (or seems to, at least). Keep this in mind when you build the models.

You now have great knowledge of the features in this data set. Age, fare, and sex do affect someone’s survival chances. For a general machine learning problem, you would typically explore each feature in more detail, but, for now, it is time to move on to some feature engineering in CAS.

 

Step 6. Check for missing values

Now that you have a general idea about which features are important, you should clean up any missing values quickly by using CAS. By default, CAS replaces any missing values with the mean, but there are many other modes to choose from.

For this test case, you can keep the default because the data set is quite small. Use the CAS impute action to perform a data matrix (variable) imputation that fills in missing values.

First, check to see how many missing values are in the data set:

# Check for Missing Values
titanic_cas.distinct()

Both fare and age are important numeric variables that have missing values, so run the impute action with SWAT to fill in any missing values:

# Impute Missing Values (Replace with Substituted Values [By Default w/ the Mean])
conn.dataPreprocess.impute(
     table            = 'titanic',
     inputs          = ['age','fare'],
     copyAllVars = True,
     casOut        = dict(name = 'titanic', replace = True)
)

And, just like that, CAS takes care of the missing numeric values and creates a new variable, IMP_variable. This action would have taken more time to do with pandas or scikit-learn, so CAS was a nice time saver.

Step 7. Load the data locally to create new features

You now have great, clean data that you can model in CAS. Sometimes, though, for a machine learning problem, you want to create your own custom features.

It is easy to do that by using a data frame locally, so bring the data back to your local machine to create some custom features.

Using the to_frame() CAS action, convert the CAS data set into a local data frame. Keep only the variables needed for modeling:

<u></u><span style="font-size: 14px;"># Use the CAS To_Frame Action to bring the CAS Table Locally into a Data Frame</span>
titanic_pandas_df = titanic_cas.to_frame()

# Remove Some Features That Are Not Needed for Predictive Modeling
titanic_pandas_df = titanic_pandas_df[['embarked',
     'parch',
     'sex',
     'pclass',
     'sibsp',
     'survived',
     'IMP_fare',
     'IMP_age']
]

After the predictive features are available locally, confirm that the CAS statistical imputation worked:
# Check How Many Values Are Null by Using the isnull() Function
total_missing = titanic_pandas_df.isnull().sum().sort_values(ascending=False)
total_missing.head(5)

# Find the Total Values
total = titanic_pandas_df.notnull().sum().sort_values(ascending=False)
total.head(5)

# Find the Percentage of Missing Values per Variable
Percent = titanic_pandas_df.isnull().sum()/titanic_pandas_df.isnull().count()*100
Percent.sort_values(ascending=False).head(5)

# Round to One Decimal Place for Less Storage
Percent_Rounded = (round(Percent,1)).sort_values(ascending=False)

# Plot the Missing Data [Total Missing, Percentage Missing] with a Concatenation of Two Columns
Missing_Data = pd.concat([total, total_missing, Percent_Rounded], axis = 1,
                                            keys=['Non Missing Values', 'Total Missing Values', '% Missing'], sort=True)
Missing_Data

As you can see from the output above, all features are clean, and no values are missing! Now you can create some new features.

Step 8. Create new features

With machine learning, there are times when you want to create your own features that combine useful information to create a more accurate model. This action can help with overfitting, memory usage, or many other reasons.

This demo shows how to build four new features:

  • Relatives
  • Alone_on_Ship
  • Age_Times_Class
  • Fare_Per_Person

 

Relatives and Alone_on_Ship

The sibsp feature is the number of siblings and spouses, and the parch variable is the number of parents and children. So, you can combine these two for a Relatives feature that indicates how many people someone had on the ship in total. If a passenger traveled completely alone, you can flag that by creating the categorical variable Alone_on_Ship:

# Create a Relatives Variable / Alone on Ship
data = [titanic_pandas_df]
for dataset in data:
     dataset['Relatives'] = dataset['sibsp'] + dataset['parch']
     dataset.loc[dataset['Relatives'] &gt; 0, 'Alone_On_Ship'] = 0
     dataset.loc[dataset['Relatives'] == 0, 'Alone_On_Ship'] = 1
     dataset['Alone_On_Ship'] = dataset['Alone_On_Ship'].astype(int)

Age_Times_Class

As discussed earlier in this demo, both age and class had an effect on survivability. So, create a new Age_Times_Class feature that combines a person’s age and class:

data = [titanic_pandas_df]

# For Loop That Creates a New Variable, Age_Times_Class
for dataset in data:
     dataset['Age_Times_Class']= dataset['IMP_age'] * dataset['pclass']

Fare_Per_Person

The Fare_Per_Person variable is created by dividing the IMP_fare variable (cleaned by CAS) by the Relatives variable and then adding 1, which accounts for the passenger:

# Set the Training &amp; Testing Data for Efficiency
data = [titanic_pandas_df]

# For Loop Through Both Data Sets That Creates a New Variable, Fare_Per_Person
for dataset in data:
     dataset['Fare_Per_Person'] = dataset['IMP_fare']/(dataset['Relatives']+1)
     dataset['Sib_Div_Spouse'] = dataset['sibsp']
     dataset['Parents_Div_Children'] = dataset['parch']

# Drop the Parent Variable
titanic_pandas_df = titanic_pandas_df.drop(['parch'], axis=1)

# Drop the Siblings Variable
titanic_pandas_df = titanic_pandas_df.drop(['sibsp'], axis=1)

With these new features that you created with Python, here is how the data set looks:
# Look at How the Data Is Distributed
titanic_pandas_df.head(5)

Step 9. Load the data back into CAS for model training

Now you can make some models! Load the clean data back into CAS and start training:

# Upload the Data Set
titanic_cas = conn.upload_frame(titanic_pandas_df, casout = dict(name='titanic', replace='True'))

These code examples display the tables that the data is in:
# The Data Frame Type of the CAS Data
type(titanic_train_cas)


# The Data Frame Type of the Local Data
type(titanic_train_pd)

As you can see, the Titanic CAS table has a CASTable data frame and the local table has a SAS data frame.

Here is a final look at the data that you are going to train:

If you were training a model using scikit-learn, it would not really achieve great results without more preparation. Most of the values are in float format, and there are a few categorical variables. Luckily, CAS can handle these issues when it builds the models.

 

Step 10. Create a testing and training set

One of the awesome things about CAS machine learning is that you do not have to manually separate the data set. Instead, you can run a partitioning function and then model and test based on these parameters. To do this, you need to load the Sampling action set. Then, you can call the srs action, which can quickly partition a data table.

# Partitioning the Data
conn.loadActionSet('sampling')
conn.sampling.srs(
     table = 'titanic',
     samppct = 80,
     partind = True,
     output = dict(casout = dict(name = 'titanic', replace = True), copyVars = 'ALL')
)

This code partitions the data and adds a unique identifier to the data row to indicate whether it is testing or training. The unique identifier is _PartInd_. When a data row has this identifier equal to 0, it is part of the testing set. Similarly, when it is equal to 1, the data row is part of the training set.

 

Step 11. Build various models with CAS

One of my favorite parts about machine learning with CAS is how simple building a model is. With looping, you can dynamically change the targets, inputs, and nominal variables. If you are trying to build an extremely accurate model, it would be a great solution.

Building a model with CAS requires you to do a few things:

  • Load the action set (in this demo, a forest model, decision tree, and gradient boosting model)
  • Set your model variables (targets, inputs, and nominals)
  • Train the model

 

Forest Model

What does the data look like with a forest model? Here is the relevant code:

# Load the decisionTree CAS Action Set
conn.loadActionSet('decisionTree')

# Set Out Target for Predictive Modeling
target = 'survived'

# Set Inputs to Use to Predict Survived (Numerical Variable Inputs)
inputs = ['sex', 'pclass', 'Alone_On_Ship', 
                'Age_Times_Class', 'Relatives', 'IMP_age', 
                'IMP_fare', 'Fare_Per_Person', 'embarked']

# Set Nominal Variables to Use in Model (Categorial Variable Inputs)
nominals = ['sex', 'pclass', 'Alone_On_Ship', 'embarked', 'survived']

# Train the Forest Model
conn.decisionTree.forestTrain(
     table = dict(name = 'titanic', where = '_PartInd_ = 1'),
     target = target,
     inputs = inputs,
     nominals = nominals,
     casOut = dict(name = 'titanic_forest_model', replace = True)
)

Why does the input table have a WHERE clause? That is because you are looking at rows that contain the training flag, which was created with the srs action.

After running that block of code, you also get a response from CAS detailing how the model was trained, including great parameters like Number of Trees, Confidence Level for Pruning, and Max Number of Tree Nodes. If you wanted to do hyper-parameter tuning, this output shows you what the model currently looks like before you even adjust how it executes.

Decision Tree Model

The code to train a decision tree model is similar to the forest model example:

# Train the Decision Tree Model
conn.decisionTree.forestTrain(
     table = dict(name = 'titanic', where = '_PartInd_ = 1'),
     target = target,
     inputs = inputs,
     nominals = nominals,
     casOut = dict(name = 'titanic_decisiontree_model', replace = True)
)

Gradient Boosting Model

Lastly, you can build a gradient boosting model in this way:

# Train the Gradient Boosting Model
conn.decisionTree.gbtreeTrain(
     table = dict(name = 'titanic', where = '_PartInd_ = 1'),
     target = target,
     inputs = inputs,
     nominals = nominals,
     casOut = dict(name = 'titanic_gradient_model', replace = True)
)

Step 12. Score the models

How do you score the models with CAS? CAS has a score function for each model that is built. This function generates a new table that contains how the model performed on each data input.

Here is how to score the three models:

titanic_forest_score = conn.decisionTree.forestScore(
     table = dict(name = 'titanic', where = '_PartInd_ = 0'),
     model = 'titanic_forest_model',
     casout = dict(name='titanic_forest_score', replace = True),
     copyVars = target,
     encodename = True,
     assessonerow = True
)

titanic_decisiontree_score = conn.decisionTree.dtreeScore(
     table = dict(name = 'titanic', where = '_PartInd_ = 0'),
     model = 'titanic_decisiontree_model',
     casout = dict(name='titanic_decisiontree_score', replace = True),
     copyVars = target,
     encodename = True,
     assessonerow = True
)

titanic_gradient_score = conn.decisionTree.gbtreeScore(
     table = dict(name = 'titanic', where = '_PartInd_ = 0'),
     model = 'titanic_gradient_model',
     casout = dict(name='titanic_gradient_score', replace = True),
     copyVars = target,
     encodename = True,
     assessonerow = True
)

When the scoring function is running, it creates the new variables P_survived1, which is the prediction for whether a passenger survived or not, and P_survived0, which is the prediction for whether the passenger did not survive. With this scoring function, you can see how accurately a model could correctly classify passengers on the testing set.

If you dive deeper into the Python object, this scoring function is set as equal to, so you can actually see the misclassification rate!

For example, examine how the forest model did by running this code:

titanic_forest_score

The scoring function read in all of the model-tested set and told you the misclassification error. By calculating [1 – Misclassification Error], you can see that the model was approximately 85% accurate. For barely exploring the data and testing and training on a small data set, this score is good. These scores can be misleading though, as they do not tell the entire story. Do you have more false positives or false negatives? When it comes to predicting human survival, those parameters are important to investigate.

To analyze that, load the percentile CAS action set. This action set provides actions for calculating percentiles and box plot values. In your case, it also assesses models. With this information, CAS has an assess function to determine a final assessment of how the model did.

conn.loadActionSet('percentile')
prediction = 'P_survived1'

titanic_forest_assessed = conn.percentile.assess(
     table = 'titanic_forest_score',
     inputs = prediction,
     casout = dict(name = 'titanic_forest_assessed', replace = True),
     response = target,
     event = '1'
)

titanic_decissiontree_assessed = conn.percentile.assess(
     table = 'titanic_decisiontree_score',
     inputs = prediction,
     casout = dict(name = 'titanic_decissiontree_assessed', replace = True),
     response = target,
     event = '1'
)

titanic_gradient_assessed = conn.percentile.assess(
     table = 'titanic_gradient_score',
     inputs = prediction,
     casout = dict(name = 'titanic_gradient_assessed', replace = True),
     response = target,
     event = '1'
)

This CAS action returns three types of assessments: lift-related assessments, ROC-related assessments, and concordance statistics. Python is great at graphing data, so now you can move the data locally and see how it did with the new assessments.

 

Step 13. Analyze the results locally

You can plot the receiver operating characteristic (ROC) curve and the cumulative lift to determine how the models performed. Using the ROC curve, you can then calculate the area under the ROC curve (AUC) to see overall how well the models predicted the survival rate.

What exactly is a ROC curve or lift?

  • A ROC curve is determined by plotting the true positive rate (TPR) against the false positive rate. The true positive rate is the proportional observations that were correctly predicted to be positive. The false positive rate is the proportional observations that were incorrectly predicted to be positive.
  • A lift chart is derived from a gains chart. The X axis acts as a percentile, but the Y axis is the ratio of the gains value of our model and the gains value of a model that is choosing passengers randomly. That is, it details how many times the model is better than the random choice of cases.

Before you can plot these tables locally, you need to create a connection to them. CAS created some new assessed tables, so create a connection to these CAS tables for analysis:

# Assess Forest
titanic_assess_ROC_Forest = conn.CASTable('titanic_forest_assessed_ROC')
titanic_assess_Lift_Forest = conn.CASTable('titanic_forest_assessed')

titanic_ROC_pandas_Forest = titanic_assess_ROC_Forest.to_frame()
titanic_Lift_pandas_Forest = titanic_assess_Lift_Forest.to_frame()

# Assess Decision Tree
titanic_assess_ROC_DT = conn.CASTable('titanic_decisiontree_assessed_ROC')
titanic_assess_Lift_DT = conn.CASTable('titanic_decisiontree_assessed')

titanic_ROC_pandas_DT = titanic_assess_ROC_DT.to_frame()
titanic_Lift_pandas_DT = titanic_assess_Lift_DT.to_frame()

# Assess GB
titanic_assess_ROC_gb = conn.CASTable('titanic_gradient_assessed_ROC')
titanic_assess_Lift_gb = conn.CASTable('titanic_gradient_assessed')

titanic_ROC_pandas_gb = titanic_assess_ROC_gb.to_frame()
titanic_Lift_pandas_gb = titanic_assess_Lift_gb.to_frame()

Now that there is a connection to these tables, you can use the Matplotlib library to plot the ROC curve. Plot each model on this graph to see which model performed the best:
# Plot ROC Locally
plt.figure(figsize = (10,10))
plt.plot(1-titanic_ROC_pandas_Forest['_Specificity_'], 
             titanic_ROC_pandas_Forest['_Sensitivity_'], 'bo-', linewidth = 3)
plt.plot(1-titanic_ROC_pandas_DT['_Specificity_'], 
            titanic_ROC_pandas_DT['_Sensitivity_'], 'ro-', linewidth = 3)
plt.plot(1-titanic_ROC_pandas_gb['_Specificity_'], 
             titanic_ROC_pandas_gb['_Sensitivity_'], 'go-', linewidth = 3)
plt.plot(pd.Series(range(0,11,1))/10, pd.Series(range(0,11,1))/10, 'k--')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.legend(['Forest', 'Decision Tree', 'Gradient Boosting'])
plt.show()

You can also take the depth and cumulative lift scores from the assessed data set and plot that information:

# Plot Lift Locally
plt.figure(figsize = (10,10))
plt.plot(titanic_Lift_pandas_Forest['_Depth_'], titanic_Lift_pandas_Forest['_CumLift_'], 'bo-', linewidth = 3)
plt.plot(titanic_Lift_pandas_DT['_Depth_'], titanic_Lift_pandas_DT['_CumLift_'], 'ro-', linewidth = 3)
plt.plot(titanic_Lift_pandas_gb['_Depth_'], titanic_Lift_pandas_gb['_CumLift_'], 'go-', linewidth = 3)
plt.xlabel('Depth')
plt.ylabel('Cumulative Lift')
plt.title('Cumulative Lift Curve')
plt.legend(['Forest', 'Decision Tree', 'Gradient Boosting'])
plt.show()

Although these curves work well for exploring the model success and perhaps how you can better tune the data, typically most people just want to see overall how well the model performed. To get a general idea, you can integrate the ROC curve to get this overview. This area is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance.

# Forest Scores
x_forest = np.array([titanic_ROC_pandas_Forest['_Specificity_']])
y_forest  = np.array([titanic_ROC_pandas_Forest['_Sensitivity_']])

# Decision Tree Scores
x_dt = np.array([titanic_ROC_pandas_DT['_Specificity_']])
y_dt  = np.array([titanic_ROC_pandas_DT['_Sensitivity_']])

# GB Scores
x_gb = np.array([titanic_ROC_pandas_gb['_Specificity_']])
y_gb  = np.array([titanic_ROC_pandas_gb['_Sensitivity_']])

# Calculate Area Under Curve (Integrate)
area_forest = trapz(y_forest ,x_forest)
area_dt = trapz(y_dt ,x_dt)
area_gb = trapz(y_gb ,x_gb)

# Table For Model Scores
Model_Results = pd.DataFrame({
'Model': ['Forest', 'Decision Tree', 'Gradient Boosting'],
'Score': [area_forest, area_dt, area_gb]})

Model_Results

With the AUC ROC score, you can now see how well the model performs at distinguishing between positive and negative outcomes.

 

Conclusion

Any machine learning engineer should take time to further investigate integrating SAS Viya with their normal programming environment. When it works through the SWAT Python interface, CAS excels at quickly building and scoring a model. You can do this even with large data sets, because the data is stored in CAS memory. If you want to go further in-depth with using ensemble methods, I would recommend using SAS® Model Studio on SAS Viya, or perhaps one of the many great open-source libraries, like scikit-learn on Python.

The ability of CAS to quickly clean, model, and score a prediction is quite impressive. If you would like to take a further look at what SWAT and CAS can do, check out the different action sets that can be completed.

If you would like some more information about SWAT and SAS® Viya®, see these resources:

Would you like to see SWAT machine learning work with a larger data set, or perhaps use SWAT to build a neural network? Please leave a comment!

Building Machine Learning Models by Integrating Python and SAS® Viya® was published on SAS Users.

February 20, 2020
 

Image by 동철 이 from Pixabay

SAS® Scripting Wrapper for Analytics Transfer (SWAT), a powerful Python interface, enables you to integrate your Python code with SAS® Cloud Analytic Services (CAS). Using SWAT, you can execute CAS analytic actions, including feature engineering, machine learning modeling, and model testing, and then analyze the results locally.

This article demonstrates how you can predict the survival rates of Titanic passengers with a combination of both Python and CAS using SWAT. You can then see how well the models performed with some visual statistics.

 

Prerequisites


To get started, you will need the following:

  1. 64-bit Python 2.7 or Python 3.4+
  2. SAS® Viya®
  3. Jupyter Notebook
  4. SWAT (if needed, see the Installation page in the documentation)

After you install and configure these resources, start a Jupyter Notebook session to get started!

 

Step 1. Initialize the Python packages

Before you can build models and test how they perform, you need to initialize the different Python libraries that you will use throughout this demonstration.

Submit the following code and insert the specific values for your environment where needed:

# Import SAS SWAT Library
import swat

# Import OS for Local File Paths
import os
for dirname, _, filenames in os.walk('Desktop/'):
     for filename in filenames:
          print(os.path.join(dirname, filename))

# Import Numpy Library for Linear Algebra
import numpy as np
from numpy import trapz

# Import Pandas Library for the Pandas DataFrame
import pandas as pd

# Import Seaborn & Matplotlib for Data Visualization
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import style

Step 2. Create the connection between Python and CAS

After you import the libraries, now you want to connect to CAS by using the SWAT package. In this demonstration, SSL is enabled in the SAS Viya environment and the SSL certificate is stored locally. If you have any connection errors, see Encryption (SSL).

Use the command below to create a connection to CAS:

# Create a Connection to CAS
conn = swat.CAS("ServerName.sas.com", PORT, "UserID", "Password")

 

This command confirms the connection status:

# Verify the Connection to CAS
connection_status = conn.serverstatus()
connection_status

 

If the connection to CAS is working, the serverstatus call returns details about the CAS server and its nodes, and you can begin to import and explore the data.

Step 3. Import data into CAS

To gather the data needed for this analysis, download the Titanic data set as a CSV file and save it locally.

This example saves the data locally in the Documents folder. With a CAS action, you can import this data onto the SAS Viya CAS server.

# Import Titanic Data on to Viya CAS Server
titanic_cas = conn.read_csv(r"C:\Users\krstob\Documents\Titanic\titanic.csv", 
casout = dict(name="titanic", replace=True))

Step 4. Explore the data loaded in CAS

Now that the data is loaded into CAS memory, use the SWAT interface to interact with the data set. Using CAS actions, you can look at the shape, column information, records, and descriptions of the data. A machine learning engineer should review the data this way before loading it locally in order to dive deeper into certain features.

If any of the SWAT syntax looks familiar to you, it is because SWAT is integrated with pandas. Here is a high-level look at the data, gathered with the calls sketched after this list:

  • The shape (rows, columns)
  • The column information
  • The first three records
  • An in-depth feature description
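
For reference, here is a minimal sketch of those calls (assuming the titanic_cas CASTable object from Step 3; these pandas-style methods are part of SWAT's CASTable interface):
# High-level look at the CAS table, using SWAT's pandas-style methods
titanic_cas.shape         # the shape: (number of rows, number of columns)
titanic_cas.columninfo()  # column names, types, and lengths
titanic_cas.head(3)       # the first three records
titanic_cas.describe()    # an in-depth description of each feature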

Take a moment to think about this data set. If you were just using a combination of pandas and scikit-learn, you would need a good amount of data preprocessing. There are missing entries, character features that need to be converted to numeric, and data that needs to be distributed correctly for efficient and accurate processing.

Luckily, when SWAT is integrated with CAS, CAS does a lot of this work for you. CAS machine learning modeling can easily import character variables, impute missing values, normalize the data, partition the data, and much more.

The next step is to take a closer look at some of the data features.

 

Step 5. Explore the data locally

There is great information here from CAS about your 14 features. You know the data types, means, unique values, standard deviation, and others. Now, bring the data back locally into what is essentially a pandas data frame and create some graphs of the variables that you believe might be good predictors.

# Use the CAS To_Frame Action to Bring the CAS Table Locally into a Data Frame
titanic_pandas_df = titanic_cas.to_frame()

With data loaded locally, examine the numerical distributions:
# How Is the Data Distributed? (Numerical)
distribution_plot = titanic_pandas_df.drop('survived', axis=1).hist(bins = 15, figsize = (12,12), alpha = 0.75)

The pclass variable represents the passenger class (first class, second class, third class). Does the passenger class have any effect on the survival rate? To look more into this, you can plot histograms that compare pclass, age, and the number who survived.

# For Seaborn Facet Grids, Create an Empty 3 by 2 Graph to Place Data On
pclass_survived = sns.FacetGrid(titanic_pandas_df, col='survived', row = 'pclass', height = 2.5, aspect = 3)

# Overlay a Histogram of Y(Age) = Survived
pclass_survived.map(plt.hist, 'age', alpha = 0.75, bins = 25)

# Add a Legend for Readability
pclass_survived.add_legend()

Note: 1 = survived, 0 = did not survive

 

As this graph suggests, the higher the class, the better the chance that someone survived. There is also a low survival rate for the approximately 18–35 age range for the lower class. This information is great, because you can build new features later that focus on pclass.

Another common predictor feature is someone’s sex. Were women and children saved first when the Titanic sank? The following code creates graphs to help answer this question:

# Create a Graph Canvas - One for Female Survival Rate - One for Male
Survived = 'Survived'
Not_Survived = 'Not Survived'
fig,axes = plt.subplots(nrows=1, ncols=2, figsize=(10,4))

# Subset the Data into Female and Male Passengers
Women = titanic_pandas_df[titanic_pandas_df['sex'] == 'female']
Male = titanic_pandas_df[titanic_pandas_df['sex'] == 'male']

# For the First Graph, Plot the Number of Women Who Survived, by Age
Female_vs_Male = sns.distplot(Women[Women['survived']==1].age.dropna(), 
               bins=25, label = Survived, ax = axes[0], kde = False)

# For the First Graph, Layer the Number of Women Who Did Not Survive, by Age
Female_vs_Male = sns.distplot(Women[Women['survived']==0].age.dropna(), 
               bins=25, label = Not_Survived, ax = axes[0], kde = False)

# Display a Legend for the First Graph
Female_vs_Male.legend()
Female_vs_Male.set_title('Females')

# For the Second Graph, Plot the Number of Men Who Survived, by Age
Female_vs_Male = sns.distplot(Male[Male['survived']==1].age.dropna(), 
               bins=25, label = Survived, ax = axes[1], kde = False)

# For the Second Graph, Layer the Number of Men Who Did Not Survive, by Age
Female_vs_Male = sns.distplot(Male[Male['survived']==0].age.dropna(), 
                bins=25, label = Not_Survived, ax = axes[1], kde = False)

# Display a Legend for the Second Graph
Female_vs_Male.legend()
Female_vs_Male.set_title('Men')

This graph confirms that both women and children had a better chance of survival.

These feature graphs are nice visuals, but a correlation matrix is another great way to compare numerical features and the correlation to survival rate:

# Heatmap Correlation Matrix That Compares Numerical Values and Survived
Heatmap_Matrix = sns.heatmap(
              titanic_pandas_df[["survived","sibsp","parch","age","fare"]].corr(),
              annot = True,
              fmt = ".3f",
              cmap = "coolwarm",
              center = 0,
              linewidths = 0.1
)

The heat map shows that the fare, surprisingly, has a significant correlation with survival rate (or seems to, at least). Keep this in mind when you build the models.

You now have great knowledge of the features in this data set. Age, fare, and sex do affect someone’s survival chances. For a general machine learning problem, you would typically explore each feature in more detail, but, for now, it is time to move on to some feature engineering in CAS.

 

Step 6. Check for missing values

Now that you have a general idea about which features are important, you should clean up any missing values quickly by using CAS. By default, CAS replaces any missing values with the mean, but there are many other methods to choose from.

For this test case, you can keep the default because the data set is quite small. Use the CAS impute action to perform a data matrix (variable) imputation that fills in missing values.

First, check to see how many missing values are in the data set:

# Check for Missing Values
titanic_cas.distinct()

Both fare and age are important numeric variables that have missing values, so run the impute action with SWAT to fill in any missing values:

# Impute Missing Values (Replace with Substituted Values [By Default w/ the Mean])
conn.dataPreprocess.impute(
     table = 'titanic',
     inputs = ['age','fare'],
     copyAllVars = True,
     casOut = dict(name = 'titanic', replace = True)
)

And, just like that, CAS takes care of the missing numeric values and creates new variables with the IMP_ prefix (here, IMP_age and IMP_fare). This action would have taken more time to do with pandas or scikit-learn, so CAS was a nice time saver.
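To confirm, you can rerun the distinct action; because the impute call wrote its output back to the same table name, the existing titanic_cas reference still applies (a quick sketch, not in the original post):
# Re-check the missing-value counts; the new IMP_ columns should report zero missing
titanic_cas.distinct()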

Step 7. Load the data locally to create new features

You now have great, clean data that you can model in CAS. Sometimes, though, for a machine learning problem, you want to create your own custom features.

It is easy to do that by using a data frame locally, so bring the data back to your local machine to create some custom features.

Using the to_frame() CAS action, convert the CAS data set into a local data frame. Keep only the variables needed for modeling:

# Use the CAS To_Frame Action to Bring the CAS Table Locally into a Data Frame
titanic_pandas_df = titanic_cas.to_frame()

# Remove Some Features That Are Not Needed for Predictive Modeling
titanic_pandas_df = titanic_pandas_df[['embarked',
     'parch',
     'sex',
     'pclass',
     'sibsp',
     'survived',
     'IMP_fare',
     'IMP_age']
]

After the predictive features are available locally, confirm that the CAS statistical imputation worked:
# Check How Many Values Are Null by Using the isnull() Function
total_missing = titanic_pandas_df.isnull().sum().sort_values(ascending=False)
total_missing.head(5)

# Find the Total Values
total = titanic_pandas_df.notnull().sum().sort_values(ascending=False)
total.head(5)

# Find the Percentage of Missing Values per Variable
Percent = titanic_pandas_df.isnull().sum()/titanic_pandas_df.isnull().count()*100
Percent.sort_values(ascending=False).head(5)

# Round to One Decimal Place for Less Storage
Percent_Rounded = (round(Percent,1)).sort_values(ascending=False)

# Build a Missing-Data Summary [Non Missing, Total Missing, Percentage Missing] by Concatenating Three Columns
Missing_Data = pd.concat([total, total_missing, Percent_Rounded], axis = 1,
                         keys = ['Non Missing Values', 'Total Missing Values', '% Missing'], sort = True)
Missing_Data

As you can see from the output above, all features are clean, and no values are missing! Now you can create some new features.

Step 8. Create new features

With machine learning, there are times when you want to create your own features that combine useful information to create a more accurate model. Engineering features like this can help reduce overfitting, lower memory usage, and more.

This demo shows how to build four new features:

  • Relatives
  • Alone_on_Ship
  • Age_Times_Class
  • Fare_Per_Person

 

Relatives and Alone_on_Ship

The sibsp feature is the number of siblings and spouses, and the parch variable is the number of parents and children. So, you can combine these two into a Relatives feature that indicates how many relatives someone had on the ship in total. If a passenger traveled completely alone, you can flag that by creating the categorical variable Alone_on_Ship:

# Create a Relatives Variable / Alone on Ship
data = [titanic_pandas_df]
for dataset in data:
     dataset['Relatives'] = dataset['sibsp'] + dataset['parch']
     dataset.loc[dataset['Relatives'] > 0, 'Alone_On_Ship'] = 0
     dataset.loc[dataset['Relatives'] == 0, 'Alone_On_Ship'] = 1
     dataset['Alone_On_Ship'] = dataset['Alone_On_Ship'].astype(int)

Age_Times_Class

As discussed earlier in this demo, both age and class had an effect on survivability. So, create a new Age_Times_Class feature that combines a person’s age and class:

data = [titanic_pandas_df]

# For Loop That Creates a New Variable, Age_Times_Class
for dataset in data:
     dataset['Age_Times_Class']= dataset['IMP_age'] * dataset['pclass']

Fare_Per_Person

The Fare_Per_Person variable is created by dividing the IMP_fare variable (cleaned by CAS) by the Relatives variable plus 1, where the added 1 accounts for the passenger:

# Put the Data Frame in a List for Iteration
data = [titanic_pandas_df]

# For Loop That Creates a New Variable, Fare_Per_Person
for dataset in data:
     dataset['Fare_Per_Person'] = dataset['IMP_fare']/(dataset['Relatives']+1)
     dataset['Sib_Div_Spouse'] = dataset['sibsp']
     dataset['Parents_Div_Children'] = dataset['parch']

# Drop the Parent Variable
titanic_pandas_df = titanic_pandas_df.drop(['parch'], axis=1)

# Drop the Siblings Variable
titanic_pandas_df = titanic_pandas_df.drop(['sibsp'], axis=1)

With these new features that you created with Python, here is how the data set looks:
# Look at How the Data Is Distributed
titanic_pandas_df.head(5)

Step 9. Load the data back into CAS for model training

Now you can make some models! Load the clean data back into CAS and start training:

# Upload the Data Set
titanic_cas = conn.upload_frame(titanic_pandas_df, casout = dict(name='titanic', replace=True))

These code examples display the type of object that each copy of the data is stored in:
# The Data Frame Type of the CAS Data
type(titanic_cas)

# The Data Frame Type of the Local Data
type(titanic_pandas_df)

As you can see, the Titanic CAS table is a CASTable object, and the local table is a SASDataFrame (a pandas-style data frame from SWAT).

Here is a final look at the data that you are going to train:
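One quick way to take that look is with a head call (a one-line sketch using the CASTable reference returned by upload_frame):
# Preview the cleaned table now stored in CAS
titanic_cas.head()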

If you were training a model using scikit-learn, it would not really achieve great results without more preparation. Most of the values are in float format, and there are a few categorical variables. Luckily, CAS can handle these issues when it builds the models.

 

Step 10. Create a testing and training set

One of the awesome things about CAS machine learning is that you do not have to manually separate the data set. Instead, you can run a partitioning function and then model and test based on these parameters. To do this, you need to load the Sampling action set. Then, you can call the srs action, which can quickly partition a data table.

# Partitioning the Data
conn.loadActionSet('sampling')
conn.sampling.srs(
     table = 'titanic',
     samppct = 80,
     partind = True,
     output = dict(casout = dict(name = 'titanic', replace = True), copyVars = 'ALL')
)

This code partitions the data and adds a partition indicator variable, _PartInd_, to each data row to indicate whether the row is for testing or training. When a data row has this indicator equal to 0, it is part of the testing set. Similarly, when it is equal to 1, the data row is part of the training set.
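
As a quick sanity check, you can tabulate the new indicator with the freq action from the preloaded simple action set (a small sketch, not part of the original post):
# Count the rows flagged for training (1) versus testing (0)
conn.simple.freq(table = 'titanic', inputs = '_PartInd_')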

 

Step 11. Build various models with CAS

One of my favorite parts about machine learning with CAS is how simple building a model is. With looping, you can dynamically change the targets, inputs, and nominal variables, which makes it easy to iterate toward an extremely accurate model.

Building a model with CAS requires you to do a few things:

  • Load the action set (in this demo, the decisionTree action set, which provides the forest, decision tree, and gradient boosting models)
  • Set your model variables (targets, inputs, and nominals)
  • Train the model

 

Forest Model

What does the data look like with a forest model? Here is the relevant code:

# Load the decisionTree CAS Action Set
conn.loadActionSet('decisionTree')

# Set Our Target for Predictive Modeling
target = 'survived'

# Set Inputs to Use to Predict Survived (Model Variable Inputs)
inputs = ['sex', 'pclass', 'Alone_On_Ship',
          'Age_Times_Class', 'Relatives', 'IMP_age',
          'IMP_fare', 'Fare_Per_Person', 'embarked']

# Set Nominal Variables to Use in Model (Categorical Variable Inputs)
nominals = ['sex', 'pclass', 'Alone_On_Ship', 'embarked', 'survived']

# Train the Forest Model
conn.decisionTree.forestTrain(
     table = dict(name = 'titanic', where = '_PartInd_ = 1'),
     target = target,
     inputs = inputs,
     nominals = nominals,
     casOut = dict(name = 'titanic_forest_model', replace = True)
)

Why does the input table have a WHERE clause? That is because you are training only on the rows that carry the training flag (_PartInd_ = 1), which was created with the srs action.

After running that block of code, you also get a response from CAS detailing how the model was trained, including great parameters like Number of Trees, Confidence Level for Pruning, and Max Number of Tree Nodes. If you wanted to do hyper-parameter tuning, this output shows you what the model currently looks like before you even adjust how it executes.

Decision Tree Model

The code to train a decision tree model is similar to the forest model example:

# Train the Decision Tree Model
conn.decisionTree.dtreeTrain(
     table = dict(name = 'titanic', where = '_PartInd_ = 1'),
     target = target,
     inputs = inputs,
     nominals = nominals,
     casOut = dict(name = 'titanic_decisiontree_model', replace = True)
)

Gradient Boosting Model

Lastly, you can build a gradient boosting model in this way:

# Train the Gradient Boosting Model
conn.decisionTree.gbtreeTrain(
     table = dict(name = 'titanic', where = '_PartInd_ = 1'),
     target = target,
     inputs = inputs,
     nominals = nominals,
     casOut = dict(name = 'titanic_gradient_model', replace = True)
)
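
Because all three training actions share the same signature, the looping approach mentioned at the start of this step is straightforward. Here is a minimal sketch (assuming the target, inputs, and nominals variables defined above):

# Map each output model name to its training action
trainers = {
     'titanic_forest_model': conn.decisionTree.forestTrain,
     'titanic_decisiontree_model': conn.decisionTree.dtreeTrain,
     'titanic_gradient_model': conn.decisionTree.gbtreeTrain
}

# Train every model on the training partition with the same settings
for model_name, train in trainers.items():
     train(
          table = dict(name = 'titanic', where = '_PartInd_ = 1'),
          target = target,
          inputs = inputs,
          nominals = nominals,
          casOut = dict(name = model_name, replace = True)
     )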

Step 12. Score the models

How do you score the models with CAS? CAS has a score function for each model that is built. This function generates a new table that shows how the model performed on each row of input data.

Here is how to score the three models:

titanic_forest_score = conn.decisionTree.forestScore(
     table = dict(name = 'titanic', where = '_PartInd_ = 0'),
     model = 'titanic_forest_model',
     casout = dict(name='titanic_forest_score', replace = True),
     copyVars = target,
     encodename = True,
     assessonerow = True
)

titanic_decisiontree_score = conn.decisionTree.dtreeScore(
     table = dict(name = 'titanic', where = '_PartInd_ = 0'),
     model = 'titanic_decisiontree_model',
     casout = dict(name='titanic_decisiontree_score', replace = True),
     copyVars = target,
     encodename = True,
     assessonerow = True
)

titanic_gradient_score = conn.decisionTree.gbtreeScore(
     table = dict(name = 'titanic', where = '_PartInd_ = 0'),
     model = 'titanic_gradient_model',
     casout = dict(name='titanic_gradient_score', replace = True),
     copyVars = target,
     encodename = True,
     assessonerow = True
)

When the scoring function runs, it creates the new variables P_survived1, the predicted probability that a passenger survived, and P_survived0, the predicted probability that the passenger did not survive. With this scoring function, you can see how accurately a model classifies passengers in the testing set.

Because each scoring call above was assigned to a Python variable, you can dive deeper into the returned object and actually see the misclassification rate!

For example, examine how the forest model did by running this code:

titanic_forest_score

The scoring function read in the entire testing set and reported the misclassification error. By calculating [1 – Misclassification Error], you can see that the model was approximately 85% accurate. For barely exploring the data and for testing and training on a small data set, this score is good. These scores can be misleading though, as they do not tell the entire story. Do you have more false positives or false negatives? When it comes to predicting human survival, those measures are important to investigate.
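
Each scoring call above returns a CASResults object, which behaves like a Python dictionary of result tables. Here is a small sketch for locating the relevant statistics (the exact table keys depend on the action's output):
# List the result tables returned by the scoring action
print(list(titanic_forest_score.keys()))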

To analyze that, load the percentile CAS action set. This action set provides actions for calculating percentiles and box plot values; in your case, it also assesses models. Its assess action produces a final assessment of how each model did.

conn.loadActionSet('percentile')
prediction = 'P_survived1'

titanic_forest_assessed = conn.percentile.assess(
     table = 'titanic_forest_score',
     inputs = prediction,
     casout = dict(name = 'titanic_forest_assessed', replace = True),
     response = target,
     event = '1'
)

titanic_decisiontree_assessed = conn.percentile.assess(
     table = 'titanic_decisiontree_score',
     inputs = prediction,
     casout = dict(name = 'titanic_decisiontree_assessed', replace = True),
     response = target,
     event = '1'
)

titanic_gradient_assessed = conn.percentile.assess(
     table = 'titanic_gradient_score',
     inputs = prediction,
     casout = dict(name = 'titanic_gradient_assessed', replace = True),
     response = target,
     event = '1'
)

This CAS action returns three types of assessments: lift-related assessments, ROC-related assessments, and concordance statistics. Python is great at graphing data, so now you can move the data locally and visualize the new assessments.

 

Step 13. Analyze the results locally

You can plot the receiver operating characteristic (ROC) curve and the cumulative lift to determine how the models performed. Using the ROC curve, you can then calculate the area under the ROC curve (AUC) to see overall how well the models predicted the survival rate.

What exactly is a ROC curve or lift?

  • A ROC curve is determined by plotting the true positive rate (TPR) against the false positive rate (FPR). The true positive rate is the proportion of actual positive observations that were correctly predicted to be positive. The false positive rate is the proportion of actual negative observations that were incorrectly predicted to be positive. (See the helper functions sketched after this list.)
  • A lift chart is derived from a gains chart. The X axis acts as a percentile, and the Y axis is the ratio of the gains value of your model to the gains value of a model that chooses passengers randomly. That is, it details how many times better the model is than a random choice of cases.
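
Expressed as confusion-matrix arithmetic, the two rates in the list above are (a quick reference sketch, not from the original post):

# True positive rate: the share of actual positives predicted positive
def true_positive_rate(tp, fn):
     return tp / (tp + fn)

# False positive rate: the share of actual negatives predicted positive
def false_positive_rate(fp, tn):
     return fp / (fp + tn)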

Before you can plot these tables locally, you need to create a connection to them. CAS created some new assessed tables, so create a connection to these CAS tables for analysis:

# Assess Forest
titanic_assess_ROC_Forest = conn.CASTable('titanic_forest_assessed_ROC')
titanic_assess_Lift_Forest = conn.CASTable('titanic_forest_assessed')

titanic_ROC_pandas_Forest = titanic_assess_ROC_Forest.to_frame()
titanic_Lift_pandas_Forest = titanic_assess_Lift_Forest.to_frame()

# Assess Decision Tree
titanic_assess_ROC_DT = conn.CASTable('titanic_decisiontree_assessed_ROC')
titanic_assess_Lift_DT = conn.CASTable('titanic_decisiontree_assessed')

titanic_ROC_pandas_DT = titanic_assess_ROC_DT.to_frame()
titanic_Lift_pandas_DT = titanic_assess_Lift_DT.to_frame()

# Assess GB
titanic_assess_ROC_gb = conn.CASTable('titanic_gradient_assessed_ROC')
titanic_assess_Lift_gb = conn.CASTable('titanic_gradient_assessed')

titanic_ROC_pandas_gb = titanic_assess_ROC_gb.to_frame()
titanic_Lift_pandas_gb = titanic_assess_Lift_gb.to_frame()

Now that there is a connection to these tables, you can use the Matplotlib library to plot the ROC curve. Plot each model on this graph to see which model performed the best:
# Plot ROC Locally
plt.figure(figsize = (10,10))
plt.plot(1-titanic_ROC_pandas_Forest['_Specificity_'],
         titanic_ROC_pandas_Forest['_Sensitivity_'], 'bo-', linewidth = 3)
plt.plot(1-titanic_ROC_pandas_DT['_Specificity_'],
         titanic_ROC_pandas_DT['_Sensitivity_'], 'ro-', linewidth = 3)
plt.plot(1-titanic_ROC_pandas_gb['_Specificity_'],
         titanic_ROC_pandas_gb['_Sensitivity_'], 'go-', linewidth = 3)
plt.plot(pd.Series(range(0,11,1))/10, pd.Series(range(0,11,1))/10, 'k--')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.legend(['Forest', 'Decision Tree', 'Gradient Boosting'])
plt.show()

You can also take the depth and cumulative lift scores from the assessed data set and plot that information:

# Plot Lift Locally
plt.figure(figsize = (10,10))
plt.plot(titanic_Lift_pandas_Forest['_Depth_'], titanic_Lift_pandas_Forest['_CumLift_'], 'bo-', linewidth = 3)
plt.plot(titanic_Lift_pandas_DT['_Depth_'], titanic_Lift_pandas_DT['_CumLift_'], 'ro-', linewidth = 3)
plt.plot(titanic_Lift_pandas_gb['_Depth_'], titanic_Lift_pandas_gb['_CumLift_'], 'go-', linewidth = 3)
plt.xlabel('Depth')
plt.ylabel('Cumulative Lift')
plt.title('Cumulative Lift Curve')
plt.legend(['Forest', 'Decision Tree', 'Gradient Boosting'])
plt.show()

Although these curves work well for exploring model success and how you might better tune the models, most people typically just want to see overall how well a model performed. To get a general idea, you can integrate the ROC curve to obtain the AUC. This area is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance.

# Forest Scores
x_forest = np.array(titanic_ROC_pandas_Forest['_Specificity_'])
y_forest = np.array(titanic_ROC_pandas_Forest['_Sensitivity_'])

# Decision Tree Scores
x_dt = np.array(titanic_ROC_pandas_DT['_Specificity_'])
y_dt = np.array(titanic_ROC_pandas_DT['_Sensitivity_'])

# GB Scores
x_gb = np.array(titanic_ROC_pandas_gb['_Specificity_'])
y_gb = np.array(titanic_ROC_pandas_gb['_Sensitivity_'])

# Calculate Area Under Curve (Integrate)
area_forest = trapz(y_forest, x_forest)
area_dt = trapz(y_dt, x_dt)
area_gb = trapz(y_gb, x_gb)

# Table For Model Scores
Model_Results = pd.DataFrame({
'Model': ['Forest', 'Decision Tree', 'Gradient Boosting'],
'Score': [area_forest, area_dt, area_gb]})

Model_Results

With the AUC ROC score, you can now see how well the model performs at distinguishing between positive and negative outcomes.

 

Conclusion

Any machine learning engineer should take time to further investigate integrating SAS Viya with their normal programming environment. Working through the SWAT Python interface, CAS excels at quickly building and scoring a model. You can do this even with large data sets, because the data is stored in CAS memory. If you want to go further in-depth with ensemble methods, I would recommend using SAS® Model Studio on SAS Viya, or perhaps one of the many great open-source libraries, like scikit-learn in Python.

The ability of CAS to quickly clean, model, and score a prediction is quite impressive. If you would like to take a further look at what SWAT and CAS can do, check out the many CAS action sets that are available.

If you would like some more information about SWAT and SAS® Viya®, see these resources:

Would you like to see SWAT machine learning work with a larger data set, or perhaps use SWAT to build a neural network? Please leave a comment!

Building Machine Learning Models by Integrating Python and SAS® Viya® was published on SAS Users.

December 12, 2019
 

Parts 1 and 2 of this blog post discussed exploring and preparing your data using SASPy. To recap, Part 1 discussed how to explore data using the SASPy interface with Python. Part 2 continued with an explanation of how to prepare your data to use it with a machine-learning model. This final installment continues the discussion about preparation by explaining techniques for normalizing your numerical data and one-hot encoding categorical variables.

Normalizing your data

Some of the numerical features in the examples from parts 1 and 2 have different ranges. The variable Age spans from 0-100 and Hours_per_week from 0-80. These ranges affect the calculation of each feature when you apply them to a supervised learner. To ensure the equal treatment of each feature, you need to scale the numerical features.

The following example uses the SAS STDIZE procedure to scale the numerical features. PROC STDIZE is not a standard procedure available in the SASPy library.  However, the good news is you can add any SAS procedure to SASPy! This feature enables Python users to become a part of the SAS community by contributing to the SASPy library and giving them the chance to use the vast number of powerful SAS procedures. To add PROC STDIZE to SASPy, see the instructions in the blog post Adding SAS procedures to the SASPy interface to Python.

After you add the STDIZE to SASPy, run the following code to scale the numerical features. The resulting values will be between 0 and 1.

# Creating a SASPy stat object
stat = sas.sasstat()
# Use the stdize function that was added to SASPy to scale our features
stat_result = stat.stdize(data=cen_data_logTransform,
                          procopts = 'method=range out=Sasuser.afternorm',
                          var = 'age education_num capital_gain capital_loss hours_per_week')

To use the STDIZE procedure in SASPy, you need to specify the method and the output data set in the statement options. For this, use the "procopts" option: specify range as the method and set the "out" option to a new SAS data set, afternorm.

After running the STDIZE procedure, we assign the new data set to a SAS data object:

norm_data = sas.sasdata('afternorm', libref='SASuser')

Now let's verify that we were successful in transforming our numerical features:

norm_data.head(obs=5)
The output looks great! You have normalized the numerical features. So, it's time to tackle the last data-preparation step.

One-Hot Encoding

Now that you have adjusted the numerical features, what do you do with the categorical features? The categories Relationship, Race, Sex, and so on are in string format. Statistical models cannot interpret these values, so you need to transform the values from strings into a numerical representation. The one-hot encoding process provides the transformation you need for your data.

To use one-hot encoding, use the LOGISTIC procedure from the SASPy Stat class. SASPy natively includes the LOGISTIC procedure, so you can go straight to coding. To generate the syntax for the code below, I followed the instructions from Usage Note 23217: Saving the coded design matrix of a model to a data set.

stat_proc_log = stat.logistic(data=norm_data, procopts='outdesign=SASuser.test1 outdesignonly',
            cls = "workclass education_level marital_status occupation relationship race sex native_country / param=glm",
            model = "age = workclass education_level marital_status occupation relationship race sex native_country / noint")

To view the results from this code, create a SAS data object from the newly created data set, as shown in this example:

one_hot_data = sas.sasdata('test1', libref='SASuser')
display(one_hot_data.head(obs=5))

Our data was successfully one-hot encoded! For future reference, this step is not strictly required, thanks to SAS’ analytical power: when you include a categorical feature in a CLASS statement, the procedure automatically generates a design matrix with the one-hot encoded feature. For more information, I recommend reading this post about different ways to create a design matrix in SAS.

Finally

You made it to the end of the journey! I hope everyone who reads these blogs can see the value that SASPy brings to the machine-learning community. Give SASPy  a try, and you'll see the power it can bring to your SAS solutions.

Stay curious, keep learning, and (most of all) continue innovating.

Machine Learning with SASPy: Exploring and Preparing your data - Part 3 was published on SAS Users.

December 12, 2019
 

Bringing the power of SAS to your Python scripts can be a game changer. An easy way to do that is by using SASPy, a Python interface to SAS allowing Python developers to use SAS® procedures within Python. However, not all SAS procedures are included in the SASPy library. So, what do you do if you want to use those excluded procedures? Easy! The SASPy library contains functionality enabling you to add SAS procedures to the SASPy library. In this post, I'll explain the process.

The basics for adding procedures are covered in the Contributing new methods section in the SASPy documentation. To further assist you, this post expands upon the steps, providing step-by-step details for adding the STDIZE procedure to SASPy. For a hands-on application of the use case, refer to the blog post Machine Learning with SASPy: Exploring and Preparing your data - Part 3.

This is your chance to contribute to the project! You can follow the steps below as a one-off solution, or you can choose to share your work and incorporate it into the SASPy repository.

Prerequisites

Before you add a procedure to SASPy, you need to perform these prerequisite steps:

  1. Identify the SAS product associated with the procedure you want to add, e.g. SAS/STAT, SAS/ETS, SAS Enterprise Miner, etc.
  2. Locate the SASPy file (for example, sasstat.py, sasets.py, and so on) corresponding to the product from step 1.
  3. Ensure you have a current license for the SAS product in question.

Adding a SAS procedure to SASPy

SASPy utilizes Python Decorators to generate the code for adding SAS procedures. Roughly, the process is:

  1. define the procedure
  2. generate the code to add
  3. add the code to the proper SASPy file
  4. (optional) create a pull request to add the procedure to the SASPy repository

Below we'll walk through each step in detail.

Create a set of valid statements

Start a new Python session with Jupyter and create a list of valid arguments for the chosen procedure. You determine the arguments for the procedure by searching for your procedure in the appropriate SAS documentation. For example, the PROC STDIZE arguments are documented in the SAS/STAT® 15.1 User's Guide, in The STDIZE Procedure section, with the contents:

The STDIZE procedure

For example, I submitted the following command to create a set of valid arguments for PROC STDIZE:

lset = {'STDIZE', 'BY', 'FREQ', 'LOCATION', 'SCALE', 'VAR', 'WEIGHT'}

Call the doc_convert method

The doc_convert method takes two arguments: the set of valid statements (lset) and the procedure name ('STDIZE'). It returns a dictionary whose method_stmt and markup_stmt keys hold the generated method call and docstring markup.

import saspy
 
print(saspy.sasdecorator.procDecorator.doc_convert(lset, 'STDIZE')['method_stmt'])
print(saspy.sasdecorator.procDecorator.doc_convert(lset, 'STDIZE')['markup_stmt'])

The command generates the method call and the docstring markup like the following:

def STDIZE(self, data: ['SASdata', str] = None,
   by: [str, list] = None,
   freq: str = None,
   location: str = None,
   scale: str = None,
   stdize: str = None,
   var: str = None,
   weight: str = None,
   procopts: str = None,
   stmtpassthrough: str = None,
   **kwargs: dict) -> 'SASresults':
   Python method to call the STDIZE procedure.
 
   Documentation link:
 
   :param data: SASdata object or string. This parameter is required.
   :parm by: The by variable can be a string or list type.
   :parm freq: The freq variable can only be a string type.
   :parm location: The location variable can only be a string type.
   :parm scale: The scale variable can only be a string type.
   :parm stdize: The stdize variable can be a string type.
   :parm var: The var variable can only be a string type.
   :parm weight: The weight variable can be a string type.
   :parm procopts: The procopts variable is a generic option avaiable for advanced use It can only be a string type.
   :parm stmtpassthrough: The stmtpassthrough variable is a generic option available for advanced use. It can only be a string type.
   :return: SAS Result Object

Update SASPy product file

We'll take the output and add it to the appropriate product file (sasstat.py in this case). When you open this file, be sure to open it with administrative privileges so you can save the changes. Prior to adding the code to the product file, perform the following tasks:

  1. add @procDecorator.proc_decorator({}) before the function definition
  2. add the proper documentation link from the SAS Programming Documentation site
  3. add triple quotes (""") around the second section of code, turning the generated markup into the method's docstring
  4. include any additional details others might find helpful

The following output shows the final code to add to the sasstat.py file:

@procDecorator.proc_decorator({})
def STDIZE(self, data: ['SASdata', str] = None,
   by: [str, list] = None,
   location: str = None,
   scale: str = None,
   stdize: str = None,
   var: str = None,
   weight: str = None,
   procopts: str = None,
   stmtpassthrough: str = None,
   **kwargs: dict) -> 'SASresults':
   """
   Python method to call the STDIZE procedure.
 
   Documentation link:
   https://go.documentation.sas.com/?cdcId=pgmsascdc&cdcVersion=9.4_3.4&docsetId=statug&docsetTarget=statug_stdize_toc.htm&locale=en
 
   :param data: SASdata object or string. This parameter is required.
   :parm by: The by variable can be a string or list type.
   :parm freq: The freq variable can only be a string type.
   :parm location: The location variable can only be a string type.
   :parm scale: The scale variable can only be a string type.
   :parm stdize: The stdize variable can be a string type.
   :parm var: The var variable can only be a string type.
   :parm weight: The weight variable can be a string type.
   :parm procopts: The procopts variable is a generic option available for advanced use. It can only be a string type.
   :parm stmtpassthrough: The stmtpassthrough variable is a generic option available for advanced use. It can only be a string type.
   :return: SAS Result Object
   """

Update sasdecorator file with the new method

Alter the sasdecorator.py file by adding 'stdize' to the list on line 29, as shown below.

if proc in ['hplogistic', 'hpreg', 'stdize']:

Important: The update to the sasdecorator file is required only when you add a procedure with no plot options. The decorator assumes all procedures produce plots; PROC STDIZE does not. So, perform this step ONLY when your procedure has no plot options. This behavior will likely change in a future release, so follow the GitHub page for updates.

Document a test for your function

Make sure you write at least one test for the procedure, then add it to the appropriate testing file. A minimal sketch follows.
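
The sketch below is a starting point, not a definitive test: it assumes a working SASPy connection (configured in your sascfg_personal.py), uses the SASHELP.CLASS table that ships with SAS, and makes a simple, hypothetical assertion; adapt the checks to the conventions in the SASPy test suite.

import unittest
import saspy
 
class TestStdizeProc(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # assumes a working SASPy connection configuration
        cls.sas = saspy.SASsession()
        cls.stat = cls.sas.sasstat()
 
    @classmethod
    def tearDownClass(cls):
        cls.sas.endsas()
 
    def test_stdize_runs_clean(self):
        # SASHELP.CLASS ships with every SAS installation
        tbl = self.sas.sasdata('class', libref='sashelp')
        res = self.stat.STDIZE(data=tbl, var='height weight')
        # a clean run should not surface the error log on the results object
        self.assertNotIn('ERROR_LOG', dir(res))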

Finally

Congratulations! All done. You now have the knowledge to add even more procedures in the future.

After you add your procedure, I highly recommend you contribute your procedure to the SASPy GitHub library. To contribute, follow the outlined instructions on the Contributing Rules GitHub page.

Adding SAS® procedures to the SASPy interface to Python was published on SAS Users.

December 4, 2019
 

Site relaunches with improved content, organization and navigation.

In 2016, a cross-divisional SAS team created developer.sas.com. Their mission: Build a bridge between SAS (and our software) and open source developers.

The initial effort made available basic information about SAS® Viya® and integration with open source technologies. In June 2018, the Developer Advocate role was created to build on that foundation. Collaborating with many of you, the SAS Communities team has improved the site by clarifying its scope and updating it consistently with helpful content.

Design is an iterative process. One idea often builds on another.

-- businessman Mark Parker

The team is happy to report that recently developer.sas.com relaunched, with marked improvements in content, organization and navigation. Please check it out and share with others.

New overview page on developer.sas.com

The developer experience

The developer experience goes beyond the developer.sas.com portal. The Q&A below provides more perspective and background.

What is the developer experience?

Think of the developer experience (DX) as the equivalent of the user experience (UX), except the developer interacts with the software through code rather than point-and-click. Developers expect and require an easy interface to software code, good documentation, support resources and open communication. All this interaction occurs on the developer portal.

What is a developer portal?

The white paper Developer Portal Components captures the key elements of a developer portal. Without going into detail, the portal must contain (or link to) these resources: an overview page, onboarding pages, guides, API reference, forums and support, and software development kits (SDKs). In conjunction with the Developers Community, the site’s relaunch includes most of these items.

Who are these developers?

Many developers fit somewhere in these categories:

  • Data scientists and analysts who code in open source languages (mainly Python and R in this case).
  • Web application developers who create apps that require data and processing from SAS.
  • IT service admins who manage customer environments.

All need to interact with SAS but may not have written SAS code. We want this population to benefit from our software.

What is open source and how is SAS involved?

Simply put, open source software is just what the name implies: the source code is open to all. Many of the programs in use every day are based on open source technologies: operating systems, programming languages, web browsers and servers, etc. Leveraging open source technologies and integrating them with commercial software is a popular industry trend today. SAS is keeping up with the market by providing tools that allow open source developers to interact with SAS software.

What is an API?

All communications between open source and SAS are possible through APIs, or application programming interfaces. APIs allow software systems to communicate with one another. Software companies expose their APIs so developers can incorporate functionality and send or request data from the software.

Why does SAS care about APIs?

APIs allow the use of SAS analytics outside of SAS software. By allowing developers to communicate with SAS through APIs, customer applications easily incorporate SAS functions. SAS has created various libraries to aid in open source integration. These tools allow developers to code in the language of their choice, yet still interface with SAS. Most of these tools exist on github.com/sassoftware or on the REST API guides page.

A use case for SAS APIs

A classic use of SAS APIs is for a loan default application. A bank creates a model in SAS that determines the likelihood of a customer defaulting on a loan based on multiple factors. The bank also builds an application where a bank representative enters the information for a new potential customer. The bank application code uses APIs to communicate this information to the SAS model and return a credit decision.
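
As a rough sketch of that flow, the bank application might POST the applicant's information to the model where it is published as a scoring module in SAS Viya. Everything below is a hedged illustration, not the bank's actual code: the host name, token handling, module name, and input fields are hypothetical placeholders, and the endpoint shown follows the general pattern of the SAS Viya Micro Analytic Score REST service.

import requests
 
# hypothetical host and OAuth token; both depend on your Viya deployment
VIYA_HOST = 'https://viya.example.com'
TOKEN = 'your-oauth-token'
 
# hypothetical input fields for the published loan-default model
applicant = {'inputs': [
    {'name': 'income',      'value': 52000},
    {'name': 'loan_amount', 'value': 15000},
    {'name': 'credit_age',  'value': 7}
]}
 
resp = requests.post(
    VIYA_HOST + '/microanalyticScore/modules/loan_default/steps/score',
    headers={'Authorization': 'Bearer ' + TOKEN,
             'Content-Type': 'application/json'},
    json=applicant)
resp.raise_for_status()
 
# the credit decision (e.g., probability of default) comes back as JSON
print(resp.json())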

What is a developer advocate?

A developer advocate is someone who helps developers succeed with a platform or technology. Their role is to act as a bridge between the engineering team and the developer community. At SAS, the developer advocate fields questions and comments on the Developers Community and works with R&D to provide answers. The administration of developer.sas.com also falls under the responsibility of the developer advocate.

We’re not done

The site will continue to evolve, with additions of other SAS products and offerings, and other initiatives. Check back often to see what's new.

Now that you are an open source and SAS expert, please check out the new developer.sas.com. We encourage feedback and suggestions for content. Leave comments and questions on the site or contact Joe Furbee: joe.furbee@sas.com.

developer.sas.com 2.0: More than just a pretty interface was published on SAS Users.