Tech

3月 222023
 

Welcome to the continuation of my series Getting Started with Python Integration to SAS Viya. In this post I'll discuss how to count missing values in a CAS table using the Python SWAT package.

Load and prepare data

First, I connect my Python client to the distributed CAS server and named my connection conn. Next, I create a small pandas DataFrame with missing values for the demonstration. Further, I load the DafaFrame to the CAS server using the upload_frame method and and name the CAS table MISSING_DATA and place it in the Casuser caslib. The upload_frame method returns a reference to the CAS table in the variable castbl.

For more information about uploading client-side data to the CAS server, check out my previous post.

conn = ## your CAS connection information
 
## Create a simple dataframe
df = pd.DataFrame([
            [np.nan, 2, 45, 0, 'A'],
            [3, 4, np.nan, 1,'A'],
            [np.nan, np.nan, 50, np.nan,'B'],
            [np.nan, 3, np.nan, 4,],
            [2, 2, np.nan, 0, 'A'],
            [3, 4, np.nan, 1,'A'],
            [np.nan, np.nan, 75, np.nan,'B'],
            [np.nan, 3, 60, 4,]
            ],
            columns=['col1','col2','col3','col4','col5'])
 
## Upload the dataframe to the CAS server as a CAS table
castbl = conn.upload_frame(df,
                           casout = {'name':'missing_data', 
                                     'caslib':'casuser', 
                                     'replace':True})
 
# and the results
NOTE: Cloud Analytic Services made the uploaded file available as table MISSING_DATA in caslib CASUSER(Peter).
NOTE: The table MISSING_DATA has been created in caslib CASUSER(Peter) from binary data uploaded to Cloud Analytic Services.

The results show that the MISSING_DATA CAS table has been loaded to the CAS server.

Lastly, I'll preview the distributed CAS table using the SWAT package head method.

castbl.head(10)

Count missing values in a CAS table

There are a variety of ways to count the number missing values in a CAS table. Counting missing values in CAS tables is not exactly the same in the SWAT package as it is in pandas. However, it's just as easy. Let's look at a few different methods.

Using the SWAT package nmiss method

For users experienced with pandas, you traditionally run the isna and sum methods to find the number of missing values in each column of a DataFrame. For CAS tables it's even easier. You can use the SWAT package nmiss method. The nmiss method returns the total number of missing values in each column.

Here I'll specify the CAS table object castbl, then the SWAT nmiss method.

castbl.nmiss()

The distributed CAS server counts the number of missing values in each column and returns a Series to the Python client. Once you have the Series object you can use traditional pandas for additional processing. For example, I'll chain the pandas plot method after the SWAT nmiss method to plot the series. The summarization occurs in the distributed CAS server and the plotting occurs on the client.

(castbl                    ## CAS table
 .nmiss()                  ## SWAT method
 .plot(kind = 'bar'))      ## pandas plot

You can also use the nmiss method to find the number of missing values in specific columns. Simply specify the CASTable object, the columns, then the method.

colNames = ['col1','col5']
 
castbl[colNames].nmiss()

Using the Distinct CAS action

The distinct CAS action is one of my favorite actions. It not only finds the number of missing values, but it gives you the number of distinct values in each column. Now, depending on the size of your data and the number of distinct values, this one can be a bit more resource intensive since finding the number of distinct values in a column can be more time consuming.

To use the distinct CAS action specify the CASTable object, castbl, then the distinct CAS action.

castbl.distinct()

The distinct action returns a CASResults object (dictionary) back to the Python client with the number of distinct and missing values.

In the distinct action you can use the inputs parameter to specify the columns you want to summarize.

castbl.distinct(inputs = colNames)

The results show the number of distinct and missing values in the specified columns. Now with the summarized data form the CAS server in a CASResults object you can use your Python client to continue working with the data. For more information on working with CASResults objects, check out my previous post.

Using the summary CAS action

You can also use the summary CAS action to find the number of missing values for numeric columns. The summary action will also generate a variety of descriptive statistics such as the mean, variance, size, sum of squares and more.

castbl.summary()

The results show the descriptive statistics. The NMiss column contains the number of missing values for the numeric columns.

Within the summary action you can also specify the columns to analyze with the inputs parameter, and the summary statistics to generate with the subSet parameter.

castbl.summary(inputs = ['col1','col2'], 
               subSet = ['min','max','nmiss'])

Summary

The SWAT package blends the world of pandas and CAS to process your distributed data. In this example I focused on using specific SWAT methods and CAS actions to count the number of missing values in CAS table columns.

Additional and related resources

Getting Started with Python Integration to SAS® Viya® - Part 15 - Count Missing Values in a CAS Table was published on SAS Users.

3月 202023
 

Welcome to my series on getting started with Python integration to SAS Viya for predictive modeling.

  1. Exploring Data - Learn how to explore the data before fitting a model
  2. Working with Dates - Learn how to format a SAS Date and calculate a new column
  3. Imputing Missing Values - Learn how to replace missing values
  4. Creating Training and Validation Data Sets - Learn a way to split your data into a training and validation data set to be used for modeling
  5. Fitting a Linear Regression - Learn how to fit a linear regression and use your model to score new data
  6. Fitting a Logistic Regression - Learn how to fit a logistic regression and use your model to score new data
  7. Fitting a Decision Tree - Learn how to fit a decision tree and use your decision tree model to score new data
  8. Comparing Logistic Regression and Decision Tree - Which of our models is better at predicting our outcome? Learn how to compare models using misclassification, area under the curve (ROC) charts, and lift charts with validation data.
  9. Fitting a Random Forest - Learn how to fit a random forest and use your model to score new data
  10. Fitting a Gradient Boosting Model - Learn how to fit a gradient boosting model and use your model to score new data
  11. Autotuning Your Model - Learn how to improve your models by utilizing the built in autotuning feature in SAS Viya
  12. Creating new Features Automatically - Learn how to utilize SAS to automatically create your features for your models
  13. Creating Machine Learning Pipelines Automatically - Learn how use AI withing SAS Viya to create machine learning pipelines that start with the data, create new features, fit several models, and select the best model overall.

Getting Started with Python Integration to SAS Viya for Predictive Modeling - Index was published on SAS Users.

3月 202023
 

Welcome to the first post in my series Getting Started with Python Integration to SAS Viya for Predictive Modeling.

I'm going to dive right into the content assuming you have minimal knowledge on SAS Cloud Analytic Services (CAS), CAS Actions and Python. For some background on these subjects, refer to the following:

Now let's get started and learn how to explore the data before fitting a model. Before fitting any models, it is essential to inspect the data and explore questions such as: How accurate is the data? Are there missing values? Is there a possibility of incorrect or faulty information in some columns? Answering these queries can help you ensure that your model works correctly.

Load Data

For this Blog series I will use the home_equity.csv file from the SAS Viya example data library. For additional information on loading data into CAS see Loading a Client-Side CSV File into CAS or Loading Server-Side Files into Memory.

After making a connection to CAS (for more information on connecting to CAS see Making a Connection) load the Home Equity data into memory on CAS. Here the word conn is our connection object to CAS.

HomeEquity = conn.upload_file("https://support.sas.com/documentation/onlinedoc/viya/exampledatasets/home_equity.csv",   
                     casOut={"name":"HomeEquity",
                             "caslib":"casuser",
                             "replace":True})

The data is now not only loaded onto the CAS server, but also into CAS memory, and ready to be explored.

HomeEquity is our CAS table object and will be how I refer to and access the data in the code below.

Before I start building any predictive model, there are always certain questions I make sure to ask regarding my data.

  • How many rows and columns?
  • What type of columns (numeric or categorical)?
  • What do the values look like?
  • Are there any issues I can see visually (think dirty data)?
  • Are there any missing values?

Python (pandas) and CAS Actions provide us with an effective way to execute these tasks. Of course, there are multiple approaches available depending on your preferences. In this tutorial, I will be utilizing CAS Actions primarily because the requests are processed quickly and efficiently on a dedicated server when dealing with huge data sets.

How Many Columns in the Data

To answer the question on the number of columns in the data, let’s use the table action set and the recordCount action.

conn.table.recordCount(table='HomeEquity')

This data has 5,960 rows.

Traditional Python methods are available as well. If you want both the number of columns and number of rows, use the shape method.

display(HomeEquity.shape)

HomeEquity has not only 5,960 rows but also 18 columns.

How to Look at Column Information

Let's delve deeper and investigate which columns our data contains and what types of data each holds. Use the table action set and the columnInfo action.

HomeEquity.table.columnInfo()

Of the 18 columns, we have 6 that are categorical, represented by varchar and the other 12 are numerical represented by double. Some of the latter may be categorical as well, but we will have to do further exploration to determine that. Let’s look.

Review the Data

Let's take a closer look at the values of the columns. By utilizing both the table action set and fetch action, we can glance at the initial five rows with ease.

conn.table.fetch(table='HomeEquity', to=5)

When I look at BAD, I notice its values are 1 and 0. However, REASON and JOB have only one value each. Are there any additional values for these three columns?

I also notice a portion of data is missing, represented as NaN (in numerical form) or an empty field for character values.

It appears that APPDATE is a SAS Date, and we will dive into how to work with Dates in the next post.

Descriptive Statistics

Using descriptive statistics will help us see more of the information in the data: the number of values for categorical, the number of missing for all the columns, and the mean, minimum, and maximum for numerical columns.

Using three different actions (distinct, freq, and summary) from the simple action set, let’s explore the data values further.

Distinct action

The distinct action gives us both the number of unique values for each column and the number of missing values.

conn.simple.distinct(table = HomeEquity)

We have 5 columns with less than 10 unique values, these are potential categorical inputs (often referred to as nominal) for our model. For categorical inputs, we want to minimize the number of levels because more levels mean more complexity for our models.

We also have several columns with missing values. Part 3 of this series will address how to work with these columns.

Freq action

The freq action gives us the frequency of each of the unique values for the individual columns. In this example the inputs = option is used to only look at the frequency for the categorical columns with seven or fewer levels. To keep it simple we will use REGION instead of DIVISION to represent a geographic element in our model.

As you can imagine using freq on all columns would produce a lot of output. 😊

conn.simple.freq(table = 'HomeEquity',
    inputs = ["BAD","REASON","JOB","REGION"])

This output shows that we have missing values for REASON and JOB, but not BAD and REGION. This data also reveals that the columns aren't sparse, as there is a decent proportion of rows for each value.

Summary action

The summary action only creates summary descriptive statistics for numerical columns because the mean of a categorical doesn’t make sense 😊.

In this example, all the summary statistics are calculated and include the minimum (MIN), maximum (MAX), number of rows (N), number missing (NMISS), mean, sum, standard deviation (STD), standard error (STDERR), variance (VAR), Uncorrected Sum of Squares (USS), Corrected Sum of Squares (CSS), Coefficient of variation (CV), t-statistic (TValue), the p-value for statistic (ProbT), Skewness, and Kurtosis.

conn.simple.summary(table='HomeEquity')

With the subSet= option, we quickly get a clearer picture of the descriptive statistics of interest.

conn.simple.summary(table='HomeEquity',
                    subSet=["N","NMISS","MEAN","MIN","MAX"])

By examining the descriptive statistics, we can detect any dirty data present. Thankfully, there appears to be none here! We'll also take note of missing values in each column which is something that will get discussed during Part 3 of this series.

Advanced statistics such as Skewness and Kurtosis may come into play when deciding which columns or transformations to use for inputs, but they won't be utilizing them for this series.

The Wrap-Up: Exploring Data

In conclusion, exploring data is the first and most important step before building a predictive model. Using the SAS CAS action sets for simple descriptive statistics helps us quickly identify missing values and potential categorical inputs for our model. This post focused on how to use the Fetch, Distinct, Freq, and Summary actions from the Simple Action Set. In the next post, we will learn how to work with dates in the data.

 

 

 

Getting Started with Python Integration to SAS Viya for Predictive Modeling - Exploring Data was published on SAS Users.

3月 112023
 

As a software and services company, SAS has been providing technical support to customers for over 40 years—decades longer than most software companies have been in existence. We are proud of the high customer satisfaction ratings and valuable relationships that customers have formed with our support engineers.

My background

I am fairly new to SAS and am excited to start my first full year leading Technical Support. I have over 23 years of experience working in the software field, primarily in support, which has made me truly value customers and providing exceptional support. That is why I wanted to connect with you and let you know about some improvements coming in 2023.

One of the things that drew me to SAS was data analytics, which is a central part of my vision for providing excellent support. In my leadership role, I analyze data to understand what works and what needs to be improved. I have a special interest in creating efficiency for the support team by using tools, reducing case volume by leveraging self-help resources and ensuring customer satisfaction by providing processes that work well. See my LinkedIn profile for more details.

New ServiceNow Customer Service Management (CSM) system

Our biggest plan for 2023 is to move our key Tech Support applications to a new CSM. This project contains many aspects and will be accomplished in a phased approach. Internally, our engineers will adopt an advanced issue-tracking system and a new knowledge-management tool with improved functionality. Externally, the application that you use to access your tracks (soon to be called “cases”) with SAS Technical Support will be replaced.

Right now, we are in the initial stages of the project. We anticipate that all our systems will be moved by the end of the third quarter. I will update you about our progress over the next several months.

3 ways your experience will be better

This transition is going to greatly improve your customer experience in the following ways:

  1. Your interactions with Technical Support will be much easier to manage.
  2. Once the new system is launched on the SAS Support site, self-help resources that are related to your issue will be automatically suggested by the CSM when you open a case with SAS Tech Support. As we use the CSM more, the suggestions will improve because the CSM uses machine learning behind-the-scenes to match queries to results.
  3. You’ll also experience an improved portal interface to communicate with Tech Support engineers who are assigned to your case.

New global cloud-based phone system: call us anytime

In the first quarter, we are moving about half of our engineers to a cloud-based phone system. It’s the same system that we have been using on-premises for the past few years, so it won’t be completely new. Then, by the end of the second quarter, we plan to have all of our engineers using this system. Since we have over 500 global TS engineers, this is a big project.

You shouldn’t experience any interruption in service as a result of this new global, cloud-based phone system. We’ll start to leverage this system later this year so that no matter which phone number you use to call SAS or what time you call, you’ll be routed to an engineer who can assist*. By the end of the second quarter of 2023, you will no longer have to call only during certain hours. Support calls will be accepted 24/7.

*Note that this change to how we route calls does not mean that we are changing our support hours; it just means that we will be able to route calls to a Tech Support office that is open.

Behind-the-scenes work

These are some of the projects that we are working on that will directly affect you. Inside Tech Support, we are working on many projects that are aimed at helping our engineers to provide better and more efficient service to you. We are leveraging SAS® software to analyze data across SAS to identify and close information gaps, which will help us improve our software and our customer support. We want you to be able to effectively use our software to accomplish your goals. Be sure to look for future updates from me!

2023 Plans for SAS Technical Support was published on SAS Users.

2月 232023
 

As you know, almost every SAS programming problem has many very different solutions. I’m going to solve a very simple problem using two different approaches.

The problem: Compute the sum of integers from 1 to 1,000,000.

I bet most of you thought of a solution almost immediately. Let me guess that you thought of one of the solutions shown below:

options fullstimer; ❶
 
Solution 1
data _null_; ❷
   do i = 1 to 1000000;
      Sum + i; ❸
   end;
   file print; ❹
   put Sum=;
run;

The option fullstimer will give you more complete timing information.

❷ To be more efficient use data _null_.

❸ The SUM statement does several things: 1) the variable Sum is retained (not set back to missing) for each iteration of the DATA Step. 2) Sum it initialized at zero. 3) If you had an expression instead of the constant (1) missing values would be ignored.

❹ Use FILE PRINT to send the output to the output window instead of the default location, the LOG.

Solution 2

data Integers; ❶
   do i = 1 to 10000000;
      output; ❷
   end;
run;
 
title "Sum of Integers";
proc means data=Integers n sum; ❸
   var i;
run;

Create a data set called Integers.

❷ Output an observation for each iteration of the DATA Step.  Note that the OUTPUT statement is inside the DO Loop.

❸ Use PROC MEANS to compute the sum.

Although both programs work, there is a difference in CPU time. Does that mean you should always seek a DATA Step solution? Not really. It depends on several factors, such as how often the program is to be run and which method you feel most comfortable with.

Here is a partial listing of the SAS Log showing timing information:

NOTE: 1 lines were written to file PRINT.
NOTE: DATA statement used (Total process time):
      real time           1.00 seconds
      user cpu time       0.21 seconds
      system cpu time     0.32 seconds
      memory              7875.03k
      OS Memory           16876.00k
      Timestamp           02/16/2023 08:23:18 AM
      Step Count           1  Switch Count  0

NOTE: The data set WORK.INTEGERS has 10000000 observations and 1
      variables.
NOTE: DATA statement used (Total process time):
      real time           0.21 seconds
      user cpu time       0.20 seconds
      system cpu time     0.00 seconds
      memory              410.03k
      OS Memory           17392.00k
      Timestamp           02/16/2023 08:23:18 AM
      Step Count           2  Switch Count  0
NOTE: There were 10000000 observations read from the data set
      WORK.INTEGERS.
NOTE: PROCEDURE MEANS used (Total process time):
      real time           0.25 seconds
      user cpu time       1.07 seconds
      system cpu time     0.03 seconds
      memory              8471.84k
      OS Memory           25116.00k
      Timestamp           02/16/2023 08:23:19 AM
      Step Count           3  Switch Count  0

Do you care about CPU time? Unless this is a production program, I think you should program in a way that is most comfortable (unless you are a compulsive programmer and want the “best” program). By the way, if you remove the FILE PRINT statement from solution 1, the System CPU time is 0.0.  I guess there is some overhead to sending the results to your output device.

I’m interested in what your first instinct was when you read the problem. One of these two, or something else. Please post your comments below.

LEARN MORE | Ron Cody's books on Amazon

DATA Step or PROC? It depends... was published on SAS Users.

2月 142023
 

Welcome to the continuation of my series Getting Started with Python Integration to SAS Viya. In this post I'll discuss how to bring a distributed CAS table back to your Python client as a DataFrame.

In this example, I'm using Python on my laptop (Python client) to connect to the CAS server. Bringing a distributed CAS table back to the Python client as a local DataFrame is typically appropriate for smaller data. If the data sets become larger, it is more efficient to process the data in the CAS server, or bring a smaller subset of data back to the client. The big question is, why would I want to bring a CAS table out of the CAS server's massively parallel processing (MPP) environment back to my client as a DataFrame?

 

The CAS server in SAS Viya is setup to connect to a variety of data sources throughout your organization like databases, folder paths, cloud data and more. Typically these connections are setup by your administrator. Having these connections setup makes it easy to access data. In this example, maybe the data you need is available to the CAS server and is small enough where you don't need to use the power of the CAS server's MPP environment. Maybe you simply want to pull a CAS table back to your client as a DataFrame and use a familiar Python package like numpy, pandas, seaborn, matplotlib, or scikit-learn for data exploration, preprocessing, visualization or machine learning.

Use caution when bringing data back to your Python client

With the to_frame SWAT method you can easily transfer a CAS table to your Python client as a DataFrame. When using the to_frame method YOU MUST USE CAUTION. It will attempt to pull all of the data down from the CAS server regardless of size. If the data is large, this can be time consuming or overwhelm your Python client depending on your available memory.

Load the demonstration data into memory

I've created a connection to my CAS server in the variable conn. I'll use my conn connection object and the SWAT read_csv method to load the cars.csv file from the Example Data Sets for the SAS® Viya® Platform website into memory on the CAS server. I'll add some formats and labels to the CAS table. This is a small table for demonstration purposes.

fileurl = 'https://support.sas.com/documentation/onlinedoc/viya/exampledatasets/cars.csv'
castbl = conn.upload_file(fileurl, 
                          casout = {'name':'cars_cas_table', 
                                    'caslib':'casuser', 
                                    'replace':True},
                          importoptions = {
                              'fileType':'csv',
                              'guessRows':100,
                              'vars': {
                                  'Invoice':{'format':'dollar16.2','label':'Invoice Price'},
                                  'MSRP':{'format':'dollar16.2', 'label':'Manufacturer Suggested Retail Price'},
                                  'Weight':{'format':'comma16.'},
                                  'EngineSize':{'label':'Engine Size (L)'},
                                  'Length':{'label':'Length (IN)'},
                                  'MPG_City':{'label':'MPG (City)}'},
                                  'MPG_Highway':{'label':'MPG (Highway)'}
                              }
                          })
 
# and the results
NOTE: Cloud Analytic Services made the uploaded file available as table CARS_CAS_TABLE in caslib CASUSER(Peter).
NOTE: The table CARS_CAS_TABLE has been created in caslib CASUSER(Peter) from binary data uploaded to Cloud Analytic Services.

Next, I'll execute the tableInfo action to confirm the table was loaded into memory.

conn.tableInfo(caslib = 'casuser')


Lastly I'll execute the columnInfo action to view the column information of the CAS table.

castbl.columnInfo()

 

The results above show the CARS_CAS_TABLE has 15 columns. Some of the columns contain SAS labels and SAS formats.

Pull the entire CAS table to the client as a DataFrame

We saw earlier that the CARS_CAS_TABLE is a small in-memory table on the CAS server. Let's pretend the data was connected to some cloud data storage and was loaded into memory. I noticed that the data is small, and I want to use familiar Python packages to process it.

The first thing I can do is use the to_frame SWAT method on my castbl object to convert the CAS table to a DataFrame. I'll store the result in df and then view it's type.

df = castbl.to_frame()
display(type(df))
 
# and the results
swat.dataframe.SASDataFrame

 

The results show I now have a SASDataFrame. A SASDataFrame lives on the Python client and is a subclass of pandas.DataFrame. Therefore, anything you can do with a pandas.DataFrame will also work with a SASDataFrame. The only difference is that a SASDataFrame object contain extra metadata from the CAS table. For more information on CASTable vs DataFrame vs SASDataFrame check out the SWAT documentation.

You can view the extra metadata of a SASDataFrame with the colinfo attribute.

df.colinfo

The results show the SASDataFrame contains a variety of information about the columns like the data type, name and width. If the column contains a SAS label or format, that is also stored.

Once you have the SASDataFrame on your Python client, you can use pandas. For example, here I'll execute the pandas plot method to create a scatter plot of EngineSize by MPG_City.

df.plot.scatter(x = 'EngineSize', y = 'MPG_City', title = "Engine Size by MPG City");y');

The results show a simple scatter plot using pandas.

Apply column labels to a SASDataFrame

Sometimes you might want to use column labels, as they are more descriptive. Traditional pandas DataFrames don't have the concept of labels, but CAS tables do.  CAS enables you to add more descriptive labels to columns. Since a SASDataFrame stores information from the CAS table, you can use the SWAT apply_labels method to apply labels to the SASDataFrame. For more information on adding labels to a CAS table, check on my previous post Renaming columns.

df.apply_labels()

Notice the apply_labels method applies the SAS column labels (Manufacturer Suggested Retail Price, Invoice Price, Engine Size (L), Length (IN), MPG (City) and MPG (Highway)) to the SASDataFrame.

Pull a sample of the CAS table to the client as a DataFrame

What if the CAS table is big, but you want to work with a sample of the data using a specific Python package? Well, you can use the sample_pct parameter in the to_frame SWAT method to pull a smaller subset of data to your Python client.

Here I'll specify I want ten percent of the data from the CAS table. I'll also add the sample_seed parameter to pull the same random subset of data. Then I'll execute the shape method to count the number of rows and columns in the new SASDataFrame.

df = castbl.to_frame(sample_pct = .1, sample_seed = 99)
df.shape
 
# and the results
(43, 15)

The results show that only 43 rows of data were returned from the CAS table as a SASDataFrame on the Python client.

Apply SAS formats to the SASDataFrame

Lastly, what if you want to apply the SAS formats of a CAS table to your SASDataFrame? You can do that using the format parameter in the SWAT to_frame method, as seen in the code below. I'll also display five rows of the SASDataFrame and the data types of the columns.

df_formats = castbl.to_frame(format=True)
 
display(df_formats.head(), df_formats.dtypes)

The results show the DOLLAR format was applied to the MSRP and Invoice columns, and the COMMA format was applied to the Weight column.

The dtype method shows that when using the format parameter in the to_frame method all columns are returned to your Python client as objects, even if they do not contain a format.

Summary

The SWAT package blends the world of Pandas and CAS. It enables you to use the CAS server's massively parallel processing engine for data exploration, preprocessing and analytics. It also enables you to easily transfer your distributed CAS tables back to your Python client as SASDataFrames for additional processing using other Python packages.

The one thing you must remember when using the to_frame method is that transferring large data can be time consuming or take up all of your Python client's resources. USE IT WITH CAUTION.

Additional and related resources

Getting Started with Python Integration to SAS® Viya® - Part 14 - CAS Table to DataFrame was published on SAS Users.

2月 072023
 

Welcome to the continuation of my series Getting Started with Python Integration to SAS Viya. In previous posts, I discussed how to connect to the CAS serverhow to execute CAS actions, and how your data is organized on the CAS server. In this post I'll discuss loading client-side CSV files into CAS.

Loading data from a client-side file is appropriate for smaller data. If the data sets become larger, it is more efficient to use a server-side load with the loadTable action to access and load the data. For more information check out my previous post loading server-side files into memory,

In this example, I'm using Python on my laptop to connect to the CAS server. Client-side files consist of files on my laptop (client). In these examples I'll load the heart.csv file from the SAS® Viya® Example Data Sets webpage to the CAS server. There are multiple ways to load client-side CSV files into CAS. I'll show you how to do it using the familiar Pandas API in the SWAT package with the read_csv method. I'll also show you using the upload_file and upload_frame SWAT methods.

Using the Pandas API in the SWAT package - read_csv

First, I'll begin with the familiar read_csv method from the SWAT package and store the results in castbl. The SWAT read_csv method calls the Pandas read_csv method, creates a client-side DataFrame, then uploads the DataFrame to the CAS server as an in-memory table. The main difference is the casout parameter. The casout parameter is specific to the SWAT package. It enables you to specify the output CAS table information like the name of the new distributed table and the in-memory location. Here, I'll name the CAS table heart_read_csv and place it in the Casuser caslib. I'll also specify the replace parameter to replace the CAS table if it already exists.

castbl = conn.read_csv(r'https://support.sas.com/documentation/onlinedoc/viya/exampledatasets/heart.csv',
                      casout = {'name':'heart_read_csv', 
                                'caslib':'casuser', 
                                'replace':True})
 
# and the results
NOTE: Cloud Analytic Services made the uploaded file available as table HEART_READ_CSV in caslib CASUSER(Peter).
NOTE: The table HEART_READ_CSV has been created in caslib CASUSER(Peter) from binary data uploaded to Cloud Analytic Services.

Next, I'll execute the tableInfo CAS action to view available CAS tables in the Casuser caslib.

conn.tableInfo(caslib = 'casuser')

The results above show that the table was loaded to the CAS server.

Let's view the type and value of the castbl object.

display(type(castbl), castbl)
 
# and the results
swat.cas.table.CASTable
CASTable('HEART_READ_CSV', caslib='CASUSER(Peter)')

The results show that castbl is a CASTable object and simply references the CAS table on the CAS server.

Lastly, I'll preview  the CAS table with the SWAT head method.

castbl.head()

The SWAT head method returns 5 rows from the CAS table to the client.

The read_csv method enables you to use all of the available Pandas parsers through the SWAT package to upload data to the distributed CAS server. The read_csv method parses the data on the client into a DataFrame, and then uploads the DataFrame to the CAS server.

Using the upload_file method in the SWAT package

Instead of using the Pandas API through the SWAT package you can also use the upload_file SWAT method. The upload_file method transfers the file to CAS server and then all parsing is done on the server. The upload_file method is not as robust as the read_csv method, but can be a bit faster than client-side parsing. Upload_file can be used to upload other file types to the CAS server, not just CSV files.

Here, I'll load the same CSV file to the CAS server as in the previous example. This time I'll use upload_file method and name the CAS table heart_upload_file. I'll store the results in castbl2.

castbl2 = conn.upload_file(r'https://support.sas.com/documentation/onlinedoc/viya/exampledatasets/heart.csv', 
                           casout = {'name':'heart_upload_file', 
                                     'caslib':'casuser',
                                     'replace':True})     
 
# and the results
NOTE: Cloud Analytic Services made the uploaded file available as table HEART_UPLOAD_FILE in caslib CASUSER(Peter).
NOTE: The table HEART_UPLOAD_FILE has been created in caslib CASUSER(Peter) from binary data uploaded to Cloud Analytic Services.

The results show that the CSV file was uploaded to CAS successfully.

I'll view the available CAS table using the tableInfo action.

conn.tableInfo(caslib = 'casuser')


The results show that now I have two CAS tables in memory.

Using the SWAT upload_frame method

Another useful SWAT method is the upload_frame method. I like using this method if I'm preparing a DataFrame on my client using Pandas, then need to transfer the Pandas DataFrame to the CAS server for additional processing or to use another SAS Viya application like SAS Visual Analytics.

For example, here is some traditional Pandas code to read the same CSV file as before and then prepare it using traditional Pandas. The code is renaming columns, creating calculated columns and dropping a column.

conn.tableInfo(caslib = 'casuser')
## Read the data into a DataFrame
df_raw = pd.read_csv(r'https://support.sas.com/documentation/onlinedoc/viya/exampledatasets/heart.csv')
 
## Prepare the DataFrame
df = (df_raw
      .rename(columns = lambda colName: colName.upper())
      .assign(
         STATUS = lambda _df: _df.STATUS.str.upper(),
         DEATHCAUSE = lambda _df: _df.DEATHCAUSE.fillna('Still Alive').str.lower()
        )
     .drop('AGEATSTART', axis=1)
)
df.head()

Now that I have the final DataFrame, I can simply upload it to the CAS server using the upload_frame method. Again, the casout parameter specifies output CAS table information.

castbl3 = conn.upload_frame(df, casout = {'name':'heart_upload_frame', 
                                          'caslib':'casuser', 
                                          'replace':True})
 
# and the results
NOTE: Cloud Analytic Services made the uploaded file available as table HEART_UPLOAD_FRAME in caslib CASUSER(Peter).
NOTE: The table HEART_UPLOAD_FRAME has been created in caslib CASUSER(Peter) from binary data uploaded to Cloud Analytic Services.

The results show that the Pandas DataFrame was successfully uploaded to the CAS server.

Lastly I'll view available CAS tables.

Summary

In this post I discussed using the read_csv, upload_file and upload_frame methods for loading client-side CSV files into CAS. Depending on the size of your data and your parsing needs, you may consider one method over the other.

Loading data from the client side into memory onto the CAS server will be slower than loading server-side files into memory. Remember, server-side files are data sources that the CAS server has direct access to like a network path or database. Client-side files are available on your Python client.  Client-side data loading is intended for smaller data sets. For more information check out the Client-Side Data Files and Sources section in the SWAT documentation page.

Additional and related resources

Getting started with Python integration to SAS® Viya® - Part 13 - Loading a Client-Side CSV File into CAS was published on SAS Users.

1月 312023
 

SAS has released SAS 9.4 Maintenance 8, a major update to SAS 9.4.

Security is the primary focus of the Maintenance 8 update. This release contains updates for many of the third-party technologies that are used by the platform, including the Java runtime environment (JRE) and many of the third-party JAR files. This release also adds support for major releases of supported operating systems, while limiting support for operating systems that are no longer supported by their respective suppliers. My colleague Margaret Crevar has summarized these changes in this SAS Communities post.

As with all SAS maintenance releases, this release "rolls up" the hotfixes and enhancements delivered since the last major update (SAS 9.4 Maintenance 7). Most SAS platform products and solutions have also been updated to remain compatible with this release and take advantage of enhancements. However, there are some products and solutions that will not be available immediately, or that will not deliver support for SAS 9.4 Maintenance 8.

Because SAS 9.4 Maintenance 8 is a major software release for SAS 9.4, it is covered by the SAS Support policy for the "Standard Support" timeframe according to its general availability date: Jan 31, 2023. (Current policy offers Standard Support for 5 years from the GA date.)

While this maintenance release doesn't contain new features, it does demonstrate the commitment of SAS to support users of the SAS 9.4 platform for many years to come. (See "Your platform, your way" in "Your analytics, your way" from Shadi Shahin.) New data and analytics capabilities are delivered in the SAS Viya platform, which offers monthly cadence releases via its continuous delivery model.

For an overview of all product changes and updates in SAS 9.4 Maintenance 8, see the What's New topic in the SAS documentation.

SAS 9.4 Maintenance 8 is available was published on SAS Users.

1月 172023
 

Welcome to the continuation of my series Getting Started with Python Integration to SAS Viya. In previous posts, I discussed how to connect to the CAS server, working with CAS actions and CASResults objects, and how to summarize columns. Now it's time to focus on how to get the count of unique values in a CAS table column.

Load and prepare data

First, I connected my Python client to the distributed CAS server and named my connection conn. Then I created a function to load and prepare my CAS table. The custom function loads the WARRANTY_CLAIMS_0117.sashdat file from the Samples caslib into memory, renames the columns using the column labels and drops unnecessary columns. This simplifies the table for the demonstration.

The Samples caslib should be available in your SAS Viya environment and contains sample tables. For more information on how to rename columns in a CAS table view Part 11 - Rename Columns.

## Packages
import swat
import pandas as pd
 
## Options
pd.set_option('display.max_columns', 50)
 
## Connect to CAS
conn = ## your connection information
 
def prep_data():
    ## Load the data into CAS
    conn.loadTable(path='WARRANTY_CLAIMS_0117.sashdat', caslib='samples',
                   casout={'name':'warranty_claims', 'caslib':'casuser'})
 
    ## Reference the CAS table in an object
    castbl = conn.CASTable('warranty_claims', caslib = 'casuser')
 
    ## Store the column names and labels in a dataframe
    df_col_names = castbl.columnInfo()['ColumnInfo'].loc[:,['Column','Label']]
 
    ## Create a list of dictionaries of how to rename each column using the column labels
    renameColumns = []
    for row in df_col_names.iterrows():
        colName = row[1].values[0]
        labelName = row[1].values[1].replace(' ','_')
        renameColumns.append(dict(name=colName, rename=labelName))
 
    ## List of columns to keep in the CAS table
    keepColumns = {'Campaign_Type', 'Platform','Trim_Level','Make','Model_Year','Engine_Model',
                   'Vehicle_Assembly_Plant','Claim_Repair_Start_Date', 'Claim_Repair_End_Date'}
 
    ## Rename and drop columns to make the table easier to use
    castbl.alterTable(columns = renameColumns, keep = keepColumns)
 
    return castbl

Next, I'll execute the user defined function and store the CAS table object in the variable tbl  and view it's type.

tbl = prep_data()
type(tbl)
 
# and the results
NOTE: Cloud Analytic Services made the file WARRANTY_CLAIMS_0117.sashdat available as table WARRANTY_CLAIMS in caslib CASUSER(Peter).
swat.cas.table.CASTable

The results show that the WARRANTY_CLAIMS_0117.sashdat is available in the CAS server, and tbl is a CASTable object.

Lastly, I'll preview the distributed CAS table using the SWAT package head method.

tbl.head()

The results show a preview of the WARRANTY_CLAIMS CAS table. The table provides data on warranty claims for car repairs. The data in this example is small for training purposes. Processing data in the CAS server's massively parallel processing environment is typically reserved for larger data.

Using the Pandas API in the SWAT package - value_counts method

I'll begin by using the Pandas API in the SWAT package which provides the value_counts method. The value_counts method works like it's Pandas counterpart. For example, I'll obtain the count of unique values in the Engine_Model CAS table column. I'll store the results of in vc, then display the type and value of vc.

vc = (tbl               ## CAS table
      .Engine_Model     ## CAS table column
      .value_counts()   ## SWAT value_counts method
     )
 
## Display the type and value
display(type(vc),vc)

The SWAT value_counts method summarizes the data in the distributed CAS server and returns a Pandas Series to the Python client. Once you have the Pandas Series on the client, you can work with it as you normally would. For example, I'll plot the Series using the Pandas plot method.

vc.plot(kind = 'bar', figsize=(8,6));

In this example, I used the Pandas API in the SWAT package to summarize data on the CAS server's massively parallel processing environment to return smaller, summarized results to the Python client. Once the summarized results are on the client, I'll work with them using other Python packages like Pandas.

Using the freq CAS action

Instead of using the Pandas API in the SWAT package you can achieve similar results using native CAS actions. In SWAT, CAS actions are simply specified as a method. One action that provides the count of unique values is the simple.freq CAS action.

For example, I can find the count of unique values for multiple columns within the freq action. Here, I'll specify the Engine_Model, Model_Year and Campaign_Type columns in the the inputs parameter. Then, I'll call the Frequency key after the action to obtain the SASDataFrame stored in the dictionary returned to the Python client. Remember, CAS actions always return a dictionary, or CASResults object, to the Python client. You must use familiar dictionary manipulation techniques to work with the results of an action. For more information on working with results of CAS actions, check out Part 2 - Working with CAS Actions and CASResults Objects.

## Columns to analyze
colNames = ['Engine_Model', 'Model_Year', 'Campaign_Type']
 
## Execute the freq CAS action and store the SASDataFrame
freq_df = tbl.freq(inputs = colNames)['Frequency']
 
## Display the type and DataFrame
display(type(freq_df), freq_df)

Again, the action processes the data in the distributed CAS server and returns results to the Python client. The results show the freq action counts the unique values of each column and stores the results in a single SASDataFrame. Once you have the SASDataFrame on the client, you can work with it like you would a Pandas DataFrame.

For example, I'll loop over each analysis column, query the SASDataFrame for the specific column name, and then plot the count of unique values of each column using the familiar Pandas package.

for column in colNames:
    (freq_df
     .query('Column == @column')
     .loc[:,['CharVar','Frequency']]
     .set_index('CharVar')
     .sort_values(by = 'Frequency', ascending=False)
     .plot(kind='bar', figsize=(8,6), title = f'The {column} Column')
    )

The loop produces a visualization of the count of unique values for each analysis column. This was all done using familiar Pandas code on the client side. Remember, the distributed CAS server did all of the processing and summarization, then returned smaller summarized results to the Python client.

Using the freqTab CAS action

Lastly, you can use the freqTab.freqTab CAS action to construct frequency and crosstabulation tables. The freqTab action provides a variety of additional features and information. The action is not loaded by default, so I'll begin by loading the action set.

conn.loadActionSet('freqTab')

Then I'll use the freqTab action in the freqTab action set to count the unique values for the Model_Year and Engine_Model columns, and also count the unique values of Engine_Model by Model_Year.

tbl.freqTab(tabulate = [
                'Model_Year',
                'Engine_Model',
                {'vars':['Engine_Model','Model_Year']}
            ]
    )

The results above show the freqTab action returns a dictionary with a variety of information. The first SASDataFrame is level information, the second SASDataFrame shows the number of observations used, and the remaining SASDataFrames show the two one-way frequency tables for Model_Year, and Engine_Model, and the crosstabulation between Engine_Model by Model_Year (also includes the totals).

With the results on the Python client, you can begin accessing and manipulating the SASDataFrames as needed.

Summary

The SWAT package blends the world of Pandas and CAS. You can use many of the familiar Pandas methods within the SWAT package like value_counts, or the flexible, highly optimized CAS actions like simple.freq and freqTab.freqTab to obtain counts of unique values in the massively parallel processing CAS engine. For more examples on the freq or freqTab CAS actions, check out my CAS action four part series (part 1, part 2, part 3 and part 4). The four part series executes CAS actions using the native CAS language. However, with some small changes to the syntax you can execute the same actions using Python.

Additional and related resources

Getting started with Python integration to SAS® Viya® - Part 12 - Count of Unique Values was published on SAS Users.