Peter Styliadis

March 22, 2023
 

Welcome to the continuation of my series Getting Started with Python Integration to SAS Viya. In this post I'll discuss how to count missing values in a CAS table using the Python SWAT package.

Load and prepare data

First, I connected my Python client to the distributed CAS server and named my connection conn. Next, I created a small pandas DataFrame with missing values for the demonstration. Then I loaded the DataFrame to the CAS server using the upload_frame method, named the CAS table MISSING_DATA and placed it in the Casuser caslib. The upload_frame method returns a reference to the CAS table, which I store in the variable castbl.

For more information about uploading client-side data to the CAS server, check out my previous post.

import numpy as np
import pandas as pd
import swat

conn = ## your CAS connection information
 
## Create a simple dataframe
df = pd.DataFrame([
            [np.nan, 2, 45, 0, 'A'],
            [3, 4, np.nan, 1,'A'],
            [np.nan, np.nan, 50, np.nan,'B'],
            [np.nan, 3, np.nan, 4,],
            [2, 2, np.nan, 0, 'A'],
            [3, 4, np.nan, 1,'A'],
            [np.nan, np.nan, 75, np.nan,'B'],
            [np.nan, 3, 60, 4,]
            ],
            columns=['col1','col2','col3','col4','col5'])
 
## Upload the dataframe to the CAS server as a CAS table
castbl = conn.upload_frame(df,
                           casout = {'name':'missing_data', 
                                     'caslib':'casuser', 
                                     'replace':True})
 
# and the results
NOTE: Cloud Analytic Services made the uploaded file available as table MISSING_DATA in caslib CASUSER(Peter).
NOTE: The table MISSING_DATA has been created in caslib CASUSER(Peter) from binary data uploaded to Cloud Analytic Services.

The results show that the MISSING_DATA CAS table has been loaded to the CAS server.

Lastly, I'll preview the distributed CAS table using the SWAT package head method.

castbl.head(10)

Count missing values in a CAS table

There are a variety of ways to count the number of missing values in a CAS table. Counting missing values in CAS tables is not exactly the same in the SWAT package as it is in pandas. However, it's just as easy. Let's look at a few different methods.

Using the SWAT package nmiss method

If you are experienced with pandas, you traditionally chain the isna and sum methods to find the number of missing values in each column of a DataFrame. For CAS tables it's even easier. You can use the SWAT package nmiss method. The nmiss method returns the total number of missing values in each column.

Here I'll specify the CAS table object castbl, then the SWAT nmiss method.

castbl.nmiss()

The distributed CAS server counts the number of missing values in each column and returns a Series to the Python client. Once you have the Series object you can use traditional pandas for additional processing. For example, I'll chain the pandas plot method after the SWAT nmiss method to plot the series. The summarization occurs in the distributed CAS server and the plotting occurs on the client.

(castbl                    ## CAS table
 .nmiss()                  ## SWAT method
 .plot(kind = 'bar'))      ## pandas plot
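For comparison, here is a minimal client-side sketch of the pandas idiom mentioned earlier (isna followed by sum). It runs on a local copy of the demonstration DataFrame, not on the CAS table. The name df_local is only for illustration, and the missing col5 entries are written out explicitly as None.

```python
import numpy as np
import pandas as pd

# Local copy of the demonstration data (this is not the CAS table)
df_local = pd.DataFrame([
    [np.nan, 2, 45, 0, 'A'],
    [3, 4, np.nan, 1, 'A'],
    [np.nan, np.nan, 50, np.nan, 'B'],
    [np.nan, 3, np.nan, 4, None],
    [2, 2, np.nan, 0, 'A'],
    [3, 4, np.nan, 1, 'A'],
    [np.nan, np.nan, 75, np.nan, 'B'],
    [np.nan, 3, 60, 4, None]],
    columns=['col1', 'col2', 'col3', 'col4', 'col5'])

# pandas idiom: flag the missing cells, then sum the flags per column
missing_counts = df_local.isna().sum()
print(missing_counts)
```

The difference is where the work happens: this version counts on the client, while nmiss counts on the CAS server and only sends back the small Series.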

You can also use the nmiss method to find the number of missing values in specific columns. Simply specify the CASTable object, the columns, then the method.

colNames = ['col1','col5']
 
castbl[colNames].nmiss()

Using the Distinct CAS action

The distinct CAS action is one of my favorite actions. It not only finds the number of missing values, but it gives you the number of distinct values in each column. Now, depending on the size of your data and the number of distinct values, this one can be a bit more resource intensive since finding the number of distinct values in a column can be more time consuming.

To use the distinct CAS action specify the CASTable object, castbl, then the distinct CAS action.

castbl.distinct()

The distinct action returns a CASResults object (dictionary) back to the Python client with the number of distinct and missing values.

In the distinct action you can use the inputs parameter to specify the columns you want to summarize.

castbl.distinct(inputs = colNames)

The results show the number of distinct and missing values in the specified columns. Now, with the summarized data from the CAS server in a CASResults object, you can use your Python client to continue working with the data. For more information on working with CASResults objects, check out my previous post.
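If you want a rough client-side analog of that output for a small local DataFrame, the sketch below combines the pandas nunique and isna methods. It is only an approximation under my own assumptions: the names df_local, n_distinct and n_missing are made up for illustration, and pandas nunique excludes missing values by default, so its distinct counts may not match what the CAS action reports.

```python
import numpy as np
import pandas as pd

# Local stand-in with the same values as col1 and col5 of the demonstration data
df_local = pd.DataFrame({
    'col1': [np.nan, 3, np.nan, np.nan, 2, 3, np.nan, np.nan],
    'col5': ['A', 'A', 'B', None, 'A', 'A', 'B', None]})

# One row per column: distinct non-missing values and missing values
summary = pd.DataFrame({
    'n_distinct': df_local.nunique(),
    'n_missing': df_local.isna().sum()})
print(summary)
```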

Using the summary CAS action

You can also use the summary CAS action to find the number of missing values for numeric columns. The summary action will also generate a variety of descriptive statistics such as the mean, variance, size, sum of squares and more.

castbl.summary()

The results show the descriptive statistics. The NMiss column contains the number of missing values for the numeric columns.

Within the summary action you can also specify the columns to analyze with the inputs parameter, and the summary statistics to generate with the subSet parameter.

castbl.summary(inputs = ['col1','col2'], 
               subSet = ['min','max','nmiss'])
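For readers who think in pandas, here is a hedged client-side sketch of a similar min, max and missing-count table. It uses a local stand-in for col1 and col2 (df_local is an illustrative name) and is not how the summary action computes its results on the server.

```python
import numpy as np
import pandas as pd

# Local stand-in with the same values as col1 and col2 of the demonstration data
df_local = pd.DataFrame({
    'col1': [np.nan, 3, np.nan, np.nan, 2, 3, np.nan, np.nan],
    'col2': [2, 4, np.nan, 3, 2, 4, np.nan, 3]})

# min and max per column, transposed so each column becomes a row
stats = df_local.agg(['min', 'max']).T

# Add the number of missing values as a third statistic
stats['nmiss'] = df_local.isna().sum()
print(stats)
```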

Summary

The SWAT package blends the world of pandas and CAS to process your distributed data. In this example I focused on using specific SWAT methods and CAS actions to count the number of missing values in CAS table columns.

Additional and related resources

Getting Started with Python Integration to SAS® Viya® - Part 15 - Count Missing Values in a CAS Table was published on SAS Users.

February 14, 2023
 

Welcome to the continuation of my series Getting Started with Python Integration to SAS Viya. In this post I'll discuss how to bring a distributed CAS table back to your Python client as a DataFrame.

In this example, I'm using Python on my laptop (Python client) to connect to the CAS server. Bringing a distributed CAS table back to the Python client as a local DataFrame is typically appropriate for smaller data. If the data sets become larger, it is more efficient to process the data in the CAS server, or bring a smaller subset of data back to the client. The big question is, why would I want to bring a CAS table out of the CAS server's massively parallel processing (MPP) environment back to my client as a DataFrame?

 

The CAS server in SAS Viya is set up to connect to a variety of data sources throughout your organization like databases, folder paths, cloud data and more. Typically these connections are set up by your administrator. Having these connections in place makes it easy to access data. In this example, maybe the data you need is available to the CAS server and is small enough that you don't need the power of the CAS server's MPP environment. Maybe you simply want to pull a CAS table back to your client as a DataFrame and use a familiar Python package like numpy, pandas, seaborn, matplotlib, or scikit-learn for data exploration, preprocessing, visualization or machine learning.

Use caution when bringing data back to your Python client

With the to_frame SWAT method you can easily transfer a CAS table to your Python client as a DataFrame. When using the to_frame method YOU MUST USE CAUTION. It will attempt to pull all of the data down from the CAS server regardless of size. If the data is large, this can be time consuming or overwhelm your Python client depending on your available memory.
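One way to build that caution into your code is a small guard that checks the row count before pulling anything down. This is only a sketch: the helper name to_frame_if_small and the max_rows default are hypothetical, and it assumes the table object exposes a pandas-style shape attribute alongside to_frame.

```python
def to_frame_if_small(castbl, max_rows=100_000):
    """Pull a table to the client only if it has at most max_rows rows.

    Hypothetical helper: assumes castbl has a pandas-style shape
    attribute and a to_frame method.
    """
    n_rows = castbl.shape[0]
    if n_rows > max_rows:
        raise ValueError(
            f'Table has {n_rows} rows, more than the limit of {max_rows}; '
            'summarize or sample on the CAS server instead.')
    return castbl.to_frame()
```

Pick a limit that suits the memory available on your own client.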

Load the demonstration data into memory

I've created a connection to my CAS server in the variable conn. I'll use my conn connection object and the SWAT upload_file method to load the cars.csv file from the Example Data Sets for the SAS® Viya® Platform website into memory on the CAS server. I'll add some formats and labels to the CAS table. This is a small table for demonstration purposes.

fileurl = 'https://support.sas.com/documentation/onlinedoc/viya/exampledatasets/cars.csv'
castbl = conn.upload_file(fileurl, 
                          casout = {'name':'cars_cas_table', 
                                    'caslib':'casuser', 
                                    'replace':True},
                          importoptions = {
                              'fileType':'csv',
                              'guessRows':100,
                              'vars': {
                                  'Invoice':{'format':'dollar16.2','label':'Invoice Price'},
                                  'MSRP':{'format':'dollar16.2', 'label':'Manufacturer Suggested Retail Price'},
                                  'Weight':{'format':'comma16.'},
                                  'EngineSize':{'label':'Engine Size (L)'},
                                  'Length':{'label':'Length (IN)'},
                                  'MPG_City':{'label':'MPG (City)'},
                                  'MPG_Highway':{'label':'MPG (Highway)'}
                              }
                          })
 
# and the results
NOTE: Cloud Analytic Services made the uploaded file available as table CARS_CAS_TABLE in caslib CASUSER(Peter).
NOTE: The table CARS_CAS_TABLE has been created in caslib CASUSER(Peter) from binary data uploaded to Cloud Analytic Services.

Next, I'll execute the tableInfo action to confirm the table was loaded into memory.

conn.tableInfo(caslib = 'casuser')


Lastly I'll execute the columnInfo action to view the column information of the CAS table.

castbl.columnInfo()

 

The results above show the CARS_CAS_TABLE has 15 columns. Some of the columns contain SAS labels and SAS formats.

Pull the entire CAS table to the client as a DataFrame

We saw earlier that the CARS_CAS_TABLE is a small in-memory table on the CAS server. Let's pretend the data was connected to some cloud data storage and was loaded into memory. I noticed that the data is small, and I want to use familiar Python packages to process it.

The first thing I can do is use the to_frame SWAT method on my castbl object to convert the CAS table to a DataFrame. I'll store the result in df and then view its type.

df = castbl.to_frame()
display(type(df))
 
# and the results
swat.dataframe.SASDataFrame

 

The results show I now have a SASDataFrame. A SASDataFrame lives on the Python client and is a subclass of pandas.DataFrame. Therefore, anything you can do with a pandas.DataFrame will also work with a SASDataFrame. The only difference is that a SASDataFrame object contains extra metadata from the CAS table. For more information on CASTable vs DataFrame vs SASDataFrame, check out the SWAT documentation.

You can view the extra metadata of a SASDataFrame with the colinfo attribute.

df.colinfo

The results show the SASDataFrame contains a variety of information about the columns like the data type, name and width. If the column contains a SAS label or format, that is also stored.

Once you have the SASDataFrame on your Python client, you can use pandas. For example, here I'll execute the pandas plot method to create a scatter plot of EngineSize by MPG_City.

df.plot.scatter(x = 'EngineSize', y = 'MPG_City', title = "Engine Size by MPG City");

The results show a simple scatter plot using pandas.

Apply column labels to a SASDataFrame

Sometimes you might want to use column labels, as they are more descriptive. Traditional pandas DataFrames don't have the concept of labels, but CAS tables do. CAS enables you to add more descriptive labels to columns. Since a SASDataFrame stores information from the CAS table, you can use the SWAT apply_labels method to apply labels to the SASDataFrame. For more information on adding labels to a CAS table, check out my previous post Renaming columns.

df.apply_labels()

Notice the apply_labels method applies the SAS column labels (Manufacturer Suggested Retail Price, Invoice Price, Engine Size (L), Length (IN), MPG (City) and MPG (Highway)) to the SASDataFrame.

Pull a sample of the CAS table to the client as a DataFrame

What if the CAS table is big, but you want to work with a sample of the data using a specific Python package? Well, you can use the sample_pct parameter in the to_frame SWAT method to pull a smaller subset of data to your Python client.

Here I'll specify that I want ten percent of the data from the CAS table. I'll also add the sample_seed parameter so the same random subset is pulled each time. Then I'll use the shape attribute to count the number of rows and columns in the new SASDataFrame.

df = castbl.to_frame(sample_pct = .1, sample_seed = 99)
df.shape
 
# and the results
(43, 15)

The results show that only 43 rows of data were returned from the CAS table as a SASDataFrame on the Python client.
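If the data is already on your client, pandas offers a comparable idea through its sample method. The sketch below is a client-side analog only. It uses a made-up 428-row stand-in table (df_local and row_id are illustrative names, and the row count is my assumption), and the rows it picks will not match the rows CAS returns.

```python
import pandas as pd

# Stand-in table with 428 rows (an assumed size, for illustration only)
df_local = pd.DataFrame({'row_id': range(428)})

# Take a reproducible ten percent sample on the client
sample = df_local.sample(frac=0.1, random_state=99)
print(sample.shape)
```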

Apply SAS formats to the SASDataFrame

Lastly, what if you want to apply the SAS formats of a CAS table to your SASDataFrame? You can do that using the format parameter in the SWAT to_frame method, as seen in the code below. I'll also display five rows of the SASDataFrame and the data types of the columns.

df_formats = castbl.to_frame(format=True)
 
display(df_formats.head(), df_formats.dtypes)

The results show the DOLLAR format was applied to the MSRP and Invoice columns, and the COMMA format was applied to the Weight column.

The dtypes attribute shows that when you use the format parameter in the to_frame method, all columns are returned to your Python client as objects, even if they do not contain a format.
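The practical consequence is that formatted values are text, not numbers. The small pandas sketch below illustrates this with a Python format string standing in for a dollar-style format. The values are invented for illustration, and the Python format is not the SAS DOLLAR format itself.

```python
import pandas as pd

# Invented prices, for illustration only
df_local = pd.DataFrame({'MSRP': [36945.0, 23820.0]})

# Formatting turns each number into a string such as '$36,945.00'
formatted = df_local['MSRP'].map('${:,.2f}'.format)
print(formatted.tolist())
```

Because the formatted values are strings, numeric operations such as computing a mean no longer apply to them. If you need to calculate, pull the data without the format parameter.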

Summary

The SWAT package blends the world of Pandas and CAS. It enables you to use the CAS server's massively parallel processing engine for data exploration, preprocessing and analytics. It also enables you to easily transfer your distributed CAS tables back to your Python client as SASDataFrames for additional processing using other Python packages.

The one thing you must remember when using the to_frame method is that transferring large data can be time consuming or take up all of your Python client's resources. USE IT WITH CAUTION.

Additional and related resources

Getting Started with Python Integration to SAS® Viya® - Part 14 - CAS Table to DataFrame was published on SAS Users.

February 7, 2023
 

Welcome to the continuation of my series Getting Started with Python Integration to SAS Viya. In previous posts, I discussed how to connect to the CAS server, how to execute CAS actions, and how your data is organized on the CAS server. In this post I'll discuss loading client-side CSV files into CAS.

Loading data from a client-side file is appropriate for smaller data. If the data sets become larger, it is more efficient to use a server-side load with the loadTable action to access and load the data. For more information, check out my previous post on loading server-side files into memory.

In this example, I'm using Python on my laptop to connect to the CAS server. Client-side files consist of files on my laptop (client). In these examples I'll load the heart.csv file from the SAS® Viya® Example Data Sets webpage to the CAS server. There are multiple ways to load client-side CSV files into CAS. I'll show you how to do it using the familiar Pandas API in the SWAT package with the read_csv method. I'll also show you using the upload_file and upload_frame SWAT methods.

Using the Pandas API in the SWAT package - read_csv

First, I'll begin with the familiar read_csv method from the SWAT package and store the results in castbl. The SWAT read_csv method calls the Pandas read_csv method, creates a client-side DataFrame, then uploads the DataFrame to the CAS server as an in-memory table. The main difference is the casout parameter. The casout parameter is specific to the SWAT package. It enables you to specify the output CAS table information like the name of the new distributed table and the in-memory location. Here, I'll name the CAS table heart_read_csv and place it in the Casuser caslib. I'll also specify the replace parameter to replace the CAS table if it already exists.

castbl = conn.read_csv(r'https://support.sas.com/documentation/onlinedoc/viya/exampledatasets/heart.csv',
                      casout = {'name':'heart_read_csv', 
                                'caslib':'casuser', 
                                'replace':True})
 
# and the results
NOTE: Cloud Analytic Services made the uploaded file available as table HEART_READ_CSV in caslib CASUSER(Peter).
NOTE: The table HEART_READ_CSV has been created in caslib CASUSER(Peter) from binary data uploaded to Cloud Analytic Services.

Next, I'll execute the tableInfo CAS action to view available CAS tables in the Casuser caslib.

conn.tableInfo(caslib = 'casuser')

The results above show that the table was loaded to the CAS server.

Let's view the type and value of the castbl object.

display(type(castbl), castbl)
 
# and the results
swat.cas.table.CASTable
CASTable('HEART_READ_CSV', caslib='CASUSER(Peter)')

The results show that castbl is a CASTable object and simply references the CAS table on the CAS server.

Lastly, I'll preview the CAS table with the SWAT head method.

castbl.head()

The SWAT head method returns 5 rows from the CAS table to the client.

The read_csv method enables you to use all of the available Pandas parsers through the SWAT package to upload data to the distributed CAS server. The read_csv method parses the data on the client into a DataFrame, and then uploads the DataFrame to the CAS server.
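The two steps described above can be sketched as a small helper. This is my own illustration of the idea, not the SWAT implementation: parse_then_upload is a hypothetical name, and the only connection method it relies on is upload_frame with the casout parameter, as used elsewhere in this series.

```python
import pandas as pd

def parse_then_upload(conn, source, casout):
    """Parse a CSV on the client with pandas, then upload the result to CAS.

    Hypothetical helper: conn is any object with an upload_frame method.
    """
    df = pd.read_csv(source)                      # client-side parsing
    return conn.upload_frame(df, casout=casout)   # transfer to the CAS server
```

Splitting the steps like this can be handy when you want to inspect or clean the DataFrame before it is uploaded.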

Using the upload_file method in the SWAT package

Instead of using the Pandas API through the SWAT package, you can also use the upload_file SWAT method. The upload_file method transfers the file to the CAS server, and then all parsing is done on the server. The upload_file method is not as robust as the read_csv method, but it can be a bit faster than client-side parsing. The upload_file method can also be used to upload other file types to the CAS server, not just CSV files.

Here, I'll load the same CSV file to the CAS server as in the previous example. This time I'll use the upload_file method and name the CAS table heart_upload_file. I'll store the results in castbl2.

castbl2 = conn.upload_file(r'https://support.sas.com/documentation/onlinedoc/viya/exampledatasets/heart.csv', 
                           casout = {'name':'heart_upload_file', 
                                     'caslib':'casuser',
                                     'replace':True})     
 
# and the results
NOTE: Cloud Analytic Services made the uploaded file available as table HEART_UPLOAD_FILE in caslib CASUSER(Peter).
NOTE: The table HEART_UPLOAD_FILE has been created in caslib CASUSER(Peter) from binary data uploaded to Cloud Analytic Services.

The results show that the CSV file was uploaded to CAS successfully.

I'll view the available CAS table using the tableInfo action.

conn.tableInfo(caslib = 'casuser')


The results show that now I have two CAS tables in memory.

Using the SWAT upload_frame method

Another useful SWAT method is the upload_frame method. I like using this method if I'm preparing a DataFrame on my client using Pandas, then need to transfer the Pandas DataFrame to the CAS server for additional processing or to use another SAS Viya application like SAS Visual Analytics.

For example, here is some traditional Pandas code to read the same CSV file as before and then prepare it using traditional Pandas. The code is renaming columns, creating calculated columns and dropping a column.

import pandas as pd

## Read the data into a DataFrame
df_raw = pd.read_csv(r'https://support.sas.com/documentation/onlinedoc/viya/exampledatasets/heart.csv')
 
## Prepare the DataFrame
df = (df_raw
      .rename(columns = lambda colName: colName.upper())
      .assign(
         STATUS = lambda _df: _df.STATUS.str.upper(),
         DEATHCAUSE = lambda _df: _df.DEATHCAUSE.fillna('Still Alive').str.lower()
        )
     .drop('AGEATSTART', axis=1)
)
df.head()

Now that I have the final DataFrame, I can simply upload it to the CAS server using the upload_frame method. Again, the casout parameter specifies output CAS table information.

castbl3 = conn.upload_frame(df, casout = {'name':'heart_upload_frame', 
                                          'caslib':'casuser', 
                                          'replace':True})
 
# and the results
NOTE: Cloud Analytic Services made the uploaded file available as table HEART_UPLOAD_FRAME in caslib CASUSER(Peter).
NOTE: The table HEART_UPLOAD_FRAME has been created in caslib CASUSER(Peter) from binary data uploaded to Cloud Analytic Services.

The results show that the Pandas DataFrame was successfully uploaded to the CAS server.

Lastly, I'll view the available CAS tables.

conn.tableInfo(caslib = 'casuser')

Summary

In this post I discussed using the read_csv, upload_file and upload_frame methods for loading client-side CSV files into CAS. Depending on the size of your data and your parsing needs, you may consider one method over the other.

Loading data from the client side into memory onto the CAS server will be slower than loading server-side files into memory. Remember, server-side files are data sources that the CAS server has direct access to like a network path or database. Client-side files are available on your Python client. Client-side data loading is intended for smaller data sets. For more information check out the Client-Side Data Files and Sources section in the SWAT documentation page.

Additional and related resources

Getting started with Python integration to SAS® Viya® - Part 13 - Loading a Client-Side CSV File into CAS was published on SAS Users.

January 17, 2023
 

Welcome to the continuation of my series Getting Started with Python Integration to SAS Viya. In previous posts, I discussed how to connect to the CAS server, working with CAS actions and CASResults objects, and how to summarize columns. Now it's time to focus on how to get the count of unique values in a CAS table column.

Load and prepare data

First, I connected my Python client to the distributed CAS server and named my connection conn. Then I created a function to load and prepare my CAS table. The custom function loads the WARRANTY_CLAIMS_0117.sashdat file from the Samples caslib into memory, renames the columns using the column labels and drops unnecessary columns. This simplifies the table for the demonstration.

The Samples caslib should be available in your SAS Viya environment and contains sample tables. For more information on how to rename columns in a CAS table view Part 11 - Rename Columns.

## Packages
import swat
import pandas as pd
 
## Options
pd.set_option('display.max_columns', 50)
 
## Connect to CAS
conn = ## your connection information
 
def prep_data():
    ## Load the data into CAS
    conn.loadTable(path='WARRANTY_CLAIMS_0117.sashdat', caslib='samples',
                   casout={'name':'warranty_claims', 'caslib':'casuser'})
 
    ## Reference the CAS table in an object
    castbl = conn.CASTable('warranty_claims', caslib = 'casuser')
 
    ## Store the column names and labels in a dataframe
    df_col_names = castbl.columnInfo()['ColumnInfo'].loc[:,['Column','Label']]
 
    ## Create a list of dictionaries of how to rename each column using the column labels
    renameColumns = []
    for row in df_col_names.iterrows():
        colName = row[1].values[0]
        labelName = row[1].values[1].replace(' ','_')
        renameColumns.append(dict(name=colName, rename=labelName))
 
    ## List of columns to keep in the CAS table
    keepColumns = {'Campaign_Type', 'Platform','Trim_Level','Make','Model_Year','Engine_Model',
                   'Vehicle_Assembly_Plant','Claim_Repair_Start_Date', 'Claim_Repair_End_Date'}
 
    ## Rename and drop columns to make the table easier to use
    castbl.alterTable(columns = renameColumns, keep = keepColumns)
 
    return castbl
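To make the rename step easier to follow, here is a standalone sketch of how the list of rename dictionaries is built. It uses a toy DataFrame in place of the ColumnInfo result (the column names and labels are made up) and pairs the two columns with zip instead of iterrows, which is simply an alternative style.

```python
import pandas as pd

# Toy stand-in for the ColumnInfo result (names and labels are made up)
df_col_names = pd.DataFrame({
    'Column': ['col_a', 'col_b'],
    'Label': ['Model Year', 'Engine Model']})

# One dictionary per column: the current name and the new name built from the label
rename_columns = [
    {'name': col, 'rename': label.replace(' ', '_')}
    for col, label in zip(df_col_names['Column'], df_col_names['Label'])]
print(rename_columns)
```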

Next, I'll execute the user-defined function, store the CAS table object in the variable tbl and view its type.

tbl = prep_data()
type(tbl)
 
# and the results
NOTE: Cloud Analytic Services made the file WARRANTY_CLAIMS_0117.sashdat available as table WARRANTY_CLAIMS in caslib CASUSER(Peter).
swat.cas.table.CASTable

The results show that the WARRANTY_CLAIMS_0117.sashdat is available in the CAS server, and tbl is a CASTable object.

Lastly, I'll preview the distributed CAS table using the SWAT package head method.

tbl.head()

The results show a preview of the WARRANTY_CLAIMS CAS table. The table provides data on warranty claims for car repairs. The data in this example is small for training purposes. Processing data in the CAS server's massively parallel processing environment is typically reserved for larger data.

Using the Pandas API in the SWAT package - value_counts method

I'll begin by using the Pandas API in the SWAT package, which provides the value_counts method. The value_counts method works like its Pandas counterpart. For example, I'll obtain the count of unique values in the Engine_Model CAS table column. I'll store the results in vc, then display the type and value of vc.

vc = (tbl               ## CAS table
      .Engine_Model     ## CAS table column
      .value_counts()   ## SWAT value_counts method
     )
 
## Display the type and value
display(type(vc),vc)

The SWAT value_counts method summarizes the data in the distributed CAS server and returns a Pandas Series to the Python client. Once you have the Pandas Series on the client, you can work with it as you normally would. For example, I'll plot the Series using the Pandas plot method.

vc.plot(kind = 'bar', figsize=(8,6));

In this example, I used the Pandas API in the SWAT package to summarize data on the CAS server's massively parallel processing environment to return smaller, summarized results to the Python client. Once the summarized results are on the client, I'll work with them using other Python packages like Pandas.

Using the freq CAS action

Instead of using the Pandas API in the SWAT package you can achieve similar results using native CAS actions. In SWAT, CAS actions are simply specified as a method. One action that provides the count of unique values is the simple.freq CAS action.

For example, I can find the count of unique values for multiple columns within the freq action. Here, I'll specify the Engine_Model, Model_Year and Campaign_Type columns in the inputs parameter. Then, I'll call the Frequency key after the action to obtain the SASDataFrame stored in the dictionary returned to the Python client. Remember, CAS actions always return a dictionary, or CASResults object, to the Python client. You must use familiar dictionary manipulation techniques to work with the results of an action. For more information on working with results of CAS actions, check out Part 2 - Working with CAS Actions and CASResults Objects.

## Columns to analyze
colNames = ['Engine_Model', 'Model_Year', 'Campaign_Type']
 
## Execute the freq CAS action and store the SASDataFrame
freq_df = tbl.freq(inputs = colNames)['Frequency']
 
## Display the type and DataFrame
display(type(freq_df), freq_df)

Again, the action processes the data in the distributed CAS server and returns results to the Python client. The results show the freq action counts the unique values of each column and stores the results in a single SASDataFrame. Once you have the SASDataFrame on the client, you can work with it like you would a Pandas DataFrame.

For example, I'll loop over each analysis column, query the SASDataFrame for the specific column name, and then plot the count of unique values of each column using the familiar Pandas package.

for column in colNames:
    (freq_df
     .query('Column == @column')
     .loc[:,['CharVar','Frequency']]
     .set_index('CharVar')
     .sort_values(by = 'Frequency', ascending=False)
     .plot(kind='bar', figsize=(8,6), title = f'The {column} Column')
    )

The loop produces a visualization of the count of unique values for each analysis column. This was all done using familiar Pandas code on the client side. Remember, the distributed CAS server did all of the processing and summarization, then returned smaller summarized results to the Python client.
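If you want to experiment with the filtering step without a CAS connection, the sketch below applies the same query to a toy DataFrame. The Column, CharVar and Frequency column names come from the code above; the values and the helper name counts_for are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the Frequency result (values are made up)
freq_local = pd.DataFrame({
    'Column':    ['Engine_Model', 'Engine_Model', 'Campaign_Type', 'Campaign_Type'],
    'CharVar':   ['V6', 'V8', 'Recall', 'Service'],
    'Frequency': [10, 25, 7, 3]})

def counts_for(freq_df, column):
    """Return the frequencies for one analysis column, largest first."""
    return (freq_df
            .query('Column == @column')          # keep rows for this column
            .set_index('CharVar')['Frequency']   # values become the index
            .sort_values(ascending=False))

print(counts_for(freq_local, 'Engine_Model'))
```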

Using the freqTab CAS action

Lastly, you can use the freqTab.freqTab CAS action to construct frequency and crosstabulation tables. The freqTab action provides a variety of additional features and information. The action is not loaded by default, so I'll begin by loading the action set.

conn.loadActionSet('freqTab')

Then I'll use the freqTab action in the freqTab action set to count the unique values for the Model_Year and Engine_Model columns, and also count the unique values of Engine_Model by Model_Year.

tbl.freqTab(tabulate = [
                'Model_Year',
                'Engine_Model',
                {'vars':['Engine_Model','Model_Year']}
            ]
    )

The results above show that the freqTab action returns a dictionary with a variety of information. The first SASDataFrame is level information, the second shows the number of observations used, and the remaining SASDataFrames show the two one-way frequency tables for Model_Year and Engine_Model, plus the crosstabulation of Engine_Model by Model_Year (which also includes the totals).

With the results on the Python client, you can begin accessing and manipulating the SASDataFrames as needed.
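If crosstabulations are new to you, the pandas crosstab function produces a similar table on the client for small data. The sketch below is a local analog only, using made-up values; it does not reproduce the freqTab output format.

```python
import pandas as pd

# Made-up values, for illustration only
df_local = pd.DataFrame({
    'Engine_Model': ['V6', 'V8', 'V6', 'V6'],
    'Model_Year':   [2016, 2016, 2017, 2017]})

# Count Engine_Model by Model_Year and add row and column totals
xtab = pd.crosstab(df_local['Engine_Model'], df_local['Model_Year'], margins=True)
print(xtab)
```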

Summary

The SWAT package blends the world of Pandas and CAS. You can use many of the familiar Pandas methods within the SWAT package like value_counts, or the flexible, highly optimized CAS actions like simple.freq and freqTab.freqTab to obtain counts of unique values in the massively parallel processing CAS engine. For more examples on the freq or freqTab CAS actions, check out my CAS action four part series (part 1, part 2, part 3 and part 4). The four part series executes CAS actions using the native CAS language. However, with some small changes to the syntax you can execute the same actions using Python.

Additional and related resources

Getting started with Python integration to SAS® Viya® - Part 12 - Count of Unique Values was published on SAS Users.

December 22, 2022
 

The addition of the PYTHON procedure and Python editor in SAS Viya enables users to execute Python code in SAS Studio. This new capability in SAS Viya adds another tool to SAS's existing collection. With this addition I thought, how can I utilize this newfound power?

In this example, I'll keep it simple. I want to create a Microsoft Excel report using a combination of SAS, Python and SQL. I'll use data that's stored in a SAS library; however, the library could be using data stored anywhere, like a path, database or in the cloud. I'll write a program that executes the following:

All code used in this post is located on GitHub, here.

Set folder path and file name

To begin, I'll create a macro variable to specify the output folder path and Microsoft Excel workbook name.

%let path=/*Enter your output folder path*/;
%let xlFileName = myExcelReport.xlsx;

Prepare data

Further, I'll prepare the data using the SAS DATA step. I'll use the available sashelp.cars table, create a new column named MPG_Avg, and drop unnecessary columns. Instead of using the DATA step you can use Python or SQL to prepare the data. Whatever tool works best for you.

data work.cars;
    set sashelp.cars;
    MPG_Avg=mean(MPG_City, MPG_Highway);
    drop Wheelbase Weight Length;
run;
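Since the same preparation can be done in Python, here is a hedged pandas sketch of the equivalent steps. It uses a tiny made-up table rather than sashelp.cars, and the names cars and cars_prepared are illustrative. Like the SAS MEAN function, the pandas row-wise mean skips missing values by default.

```python
import numpy as np
import pandas as pd

# Tiny made-up stand-in for the cars data
cars = pd.DataFrame({
    'MPG_City': [20.0, 30.0, np.nan],
    'MPG_Highway': [30.0, 40.0, 25.0],
    'Weight': [3000, 2500, 4000]})

# Create MPG_Avg as the row-wise mean, then drop a column that is not needed
cars_prepared = (cars
    .assign(MPG_Avg=lambda d: d[['MPG_City', 'MPG_Highway']].mean(axis=1))
    .drop(columns=['Weight']))
print(cars_prepared)
```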

Create the Microsoft Excel workbook

After the data is ready, I'll use the ODS EXCEL statement to create the Excel spreadsheet. The following ODS options are used:

  • FILE - specifies the file path and name.
  • STYLE - modifies the appearance of the SAS output.
  • EMBEDDED_TITLES - specifies that titles should appear in the worksheet.
  • SHEET_INTERVAL - enables manual control over when a new worksheet is created.

ods excel file="&path./&xlFileName" 
          style=ExcelMidnight   
          options(embedded_titles="on");

Worksheet 1

Print the data using SAS

With the ODS EXCEL destination open I'll name the first worksheet Data, and manually specify when a new sheet is created. Next, I'll use the PRINT procedure to print the detailed data to Excel. The PRINT procedure will print the entire SAS data set with the associated formats and styles to Excel.

* Sheet 1 - Print the data using SAS *;
ods excel options(sheet_name='Data' sheet_interval='none');
title height=16pt color=white "Detailed Car Data";
proc print data=work.cars noobs;
run;

Worksheet 2

Create violin plots using Python

Next, I want to create violin plots on a new worksheet named Origin_MPG. Now, these can be created in SAS, but I personally found the matplotlib package in Python a bit easier to use. With the PYTHON procedure, I can include the Python code within the SAS program (or you can reference a .py file) to create the visualization. Then I'll use the SAS.pyplot method to save and render the visualization. Because the pyplot callback renders the image in the results tab, the image is also exported to the open Excel workbook by default.

First I'll use ODS EXCEL to create the new worksheet and the TITLE statement to add a title to the Excel worksheet.

ods excel options(sheet_name='Origin_MPG' sheet_interval='now');
title justify=left height=16pt color=white "Analyzing MPG by Each Car Origin";

Then I'll execute the PYTHON procedure to execute my Python code.

* Create violin plots using Python *;
proc python;
submit;
 
##
## Import packages and options
##
 
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
outpath = SAS.symget('path')
 
##
## Data prep for the visualization
##
 
## Load the SAS table as a DataFrame
df = (SAS
      .sd2df('work.cars')                 ## SAS callback method to load the SAS data set as a DataFrame
      .loc[:,['Origin','MPG_Avg']]        ## Keep the necessary columns
)
 
 
## Create a series of MPG_Avg for each distinct origin for the violin plots
listOfUniqueOrigins = df.Origin.unique().tolist()
 
mpg_by_origin = {}
for origin in listOfUniqueOrigins:
    mpg_by_origin[origin] = df.query(f'Origin == @origin ').MPG_Avg
 
 
##
## Create the violin plots
##
 
## Violin plot
fig, ax = plt.subplots(figsize = (8,6))
ax.violinplot(mpg_by_origin.values(), showmedians=True)
 
## Plot appearance
ax.set_title('Miles per Gallon (MPG) by Origin')
rename_x_axis = {'position': [1,2,3], 'labels':listOfUniqueOrigins}
ax.set_xticks(rename_x_axis['position'])
ax.set_xticklabels(rename_x_axis['labels']);
 
## Save and render the image file
SAS.pyplot(plt, filename='violinPlot',filepath=outpath)
 
endsubmit;
quit;
title;

SQL Aggregation

SQL is an extremely common and useful language for data analysts and scientists. I find using SQL for aggregation easy, so I will create a simple aggregation and add it below the visualization on the same worksheet in the Excel report.

* SQL Aggregation *;
title justify=left "Average MPG by Car Origin";
proc sql;
select Origin, round(mean(MPG_Avg)) as AverageMPG
	from work.cars
	group by Origin
	order by AverageMPG desc;
quit;
title;
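For readers who think in Python, the aggregation above can be sketched with the standard library. This is only an illustration on hypothetical (Origin, MPG_Avg) pairs, and the `sas_round` helper is an assumption of this sketch. It is there because Python's built-in `round` sends ties to the nearest even integer, while the SAS ROUND function rounds ties away from zero.

```python
from collections import defaultdict
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical (Origin, MPG_Avg) pairs; illustrative values only
rows = [("Asia", 25.0), ("Asia", 28.0), ("Europe", 22.5),
        ("Europe", 21.5), ("USA", 23.0)]

# GROUP BY Origin: collect the MPG_Avg values for each origin
groups = defaultdict(list)
for origin, mpg in rows:
    groups[origin].append(mpg)


def sas_round(value):
    """Round to the nearest integer, with ties going away from zero."""
    return int(Decimal(str(value)).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


# SELECT Origin, round(mean(MPG_Avg)) ... ORDER BY AverageMPG DESC
summary = sorted(
    ((origin, sas_round(sum(values) / len(values))) for origin, values in groups.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
print(summary)
```

With these sample values the Asia mean is exactly 26.5, which this helper rounds to 27, whereas Python's built-in `round(26.5)` gives 26.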

Add text

At the end of the same worksheet I'll add some simple text using the ODSTEXT procedure to give some information about the data.

proc odstext;
   heading 'NOTES';
   p 'Using the SASHELP.CARS data. The following car Origins were analyzed:';
   list ;
      item 'Asia';
      item 'Europe';
      item 'USA';
   end;    
   p 'Created by Peter S';
quit;

Close the Excel workbook

Lastly, I'll close the ODS EXCEL destination since I am done writing out to Excel.

ods excel close;

Results

That's it! Now I'll execute the entire program and view the Excel workbook.

Summary

With the capabilities of SAS and the new ability to execute Python code in SAS Studio, teams have a variety of tools in SAS Viya for their analytic needs.

Additional resources

PYTHON Procedure documentation
SAS opens its code editor interface to Python users
Using PROC PYTHON to augment your SAS programs
ODS Excel Statement

Creating a Microsoft Excel report using SAS, Python and SQL! was published on SAS Users.

December 20, 2022
 

Welcome back to my SAS Users blog series CAS Action! - a series on fundamentals. The previous posts show how to use the simple.freq CAS action to generate, save and group simple frequency tables. In this post I will show you how to use the freqTab.freqTab CAS action to generate more advanced one-way frequency and crosstabulation tables.

In this example, I will use the CAS language (CASL) to execute the freqTab CAS action. Instead of using CASL, I could execute the same action with Python, R and more with some slight changes to the syntax for the specific language. Refer to the documentation for syntax in other languages.

Load the demonstration data into memory

I'll start by executing the loadTable action to load the WARRANTY_CLAIMS_0117.sashdat file from the Samples caslib into memory. By default the Samples caslib should be available in your SAS Viya environment. I'll load the table to the Casuser caslib and then clean up the CAS table by renaming and dropping columns to make the table easier to use. For more information on how to rename columns, check out my previous post. Lastly, I'll execute the fetch action to preview 5 rows.

proc cas;
   * Specify the input/output CAS table *;
   casTbl = {name = "WARRANTY_CLAIMS", caslib = "casuser"};
 
   * Load the CAS table into memory *;
    table.loadtable / 
        path = "WARRANTY_CLAIMS_0117.sashdat", caslib = "samples",
        casOut = casTbl + {replace=TRUE};
 
* Rename columns with the labels. Spaces replaced with underscores *;
 
   *Store the results of the columnInfo action in a dictionary *;
   table.columnInfo result=cr / table = casTbl;
 
   * Loop over the columnInfo result table and create a list of dictionaries *;
   listElementCounter = 0;
   do columnMetadata over cr.ColumnInfo;
	listElementCounter = listElementCounter + 1;
	convertColLabel = tranwrd(columnMetadata['Label'],' ','_');
	renameColumns[listElementCounter] = {name = columnMetadata['Column'], rename = convertColLabel, label=""};
   end;
 
   * Rename columns *;
   keepColumns = {'Campaign_Type', 'Platform','Trim_Level','Make','Model_Year','Engine_Model',
                  'Vehicle_Assembly_Plant','Claim_Repair_Start_Date', 'Claim_Repair_End_Date'};
   table.alterTable / 
	name = casTbl['Name'], caslib = casTbl['caslib'], 
	columns=renameColumns,
	keep = keepColumns;
 
   * Preview CAS table *;
   table.fetch / table = casTbl, to = 5;
quit;

The results above show a preview of the warranty_claims CAS table.

One-way frequency tables

To create more advanced one-way frequency tables, use the freqTab.freqTab CAS action. In the freqTab action, use the table parameter to specify the CAS table and the tabulate parameter to specify the column, or columns, to analyze. The tabulate parameter is extremely flexible and provides a variety of ways to analyze your data. In this example, I'll specify the warranty_claims CAS table and the Campaign_Type and Make columns as a list in the tabulate parameter.

proc cas;
   * CAS table reference *;
   casTbl = {name = "WARRANTY_CLAIMS", caslib = "casuser"};
 
   * One-way frequency tables *;
   freqTab.freqTab / 
       table = casTbl,
       tabulate = {'Campaign_Type','Make'};
quit;

Results

The results above show the freqTab action generates a variety of additional information compared to the freq action, including the number of observations used, variable levels and timing. The freqTab frequency tables contain the frequency of each value, as expected; they also include the percent of the total, the cumulative frequency and the cumulative percent.
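To make the four statistics concrete, here is a minimal plain-Python sketch that computes them for a small, hypothetical list of Make values. The dictionary keys are names chosen for this illustration and are not the column names returned by the freqTab action.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical values of a Make-like column (illustrative only)
makes = ["Zeus", "Titan", "Zeus", "Ares", "Zeus", "Titan", "Zeus", "Ares"]

counts = Counter(makes)
total = sum(counts.values())

# Order the levels alphabetically and build the running (cumulative) frequency
levels = sorted(counts)
cum_freqs = list(accumulate(counts[level] for level in levels))

freq_table = [
    {
        "Value": level,
        "Frequency": counts[level],
        "Percent": 100 * counts[level] / total,
        "CumFrequency": cum,
        "CumPercent": 100 * cum / total,
    }
    for level, cum in zip(levels, cum_freqs)
]

for row in freq_table:
    print(row)
```

The cumulative columns depend on the order of the levels, so the last row always reaches the total count and 100 percent.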

Two-way crosstabulation tables

Instead of producing one-way frequency tables, you can also create two-way crosstabulation tables. One way is to continue to add elements to the list in the tabulate parameter. Here a one-way frequency table will be created for Campaign_Type and Make as seen before. Then, I'll add a dictionary to the list. The vars key specifies the Make column, and the cross key specifies the Campaign_Type and Model_Year columns. The columns in the cross key are paired with those specified in vars. This example will produce two two-way crosstabulations: Make by Campaign_Type and Make by Model_Year.

proc cas;
   * CAS table reference *;
   casTbl = {name = "WARRANTY_CLAIMS", caslib = "casuser"};
 
   * One-way frequency and two-way crosstabulation tables *;
   freqTab.freqTab / 
      table = casTbl,
      tabulate = {
 		'Campaign_Type',
		'Make',
              	{vars = 'Make', cross = {'Campaign_Type', 'Model_Year'}}
      };
quit;

Partial results

The results above show the freqTab action returns the one-way frequency tables for Campaign_Type and Make as shown earlier and displays the two-way crosstabulation between Make by Campaign_Type and Make by Model_Year.
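Conceptually, a two-way crosstabulation counts how often each pair of values occurs and adds row and column totals. The sketch below shows that idea in plain Python on hypothetical (Make, Campaign_Type) pairs; the campaign type values are invented for the illustration and the layout is not the freqTab output format.

```python
from collections import Counter

# Hypothetical (Make, Campaign_Type) pairs; illustrative values only
rows = [
    ("Zeus", "Recall"), ("Zeus", "Service"), ("Titan", "Recall"),
    ("Zeus", "Recall"), ("Titan", "Service"), ("Titan", "Service"),
]

# Count each (row value, column value) pair
cell_counts = Counter(rows)
row_levels = sorted({make for make, _ in rows})
col_levels = sorted({campaign for _, campaign in rows})

# Nested dictionary: crosstab[make][campaign] -> frequency (0 when the pair never occurs)
crosstab = {
    make: {campaign: cell_counts[(make, campaign)] for campaign in col_levels}
    for make in row_levels
}

# Marginal totals, like the total row and total column of a crosstabulation
row_totals = {make: sum(crosstab[make].values()) for make in row_levels}
col_totals = {c: sum(crosstab[m][c] for m in row_levels) for c in col_levels}

print(crosstab)
print(row_totals, col_totals)
```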

While this is great, what if I want to avoid the total row and display the Make for each row in the two-way crosstabulation as a single table?

Two-way crosstabulation as a single table

To present the results as a single table and remove the total row, add the tabDisplay parameter with the value list. In the code, I'll remove the one-way frequencies and add the tabDisplay parameter.

proc cas;
   * CAS table reference *;
   casTbl = {name = "WARRANTY_CLAIMS", caslib = "casuser"};
 
   * Two-way crosstabulation as a single table *;
   freqTab.freqTab / 
       table = casTbl,
       tabulate = {
              	{vars = 'Make', cross = {'Campaign_Type', 'Model_Year'}}
       }, 
       tabDisplay='list';
quit;

Partial results

The results show each two-way crosstabulation as a single table and the total row is removed.

Next, what if you want to create a three-way crosstabulation table? Well this is a bit tricky.

Three-way crosstabulation

To produce a three-way crosstabulation, specify the columns as a list within the vars parameter, with the tabDisplay parameter equal to list. If you do not specify the tabDisplay parameter, the freqTab action will return three two-way crosstabulation tables, one for each distinct value of the first column specified in the list. This example will create a crosstabulation of Model_Year by Campaign_Type by Make.

proc cas;
   * CAS table reference *;
   casTbl = {name = "WARRANTY_CLAIMS", caslib = "casuser"};
 
   * Three-way crosstabulation table *;
   freqTab.freqTab / 
       table = casTbl,
       tabulate = {
		{vars={'Model_Year','Campaign_Type','Make'}}
       }, 
       tabDisplay='list';
quit;

Partial results

Notice in the results above that a three-way crosstabulation table was created. The column order in the table follows the order of the columns specified in the list.
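A list-style three-way table can be pictured as one row per combination of the three columns, ordered by the columns in the order they were given. The plain-Python sketch below illustrates this on hypothetical rows. It lists only the combinations that occur in the sample and makes no claim about how the freqTab action treats combinations that never occur.

```python
from collections import Counter

# Hypothetical (Model_Year, Campaign_Type, Make) rows; illustrative values only
rows = [
    (2017, "Recall", "Zeus"), (2017, "Recall", "Zeus"), (2017, "Service", "Titan"),
    (2018, "Recall", "Zeus"), (2018, "Recall", "Titan"),
]

# Count each three-column combination
counts = Counter(rows)

# One output row per combination, sorted by Model_Year, then Campaign_Type, then Make
list_table = [
    {"Model_Year": year, "Campaign_Type": campaign, "Make": make, "Frequency": n}
    for (year, campaign, make), n in sorted(counts.items())
]

for row in list_table:
    print(row)
```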

Summary

The freqTab.freqTab CAS action provides a variety of ways to create frequency and crosstabulation tables in the distributed CAS server. There are a variety of parameters you can add to modify it to meet your objectives. We've only scratched the surface!

Additional resources

freqTab CAS action
SAS® Cloud Analytic Services: CASL Programmer’s Guide 
CAS Action! - a series on fundamentals
Getting Started with Python Integration to SAS® Viya® - Index
SAS® Cloud Analytic Services: Fundamentals

CAS-Action! Advanced Frequency Tables - Part 4 was published on SAS Users.

December 16, 2022
 

Welcome back to my SAS Users blog series CAS Action! - a series on fundamentals. In my previous part 1 and part 2 posts I reviewed how to use the simple.freq CAS action to generate frequency distributions for one or more columns and how to save the results. In this post I will show you how to group the results of the freq action.

In this example, I will use the CAS language (CASL) to execute the freq CAS action. Be aware, instead of using CASL, I could execute the same action with Python, R and more with some slight changes to the syntax for the specific language. Refer to the documentation for syntax in other languages.

Load the demonstration data into memory

I'll start by executing the loadTable action to load the WARRANTY_CLAIMS_0117.sashdat file from the Samples caslib into memory. By default the Samples caslib should be available in your SAS Viya environment. I'll load the table to the Casuser caslib and then clean up the CAS table by renaming and dropping columns to make the table easier to use. For more information on how to rename columns, check out my previous post. Lastly, I'll execute the fetch action to preview 5 rows.

proc cas;
   * Specify the input/output CAS table *;
   casTbl = {name = "WARRANTY_CLAIMS", caslib = "casuser"};
 
   * Load the CAS table into memory *;
    table.loadtable / 
        path = "WARRANTY_CLAIMS_0117.sashdat", caslib = "samples",
        casOut = casTbl + {replace=TRUE};
 
* Rename columns with the labels. Spaces replaced with underscores *;
 
   *Store the results of the columnInfo action in a dictionary *;
   table.columnInfo result=cr / table = casTbl;
 
   * Loop over the columnInfo result table and create a list of dictionaries *;
   listElementCounter = 0;
   do columnMetadata over cr.ColumnInfo;
	listElementCounter = listElementCounter + 1;
	convertColLabel = tranwrd(columnMetadata['Label'],' ','_');
	renameColumns[listElementCounter] = {name = columnMetadata['Column'], rename = convertColLabel, label=""};
   end;
 
   * Rename columns *;
   keepColumns = {'Campaign_Type', 'Platform','Trim_Level','Make','Model_Year','Engine_Model',
                  'Vehicle_Assembly_Plant','Claim_Repair_Start_Date', 'Claim_Repair_End_Date'};
   table.alterTable / 
	name = casTbl['Name'], caslib = casTbl['caslib'], 
	columns=renameColumns,
	keep = keepColumns;
 
   * Preview CAS table *;
   table.fetch / table = casTbl, to = 5;
quit;

The results above show a preview of the warranty_claims CAS table.

Add a grouping column

What if you want a frequency distribution of each Model_Year by Make? You can easily do that with the freq action. The key is adding the groupBy sub parameter when referencing the CAS table. Then use the grouped CAS table in the freq action and specify the column to analyze. In this example, the CAS table is grouped by Model_Year and the freq action specifies the Make column.

proc cas;
   * Reference the CAS table and group by Model_Year *;
   casTbl = {name = "WARRANTY_CLAIMS", 
             caslib = "casuser",
	     groupby = "Model_Year"};
 
   * Model_Year by Make frequency *;
   simple.freq / table = casTbl, input = 'Make';
quit;

 

Partial results

The above results show that the freq action returns a separate table for each distinct Model_Year. While this is great information, what if I want a single table with the Model_Year by Make?
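The "one table per group" behavior can be pictured as a separate frequency count for each distinct Model_Year. The following plain-Python sketch uses hypothetical rows and is only meant to show the shape of the result, not how the CAS server computes it.

```python
from collections import Counter, defaultdict

# Hypothetical (Model_Year, Make) rows; illustrative values only
rows = [(2016, "Zeus"), (2016, "Titan"), (2016, "Zeus"), (2017, "Zeus"), (2017, "Ares")]

# One Counter (frequency table) per distinct Model_Year
freq_by_year = defaultdict(Counter)
for year, make in rows:
    freq_by_year[year][make] += 1

for year in sorted(freq_by_year):
    print(year, dict(freq_by_year[year]))
```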

Saving the results as a CAS table

One option is saving the results as a CAS table. This works similarly to my previous post CAS-Action! Saving Frequency Tables - Part 2. Simply add the casOut parameter to the freq action. I'll add a label to the new CAS table to give it a description and then preview the new CAS table with the fetch action.

proc cas;
   * Reference the CAS table and group by Model_Year *;
   casTbl = {name = "WARRANTY_CLAIMS", 
             caslib = "casuser",
	     groupby = "Model_Year"};
 
   * Specify the output CAS table information *;
   outputTbl = {name = "yearByMake", caslib = "casuser"};
 
   * Get a frequency of Model_Year by Make and create a CAS table *;
   simple.freq / 
	table = casTbl, 
	input = 'Make',
	casOut = outputTbl || {label = "Year by Make frequency table"};
 
   * Preview the CAS table *;
   table.fetch / table = outputTbl;
quit;

The results above show the freq action with the casOut parameter returns information about the newly created CAS table, and the fetch action returns a preview of the new CAS table. Notice the analysis is grouped by Model_Year and is consolidated into a single table.

Saving the results as a SAS data set

Instead of saving the results back to the CAS server, you can save them as a SAS data set. I did an example in my previous post CAS-Action! Saving Frequency Tables - Part 2. However, when you save the results of an action summarized by groups, you need to combine each individual group into a single result table. To do that you need to use the COMBINE_TABLES function on the dictionary returned from the CAS server. The COMBINE_TABLES function will combine each individual table in a dictionary and return a single result table. Then you can save the new table by using the SAVERESULT statement. Lastly, I'll view the SAS data set using the PRINT procedure.

proc cas;
   * Reference the CAS table and group by Model_Year *;
   casTbl = {name = "WARRANTY_CLAIMS", 
             caslib = "casuser",
	     groupby = "Model_Year"};
 
   * Specify the output CAS table information *;
   outputTbl = {name = "yearByMake", caslib = "casuser"};
 
   * Get a frequency of Model_Year by Make and store the results in a dictionary *;
   simple.freq result=freq_cr / 
	table = casTbl, 
	input = 'Make';
 
   * Combine all the tables in the dictionary and create a result table *;
   freqTbl = combine_tables(freq_cr);
 
   * Save the result table as a SAS data set *;
   saveresult freqTbl dataout=work.yearByMake;
quit;
 
* Preview the SAS data set *;
proc print data=work.yearByMake;
run;

The results above show the new SAS data set. Once you save the results of the CAS server as a SAS data set, you can use familiar SAS knowledge to continue processing the data on the compute server.
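The combining step amounts to stacking the per-group tables into one. Here is a minimal plain-Python sketch of that idea. The dictionary keys, the sample rows and the `combine_group_tables` helper are all hypothetical; the sketch assumes each per-group table already carries its Model_Year value, and it is not the implementation of the CASL COMBINE_TABLES function.

```python
# Hypothetical per-group result tables, keyed by an invented group label
group_results = {
    "group_2016": [
        {"Model_Year": 2016, "CharVar": "Titan", "Frequency": 1},
        {"Model_Year": 2016, "CharVar": "Zeus", "Frequency": 2},
    ],
    "group_2017": [
        {"Model_Year": 2017, "CharVar": "Ares", "Frequency": 1},
        {"Model_Year": 2017, "CharVar": "Zeus", "Frequency": 1},
    ],
}


def combine_group_tables(tables):
    """Stack the rows of every per-group table into a single list of rows."""
    combined = []
    for key in sorted(tables):  # keep a stable group order
        combined.extend(tables[key])
    return combined


combined_table = combine_group_tables(group_results)
for row in combined_table:
    print(row)
```

Because every row keeps its group value, the single table can be filtered or plotted by Model_Year afterwards, which is what the SGPLOT step later in this post relies on.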

Plot the results of the freq action

Now, let's explore and visualize this data. You can use the SGPLOT procedure to visualize the summarized results from the CAS server that you saved as a SAS data set.

title height=14pt justify=left color=charcoal "Total Number of Warranty Claims by Model Year and Car Make";
title2 "";
proc sgplot data=work.yearByMake
			noborder;
	vline Model_Year / 
			group = CharVar 
			Response=Frequency
			markers;
	format Frequency comma16.;
	keylegend / position=topleft title='Car Makes';
	label Frequency='Warranty Claims';
	xaxis display=(nolabel);
run;

The visualization above shows that the Zeus car make accounts for the most warranty claims in all years, with a huge increase in warranty claims in 2016, 2017 and 2018.

Summary

Using the groupBy sub parameter when referencing a CAS table enables you to easily group the results of a CAS action. When using the groupBy parameter:

  • the action will return separate result tables for each distinct grouped value
  • the casOut parameter in the action creates a single CAS table with all the groups
  • the COMBINE_TABLES function combines each distinct group by result table in the dictionary to create a single result table, and then saves that as a SAS data set

Additional resources

freq action
COMBINE_TABLES function
SAVERESULT statement
Plotting a Cloud Analytic Services (CAS) In-Memory Table
SAS® Cloud Analytic Services: CASL Programmer’s Guide 
SAS® Cloud Analytic Services: Fundamentals
CAS Action! - a series on fundamentals
Getting Started with Python Integration to SAS® Viya® - Index

CAS-Action! Grouping Frequency Tables - Part 3 was published on SAS Users.

December 12, 2022
 

Welcome back to my SAS Users blog series CAS Action! - a series on fundamentals. In my previous post CAS-Action! Simple Frequency Tables - Part 1, I reviewed how to use the simple.freq CAS action to generate frequency distributions for one or more columns using the distributed CAS server. In this post I will show you how to save the results of the freq action as a SAS data set or a distributed CAS table.

In this example, I will use the CAS language (CASL) to execute the freq CAS action. Be aware, instead of using CASL, I could execute the same action with Python, R and more with some slight changes to the syntax for the specific language. Refer to the documentation for syntax in other languages.

Load the demonstration data into memory

I'll start by executing the loadTable action to load the WARRANTY_CLAIMS_0117.sashdat file from the Samples caslib into memory. By default the Samples caslib should be available in your SAS Viya environment. I'll load the table to the Casuser caslib and then clean up the CAS table by renaming and dropping columns to make the table easier to use. For more information on how to rename columns, check out my previous post. Lastly, I'll execute the fetch action to preview 5 rows.

proc cas;
   * Specify the input/output CAS table *;
   casTbl = {name = "WARRANTY_CLAIMS", caslib = "casuser"};
 
   * Load the CAS table into memory *;
    table.loadtable / 
        path = "WARRANTY_CLAIMS_0117.sashdat", caslib = "samples",
        casOut = casTbl + {replace=TRUE};
 
* Rename columns with the labels. Spaces replaced with underscores *;
 
   *Store the results of the columnInfo action in a dictionary *;
   table.columnInfo result=cr / table = casTbl;
 
   * Loop over the columnInfo result table and create a list of dictionaries *;
   listElementCounter = 0;
   do columnMetadata over cr.ColumnInfo;
	listElementCounter = listElementCounter + 1;
	convertColLabel = tranwrd(columnMetadata['Label'],' ','_');
	renameColumns[listElementCounter] = {name = columnMetadata['Column'], rename = convertColLabel, label=""};
   end;
 
   * Rename columns *;
   keepColumns = {'Campaign_Type', 'Platform','Trim_Level','Make','Model_Year','Engine_Model',
                  'Vehicle_Assembly_Plant','Claim_Repair_Start_Date', 'Claim_Repair_End_Date'};
   table.alterTable / 
	name = casTbl['Name'], caslib = casTbl['caslib'], 
	columns=renameColumns,
	keep = keepColumns;
 
   * Preview CAS table *;
   table.fetch / table = casTbl, to = 5;
quit;

The results above show a preview of the warranty_claims CAS table.

One Way Frequency for Multiple Columns

Next, I'll execute the freq action to generate a frequency distribution for multiple columns.

proc cas;
   casTbl = {name = "WARRANTY_CLAIMS", caslib = "casuser"};
   colNames = {'Model_Year', 
               'Vehicle_Assembly_Plant', 
	       {name = 'Claim_Repair_Start_Date', format = 'yyq.'}
   };
   simple.freq / table= casTbl, inputs = colNames;
quit;

The freq CAS action returns the frequency distribution of each column in a single result. While this is great, what if you want to create a visualization with the data? Or continue processing the summarized data? How do you save this as a table? Well, you have a few options.

Save the results as a SAS data set

First, you can save the results of a CAS action as a SAS data set. The idea here is the CAS action will process the data in the distributed CAS server, and then the CAS server returns smaller, summarized results to the client (SAS Studio). The summarized results can then be saved as a SAS data set.

To save the results of a CAS action, simply add the result option after the action with a variable name. The action returns its results to the client as a dictionary, which is stored in the specified variable. For example, to save the results of the freq action as a SAS data set, complete the following steps:

  1. Execute the same CASL code from above, but this time specify the result option with a variable name to store the results of the freq action. Here I'll save the results in the variable freq_cr.
  2. Use the DESCRIBE statement to view the structure and data type of the CASL variable freq_cr in the log (not required).
  3. Use the SAVERESULT statement to save the CAS action result table from the dictionary freq_cr as a SAS data set named warranty_freq. To do this, specify the key Frequency that is stored in the dictionary freq_cr to obtain the result table.

proc cas;
   * Reference the CAS table *;
   casTbl = {name = "WARRANTY_CLAIMS", caslib = "casuser"};
 
   * Specify the columns to analyze *;
   colNames = {'Model_Year', 
               'Vehicle_Assembly_Plant', 
               {name = 'Claim_Repair_Start_Date', format = 'yyq.'}
   };
   * 1. Analyze the CAS table and store the results *;
   simple.freq result = freq_cr / table= casTbl, inputs = colNames;
 
   * 2. View the dictionary in the log *;
   describe freq_cr;
 
  * 3. Save the result table as a SAS data set *;
   saveresult freq_cr['Frequency'] dataout=work.warranty_freq;
quit;

SAS Log

In the log, the results of the DESCRIBE statement show the variable freq_cr is a dictionary with one entry. It contains the key Frequency and the value is a result table. The table contains 22 rows and 6 columns. The NOTE in the log shows the SAVERESULT statement saved the result table from the dictionary as a SAS data set named warranty_freq in the work library.

Once the summarized results are stored in a SAS library, use your traditional SAS programming knowledge to process the SAS table. For example, now I can visualize the summarized data using the SGPLOT procedure.

* Plot the SAS data set *;
title justify=left height=16pt "Total Warranty Claims by Year";
proc sgplot data=work.warranty_freq noborder;
	where Column = 'Model_Year';
	vbar Charvar / 
		response = Frequency
		nooutline;
	xaxis display=(nolabel);
	label Frequency = 'Total Claims';
	format Frequency comma16.;
run;

Save the Results as a CAS Table

Instead of saving the summarized results as a SAS data set, you can create a new CAS table on the CAS server. To do that all you need is to add the casOut parameter in the action. Here I'll save the results of the freq CAS action to a CAS table named warranty_freq in the Casuser caslib, and I will give the table a descriptive label.

proc cas;
   * Reference the CAS table *;
   casTbl = {name = "WARRANTY_CLAIMS", caslib = "casuser"};
 
   * Specify the columns to analyze *;
   colNames = {'Model_Year', 
               'Vehicle_Assembly_Plant', 
               {name = 'Claim_Repair_Start_Date', format = 'yyq.'}
   };
 
   * Analyze the CAS table and create a new CAS table *;
   simple.freq / 
	table= casTbl, 
	inputs = colNames,
	casOut = {
		name = 'warranty_freq',
		caslib = 'casuser',
		label = 'Frequency analysis by year, assembly plant and repair date by quarter'
	};
quit;

The results above show the freq action returned information about the newly created CAS table. Once you have a CAS table in the distributed CAS server you can continue working with it using CAS, or you can visualize the data like we did before using SGPLOT. The key concept here is the SGPLOT procedure does not visualize data on the CAS server. The SGPLOT procedure returns the entire CAS table back to SAS (compute server) as a SAS data set, then the visualization occurs on the client. This means if the CAS table is large, an error or slow processing might occur. However, in our scenario we created a smaller summarized CAS table, so sending 22 rows back to the client (compute server) isn't going to be an issue.

* Make a library reference to a Caslib *;
libname casuser cas caslib='casuser';
 
 
* Plot the CAS table *;
title justify=left height=16pt "Total Warranty Claims by Year";
proc sgplot data=casuser.warranty_freq noborder;
	where _Column_ = 'Model_Year';
	vbar _Charvar_ / 
		response = _Frequency_
		nooutline;
	xaxis display=(nolabel);
	label _Frequency_ = 'Total Claims';
	format _Frequency_ comma16.;
run;

Summary

Using the freq CAS action enables you to generate a frequency distribution for one or more columns and to save the results as a SAS data set or a CAS table. The keys to this process are:

  • CAS actions execute on the distributed CAS server and return summarized results back to the client as a dictionary. You can store the dictionary using the result option.
  • Using dictionary manipulation techniques and the SAVERESULT statement you can save the summarized result table from the dictionary as a SAS data set. Once you have the SAS data set you can use all of your familiar SAS programming knowledge on the traditional compute server.
  • Using the casOut parameter in a CAS action enables you to save the summarized results in the distributed CAS server.
  • The SGPLOT procedure does not execute in CAS. If you specify a CAS table in the SGPLOT procedure, the entire CAS table will be sent back to SAS compute server for processing. This can cause an error or slow processing on large tables.
  • Best practice is to summarize large data in the CAS server, and then work with the summarized results on the compute server.

Additional resources

freq action
DESCRIBE statement
SAVERESULT statement
Plotting a Cloud Analytic Services (CAS) In-Memory Table
SAS® Cloud Analytic Services: CASL Programmer’s Guide 
SAS® Cloud Analytic Services: Fundamentals
CAS Action! - a series on fundamentals
Getting Started with Python Integration to SAS® Viya® - Index

 

CAS-Action! Saving Frequency Tables - Part 2 was published on SAS Users.

December 7, 2022
 

Welcome back to my SAS Users blog series CAS Action! - a series on fundamentals. If you'd like to start by learning more about the distributed CAS server and CAS actions, please see CAS Actions and Action Sets - a brief intro. Otherwise, let's learn how to generate frequency distributions for one or more columns using the simple.freq CAS action.

In this example, I will use the CAS language (CASL) to execute the freq CAS action. Be aware, instead of using CASL, I could execute the same action with Python, R and more with some slight changes to the syntax for the specific language. Refer to the documentation for syntax in other languages.

Load the demonstration data into memory

I'll start by executing the loadTable action to load the WARRANTY_CLAIMS_0117.sashdat file from the Samples caslib into memory. By default the Samples caslib should be available in your SAS Viya environment. I'll load the table to the Casuser caslib and then clean up the CAS table by renaming and dropping columns to make the table easier to use. For more information on how to rename columns, check out my previous post. Lastly, I'll execute the fetch action to preview 5 rows.

proc cas;
   * Specify the input/output CAS table *;
   casTbl = {name = "WARRANTY_CLAIMS", caslib = "casuser"};
 
   * Load the CAS table into memory *;
    table.loadtable / 
        path = "WARRANTY_CLAIMS_0117.sashdat", caslib = "samples",
        casOut = casTbl + {replace=TRUE};
 
* Rename columns with the labels. Spaces replaced with underscores *;
 
   *Store the results of the columnInfo action in a dictionary *;
   table.columnInfo result=cr / table = casTbl;
 
   * Loop over the columnInfo result table and create a list of dictionaries *;
   listElementCounter = 0;
   do columnMetadata over cr.ColumnInfo;
	listElementCounter = listElementCounter + 1;
	convertColLabel = tranwrd(columnMetadata['Label'],' ','_');
	renameColumns[listElementCounter] = {name = columnMetadata['Column'], rename = convertColLabel, label=""};
   end;
 
   * Rename columns *;
   keepColumns = {'Campaign_Type', 'Platform','Trim_Level','Make','Model_Year','Engine_Model',
                  'Vehicle_Assembly_Plant','Claim_Repair_Start_Date', 'Claim_Repair_End_Date'};
   table.alterTable / 
	name = casTbl['Name'], caslib = casTbl['caslib'], 
	columns=renameColumns,
	keep = keepColumns;
 
   * Preview CAS table *;
   table.fetch / table = casTbl, to = 5;
quit;

The results above show a preview of the warranty_claims CAS table.

One-way frequency table for a single column

To create a simple one-way frequency for a single column use the simple.freq CAS action. In the freq action, use the table parameter to specify the CAS table and the inputs parameter to specify the column to analyze. Here I'm using the warranty_claims CAS table and analyzing the Make column.

proc cas;
   casTbl = {name = "WARRANTY_CLAIMS", caslib = "casuser"};
   simple.freq / table= casTbl, inputs = 'Make';
quit;

The freq action generates a simple one-way frequency table in the distributed CAS server and returns the results to the client. The results of the freq action include:

  • the column that was analyzed in the Column column
  • the distinct values for that column are shown in the Character Value column
  • if a format is associated with that column it appears in the Formatted Value column; if no format exists, you see the same values
  • the Frequency column represents the number of times that value occurs in the column

In the results above, we see the Zeus car make has the most warranty claims.

One-way frequency for multiple columns

To specify multiple columns in the freq action, add a list of columns to the inputs parameter. Here, I'll create a variable named colNames to store a list. In the list, I'll specify the Model_Year, Vehicle_Assembly_Plant and Engine_Model columns, and then use the variable in the inputs parameter.

proc cas;
   casTbl = {name = "WARRANTY_CLAIMS", caslib = "casuser"};
   colNames = {'Model_Year', 'Vehicle_Assembly_Plant', 'Engine_Model'};
   simple.freq / table= casTbl, inputs = colNames;
quit;

In the results above, we see the action returns a single result table with all three columns summarized. The column that was analyzed is shown in the Column column.
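The shape of that single result table can be pictured with a small Python sketch. This is an illustration only: the rows are made up, the helper name stacked_freq is hypothetical, and the real result table has more columns than the three shown here.

```python
from collections import Counter

# Made-up rows; the column names come from the post, the values do not.
rows = [
    {"Model_Year": 2015, "Engine_Model": "4 cylinder"},
    {"Model_Year": 2016, "Engine_Model": "6 cylinder"},
    {"Model_Year": 2015, "Engine_Model": "4 cylinder"},
]

def stacked_freq(rows, col_names):
    """Count each column separately and stack the counts into one table."""
    result = []
    for col in col_names:
        counts = Counter(row[col] for row in rows)
        # Sort the distinct values so the row order is deterministic.
        for value in sorted(counts):
            result.append({"Column": col, "Value": value, "Frequency": counts[value]})
    return result

stacked = stacked_freq(rows, ["Model_Year", "Engine_Model"])
for row in stacked:
    print(row)
```

Each output row carries the name of the analyzed column, which is what lets one table hold the summaries for several columns.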

Apply a SAS format in the freq action

What if you want to apply a SAS date format to a column during analysis? For example, the Claim_Repair_Start_Date column contains a SAS date value with the DATE9 format. Instead of the detailed DATE9 format, what if I want to see the total number of repairs by year and quarter? Or by year? Or by year and month? You can easily apply a SAS format when using CAS actions.

Let's start by executing the freq CAS action on the Claim_Repair_Start_Date column.

proc cas;
   casTbl = {name = "WARRANTY_CLAIMS", caslib = "casuser"};
   simple.freq / table= casTbl, inputs = 'Claim_Repair_Start_Date';
quit;

The results above show that the action created a one-way frequency of Claim_Repair_Start_Date using the DATE9 format stored with the column in the CAS table. The Numeric Value column shows the raw SAS date values and the Formatted Value column shows the formatted dates.

Now, this analysis is too detailed. I don't want to see repairs by start date. Instead, I'll apply the YYQ format to summarize the dates by year and quarter. You can apply the format within the inputs parameter. In the inputs parameter specify a list of dictionaries. Here I'll use a single dictionary in the list and apply the YYQ format to the Claim_Repair_Start_Date column.

proc cas;
   casTbl = {name = "WARRANTY_CLAIMS", caslib = "casuser"};
   simple.freq / 
	table= casTbl, 
	inputs = {
		{name = 'Claim_Repair_Start_Date', format = 'yyq.'}
	};
quit;

The results above display the frequency by year and quarter. The ability to apply a SAS format during execution enables us to quickly summarize data in a variety of ways.
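To make the effect of the format concrete, here is a plain Python sketch of the same idea: map each date to a year and quarter label, then count the labels. The function name yyq and the sample dates are assumptions for illustration, and the label style (for example 2016Q1) imitates what the YYQ format displays. In CAS the grouping happens inside the action, not on the client.

```python
from collections import Counter
from datetime import date

def yyq(d):
    """Return a year and quarter label such as 2016Q1 for a date."""
    quarter = (d.month - 1) // 3 + 1
    return f"{d.year}Q{quarter}"

# Made-up repair start dates.
repair_start_dates = [
    date(2016, 1, 15),
    date(2016, 2, 3),
    date(2016, 5, 20),
    date(2017, 11, 1),
]

# Group by the formatted value instead of the raw date.
by_quarter = Counter(yyq(d) for d in repair_start_dates)
print(dict(by_quarter))
```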

Create a calculated column in the freq action

Lastly, you can also create calculated columns within an action for ad-hoc analysis. Here I'll create a new column named Make_Platform that concatenates the Make and Platform columns by specifying an expression in the variable calculateMakePlatform. Then I'll add the calculation to the computedVarsProgram parameter in my CAS table reference. Finally, I'll add the new column name to the inputs parameter in the freq action. For more information about creating calculated columns in a CAS table, check out my previous post.

proc cas;
   calculateMakePlatform = 'Make_Platform = catx("-",Make,Platform)';
   casTbl = {name = "WARRANTY_CLAIMS", 
             caslib = "casuser",
             computedVarsProgram = calculateMakePlatform};
   simple.freq / 
	table= casTbl,
	inputs = 'Make_Platform';
quit;

 

The results above show the Zeus-XE has the highest number of warranty claims.
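If the catx function is new to you, the following Python sketch approximates what the expression does for character values: trim each item, skip missing or blank items, and join the rest with the delimiter. The Python catx here is a hypothetical stand-in written for this example, not a SAS or SWAT function, and the sample pairs other than Zeus and XE are made up.

```python
from collections import Counter

def catx(delimiter, *items):
    """Rough approximation of SAS catx for character values."""
    # Trim each item and drop missing (None) or blank ones.
    parts = [str(item).strip() for item in items if item is not None]
    return delimiter.join(part for part in parts if part)

# Made-up (Make, Platform) pairs, including stray blanks.
pairs = [("Zeus", "XE"), ("Zeus ", " XE"), ("Titan", "TX")]

# Compute the new value for every row, then count it.
make_platform = Counter(catx("-", make, platform) for make, platform in pairs)
print(dict(make_platform))
```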

While viewing the results of the analysis is great, how can I work with these results? Maybe I want to create a visualization? What about creating another CAS table or SAS data set with these results? What about an Excel report? How can we do this? Well, stay tuned for part 2!

Summary

Using the freq CAS action enables you to generate a frequency distribution for one or more columns, apply SAS formats during analysis, and even create calculated columns. CAS actions are optimized to run in the distributed CAS server, are flexible, and can be executed in a variety of languages like Python and R!

Additional resources

freq action
SAS® Cloud Analytic Services: CASL Programmer’s Guide 
CAS Action! - a series on fundamentals
Getting Started with Python Integration to SAS® Viya® - Index
SAS® Cloud Analytic Services: Fundamentals

CAS-Action! Simple Frequency Tables - Part 1 was published on SAS Users.

December 2, 2022
 

Welcome back to my SAS Users blog series CAS Action! - a series on fundamentals. If you'd like to start by learning more about the distributed CAS server and CAS actions, please see CAS Actions and Action Sets - a brief intro. Otherwise, let's learn how to rename columns in CAS tables.

In this example, I will use the CAS language (CASL) to execute the alterTable CAS action. Note that instead of CASL, I could execute the same action from Python, R and other languages with slight changes to the syntax. Refer to the documentation for syntax in other languages.

Load the demonstration data into memory

I'll start by executing the loadTable action to load the WARRANTY_CLAIMS_0117.sashdat file from the Samples caslib into memory in the Casuser caslib. By default the Samples caslib should be available in your SAS Viya environment. Then I'll preview the CAS table using the columnInfo and fetch CAS actions.

* Connect to the CAS server and name the connection CONN *;
cas conn;
 
proc cas;
   * Specify the output CAS table *;
   casTbl = {name = "WARRANTY_CLAIMS", caslib = "casuser"};
 
   * Load the CAS table *;
   table.loadtable / 
      path = "WARRANTY_CLAIMS_0117.sashdat", caslib = "samples",
      casOut = casTbl;
 
    * Preview the CAS table *;
    table.columnInfo / table = casTbl;
    table.fetch / table = casTbl, to = 5;
quit;

The columnInfo action returns information about each column. Notice that the WARRANTY_CLAIMS CAS table has column names and column labels.

The fetch CAS action returns five rows.

Notice that by default the fetch action uses column labels in the header.

Rename columns in a CAS table

To rename columns in a CAS table, use the alterTable CAS action. In the alterTable action, specify the CAS table using the name and caslib parameters. Additionally, use the columns parameter to specify the columns to modify. The columns parameter requires a list of dictionaries, where each dictionary specifies one column to modify.

Here, I'll rename the claim_attribute_1, seller_attribute_5 and product_attribute_1 columns. Then I'll execute the columnInfo action to view the updated column information.

proc cas;
   * Reference the CAS table *;
   casTbl = {name = "WARRANTY_CLAIMS", caslib = "casuser"};
 
   * Rename columns *;
   table.alterTable / 
      name = casTbl['name'], caslib = casTbl['caslib'],
      columns = {
	{name = 'claim_attribute_1', rename = 'Campaign_Type'},
	{name = 'seller_attribute_5', rename = 'Selling_Dealer'},
	{name = 'product_attribute_1', rename = 'Vehicle_Class'}
      };
 
   * View column metadata *;
   table.columnInfo / table = casTbl;
quit;

The results show that the alterTable CAS action renamed the columns to Campaign_Type, Selling_Dealer and Vehicle_Class. While this worked, what if you wanted to rename all columns in the CAS table using the column labels?

Rename all columns using the column labels

I'll dynamically rename the CAS table columns using the column labels. Since the column labels contain spaces, I'll also replace all spaces with an underscore. Now, I could manually specify each column and column label in the alterTable action, but why do all that work? Instead you can dynamically create a list of dictionaries for use in the alterTable action.

proc cas;
   * Reference the CAS table *;
   casTbl = {name = "WARRANTY_CLAIMS", caslib = "casuser"};
 
   * Rename columns with the labels. Spaces replaced with underscores *;
 
   *1. Store the results of the columnInfo action in a dictionary *;
   table.columnInfo result=cr / table = casTbl;
 
   *Loop over the columnInfo result table and create a list of dictionaries *;
   *2*;
   listElementCounter = 0;
   *3*;
   do columnMetadata over cr.ColumnInfo;
	*4.1*; listElementCounter = listElementCounter + 1;
	*4.2*; convertColLabel = tranwrd(columnMetadata['Label'],' ','_');
	*4.3*; renameColumns[listElementCounter] = {name = columnMetadata['Column'], rename = convertColLabel};
   end;
 
   *5. Rename columns *;
   table.alterTable / 
	name = casTbl['Name'], 
	caslib = casTbl['caslib'], 
	columns=renameColumns;
 
   *6. Preview CAS table *;
   table.columnInfo / table = casTbl;
quit;

The numbered comments in the program mark the following steps:

  1. The columnInfo action stores its results in a dictionary named cr.
  2. The variable listElementCounter will act as a counter that can be used to append each dictionary to the list.
  3. Loop over the result table stored in the cr dictionary. When you loop over a result table, each row is treated as a dictionary. The key is the column name and it returns the value of that column.
  4. In the loop:
    1. increment the counter
    2. access the column label and replace all spaces with underscores using the tranwrd function
    3. add a dictionary to the list named renameColumns that contains the column to rename and its new name.
  5. The alterTable action will use the list of dictionaries to rename each column.
  6. The columnInfo action will display the new column information.

The results show that each column was dynamically renamed using the column label and the spaces replaced with underscores.
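The same loop logic can be sketched in plain Python. The column_info rows below are a hand-written stand-in for the columnInfo result table, with labels inferred from the renamed columns shown earlier, and the last row is a hypothetical column without a label. Falling back to the column name when the label is empty is my own addition, not part of the CASL program above.

```python
# Stand-in for the columnInfo result table: one dictionary per column.
column_info = [
    {"Column": "claim_attribute_1", "Label": "Campaign Type"},
    {"Column": "seller_attribute_5", "Label": "Selling Dealer"},
    {"Column": "product_attribute_1", "Label": "Vehicle Class"},
    {"Column": "unlabeled_col", "Label": ""},  # hypothetical column with no label
]

rename_columns = []
for column_metadata in column_info:
    # Use the label when present, otherwise keep the existing column name.
    label = column_metadata["Label"] or column_metadata["Column"]
    # Replace every space with an underscore, like tranwrd in the CASL code.
    new_name = label.replace(" ", "_")
    rename_columns.append({"name": column_metadata["Column"], "rename": new_name})

for entry in rename_columns:
    print(entry)
```

The resulting list has the list-of-dictionaries shape that the columns parameter expects.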

Summary

In summary, using the alterTable CAS action enables you to rename columns in a CAS table. With some knowledge of lists, dictionaries and loops in the CAS language, you can dynamically use the column labels to rename the columns. When using the alterTable action remember that:

  • The name and caslib parameters specify the CAS table.
  • The columns parameter requires a list of dictionaries.
  • Each dictionary specifies the column to modify.

Want to learn how to do this using Python? Check out my post Getting started with Python integration to SAS® Viya® - Part 11 - Rename Columns.

Additional resources

simple.freq CAS action
SAS® Cloud Analytic Services: CASL Programmer’s Guide 
CAS Action! - a series on fundamentals
Getting Started with Python Integration to SAS® Viya® - Index
SAS® Cloud Analytic Services: Fundamentals

CAS-Action! Rename Columns in a CAS Table was published on SAS Users.