
May 31, 2023
 

Inserting various programming languages into SAS programs

SAS consistently expands its support for running non-SAS code inside SAS programs. It's been a while since SAS introduced the explicit SQL pass-through facility, which allows SAS users to embed native Database Management System (DBMS) SQL code in PROC SQL and PROC DS2.

Similarly, you can now run R code within PROC IML.

SAS Viya added PROC PYTHON, enabling you to embed code in the increasingly popular Python programming language within your SAS programs.

However, while SAS developers welcome such cross-language functionality, its implementation often makes SAS programs bulky and overloaded, hard to read and difficult to follow. In the spirit of modular programming, it would be nice simply to reference those "foreign" modules without actually inserting their code into SAS programs, much as the %INCLUDE statement brings pieces of SAS code stored in files into a SAS program.

Limitations of %INCLUDE

The problem with using %INCLUDE to bring non-SAS code into a SAS program stems from the fact that %INCLUDE is not a SAS macro statement, as it might seem to be. That's right: despite prominently displaying the % sign of distinction in front of itself, it has nothing to do with the SAS macro facility. I know, it is misleading and it is a misnomer. Let's call it an exception. The fact is that %INCLUDE is a SAS global statement, and as such it is not processed by the SAS macro processor.

Being a global statement, %INCLUDE must be placed between other SAS statements (or be the very first or last statement in your SAS program). It cannot be positioned within and be part of other SAS statements.

In the case of native SQL, the code resides within a SAS statement, namely inside the parentheses of the FROM CONNECTION TO dbms ( ) clause of the explicit SQL pass-through construct. Therefore, placing %INCLUDE where the native SQL code is expected will generate a syntax error. Bummer!

Were %INCLUDE a true macro object, the problem described above would not exist.

%EMBED macro function

Well, let’s leave the %INCLUDE alone and instead create a macro function (we'll call it %EMBED) that will do what %INCLUDE does - bringing the contents of an external code file into a SAS program.  The key difference though will be that we will make it a true macro function, which runs by the SAS macro processor during SAS program compilation (before execution). Then by the SAS program execution time, the contents of the external code file will be already injected where this macro function is called. Therefore, one can invoke (place) it inside other SAS statements.

The %EMBED macro function does not care whether it embeds SAS or non-SAS code; the contents of the external file determine what's embedded. It is up to you, the SAS developer, to place the macro call strategically, wherever the code from the external file is appropriate.

Here is how we can easily implement such a macro function:

/* -------------------------------------------------------------------
| DESCRIPTION | Macro to embed any code from a file into SAS program.
|---------------------------------------------------------------------
| INVOCATION  | %embed(full-file-name) or %embed(file-reference)
|---------------------------------------------------------------------
| AUTHOR      | Leonid Batkhan, May 2023
|-------------------------------------------------------------------*/
%macro embed(f);
  %local p ref rc fid;
  %let p = %sysfunc(findc(&f,/\:.));
  %if &p %then %let rc = %sysfunc(filename(ref, &f));
         %else %let ref = &f;
  %let fid = %sysfunc(fopen(&ref));
  %if &fid>0 %then
    %do %while(%sysfunc(fread(&fid))=0);
      %let rc = %sysfunc(fget(&fid, line, 32767));
      &line
    %end;
    %else %put ERROR: Macro &sysmacroname - file "&f" cannot be opened.;
  %let rc = %sysfunc(fclose(&fid));
  %if &p %then %let rc = %sysfunc(filename(ref));
%mend embed;

The %EMBED macro function has a single argument (parameter) representing either a full-path file name (including extension) or a fileref assigned with a FILENAME statement or function. It is "smart enough" to distinguish between a file name and a fileref and process each accordingly.

It returns as its value the contents of the external file specified in its argument. This value may contain multiple lines of code.

%EMBED macro function code highlights

The first %let p= statement determines whether the argument f represents a full-path file name (p>0) or a fileref (p=0). We deduce this from the fact that a file name must contain at least one of the / \ : . characters, while a fileref may not contain any of them.
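The FINDC test is easy to mirror in other languages. Here is a rough Python equivalent of the path-versus-fileref check (an illustrative sketch, not part of the macro):

```python
def looks_like_path(name: str) -> bool:
    """Mimic findc(name, '/\\:.'): a full-path file name contains at least
    one of these characters, while a valid fileref never does."""
    return any(ch in name for ch in "/\\:.")

print(looks_like_path(r"C:\project\query1.sql"))  # True: treat as file name
print(looks_like_path("myref"))                   # False: treat as fileref
```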

Then the %if-%then-%else statement either creates its own fileref using %sysfunc(filename(ref, &f)) (if &p>0) or assigns the macro variable ref the value of &f (if &p=0). Since we do not provide an initial value for the ref macro variable, the FILENAME function assigns a unique system-generated fileref, which starts with #LN followed by 5 digits (e.g. #LN00009). This ensures that the fileref will not interfere with other potential filerefs assigned outside of this macro.

Then we open this file. If the file fails to open (fid=0), we output an ERROR message to the SAS log.

If the file opens successfully (fid>0), we loop through it (while fread=0) using the fread( ) function, which loads one record per iteration into the file data buffer (FDB), and the fget( ) function, which copies data from the FDB into the macro variable line. Note that in %sysfunc(fget(&fid, line, 32767)) the second argument (line) does not need an & in front of it.

The key here is the following line of code:

&line

This macro variable reference (not a %put &line) just "injects" the value of the macro variable line into the return value of the macro function. On each iteration of the do-loop, we read a line of code from the external file and append it to the return value of the %embed macro function.

After the loop, we close the file and conditionally de-assign fileref (if we assigned it within the macro); we do not de-assign the fileref if it is assigned outside the macro in the calling program.
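For intuition, %EMBED's behavior is compile-time text splicing: the file's contents replace the macro call before anything executes. A hypothetical Python preprocessor illustrates the idea (the embed(...) placeholder syntax here is invented for the sketch):

```python
import re
import tempfile
from pathlib import Path

def splice_embeds(program_text: str) -> str:
    """Replace every embed(<file>) placeholder with that file's contents
    before the program text is used, much like a macro pass."""
    def expand(match: re.Match) -> str:
        return Path(match.group(1)).read_text()
    return re.sub(r"embed\(([^)]+)\)", expand, program_text)

# Demo: the placeholder is gone before "execution time"
with tempfile.TemporaryDirectory() as tmp:
    snippet = Path(tmp) / "query1.sql"
    snippet.write_text("select * from claims")
    print(splice_embeds(f"create table abc as embed({snippet});"))
    # prints: create table abc as select * from claims;
```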

%EMBED macro function usage

While we developed the %EMBED macro function to address the limitations of %INCLUDE for inserting native SQL code into the SAS pass-through clause, its usage is much broader. You can use %EMBED for embedding (including, inserting, injecting, delivering) any code, SAS or non-SAS, anywhere in a SAS program. In essence, %embed is a more advanced and robust alternative to %include.

Let’s explore several scenarios of the %embed macro function usage.

Embedding native DBMS SQL Code into SAS program

Suppose you have native DBMS SQL query code stored in the file C:\project\query1.sql. (The file extension is arbitrary; in this case it just indicates what type of code we are going to embed.)

Then you can use the SQL Procedure Pass-Through Facility of SAS PROC SQL together with the %embed macro function, as in the following example:

proc sql;
   connect to odbc (dsn="db_name");
   create table WORK.ABC as
   select * 
      from connection to odbc ( %embed(C:\project\query1.sql) );
   disconnect from odbc;
quit;

During the compilation phase, the SAS macro processor replaces %embed(C:\project\query1.sql) with the contents of the external file containing DBMS-specific native SQL code (rather than SAS PROC SQL code). Then, during execution, PROC SQL passes this query code on to the DBMS for processing. The DBMS returns the result of the query to SAS, and SAS creates the data table WORK.ABC.

You can also use %embed macro function for "injecting" SQL code from a file into explicit FedSQL pass-through and explicit SQL pass-through in DS2.

Embedding Python or Lua code into SAS program

In PROC PYTHON and PROC LUA, you can use the optional INFILE= clause to reference an external file containing the statements to run within a SAS session. Alternatively, you can use the %embed macro function instead. For example, if your Python code is stored in a file C:\project\program.py, you can place the %embed(C:\project\program.py) macro function call between the SUBMIT and ENDSUBMIT statements of PROC PYTHON:

proc python;
  submit;
    %embed(C:\project\program.py)
  endsubmit;
run;

Similarly for PROC LUA:

proc lua;
  submit;
    %embed(C:\project\program.lua)
  endsubmit;
run;

Embedding R code into SAS program

Here is an example of %embed usage for inserting R code into SAS using PROC IML:

proc iml;
  submit / R;
    %embed(C:\project\program.r)
  endsubmit;
run;

Embedding SAS code into SAS program

Finally, you can embed SAS code from an external file into a SAS program, using %embed instead of the %include statement. For example:

data TWO;
   set ONE;
run;
 
%embed(c:\project\data_three.sas)

Unlike the %include global statement, which can be placed only between SAS statements, %embed can be placed anywhere in a SAS program. You can use it not only within DATA or PROC steps, but also within SAS executable statements. That means %embed can bring in SAS code that is executed conditionally. For example:

data a;
  set b;
  if x>0 then %embed(c:\project\code1.sas);
         else %embed(c:\project\code2.sas);
run;

Obviously, your embedded code must comply with the context of its surroundings.

Conclusion

As you can see, the %embed macro function presents an elegant solution for embedding both "foreign" and native SAS code without cluttering the SAS program. Would you agree that the %embed macro is a more robust alternative to %include?

Questions? Thoughts? Comments?

Do you find this post useful? Do you have questions or your own tips and tricks for embedding other code into your programs? Please share with us below.

Additional resources

Embedding any code anywhere into SAS programs was published on SAS Users.

May 30, 2023
 

Welcome to the continuation of my series Getting Started with Python Integration to SAS Viya. In this post I'll discuss how to load multiple CSV files into memory as a single table using the loadTable action.

Load and prepare data on the CAS server

To start, we need to create multiple CSV files in a folder. I created this script to load and prepare the CSV files on the CAS server for this post. It should work in your environment. The script does the following:

  • Loads the WARRANTY_CLAIMS_0117.sashdat file from the Samples caslib into memory.
  • Modifies the in-memory table by renaming and dropping columns.
  • Adds a sub directory in the Casuser caslib named csv_file_blogs.
  • Saves a CSV file for each distinct year in the csv_file_blogs folder.
  • Drops the original CAS table.
import swat
import pandas as pd
 
## Connect to CAS
conn = swat.CAS()  ## Enter your CAS connection information
 
##
## Data prep functions 
##
def prep_data():
    """
    Load and prepare the WARRANTY_CLAIMS_0117.sashdat file in CAS
    """
    ## Load the WARRANTY_CLAIMS_0117.sashdat from the Samples caslib into memory in Casuser
    conn.loadTable(path='WARRANTY_CLAIMS_0117.sashdat', caslib='samples',
                   casout={'name':'warranty_claims', 
                           'caslib':'casuser',
                           'replace':True})
    ##
    ## DATA PREP
    ## 
 
    ## Reference the CAS table in an object
    castbl = conn.CASTable('warranty_claims', caslib = 'casuser')
 
    ## Store the column names and labels in a dataframe
    df_col_names = castbl.columnInfo()['ColumnInfo'].loc[:,['Column','Label']]
 
    ## Create a list of dictionaries of how to rename each column using the column labels
    renameColumns = []
    for row in df_col_names.iterrows():
        colName = row[1].values[0]
        labelName = row[1].values[1].replace(' ','_')
        renameColumns.append(dict(name=colName, rename=labelName))
 
    ## List of columns to keep in the CAS table
    keepColumns = {'Campaign_Type', 'Platform','Trim_Level','Make','Model_Year','Engine_Model',
                   'Vehicle_Assembly_Plant','Claim_Repair_Start_Date', 'Claim_Repair_End_Date'}
 
    ## Rename and drop columns to make the table easier to use
    castbl.alterTable(columns = renameColumns, keep = keepColumns)
 
    ## Return the CASTable object reference
    return castbl
 
 
 
def save_cas_table_as_csv_files(cas_table_reference):
    """
    Create a subdirectory in Casuser and save multiple CSV files in it.
    """
    ## Create a subdirectory in the Casuser caslib named csv_file_blogs
    conn.addCaslibSubdir(name = 'casuser', path = 'csv_file_blogs')
 
    ## Create a CSV file for each year
    for year in list(cas_table_reference.Model_Year.unique()):
        (cas_table_reference
         .query(f"Model_Year ='{year}'")
         .save(name = f'csv_file_blogs/warranty_claims_{year}.csv', 
               caslib = 'casuser',
               replace = True)
        )
 
    ## Drop the CAS Table
    cas_table_reference.dropTable()
 
    ## View files in the csv_file_blogs subdirectory
    fi = conn.fileInfo(allFiles = True, caslib = 'casuser')
    fi_subdir = conn.fileInfo(path = 'csv_file_blogs', caslib = 'casuser')
    display(fi, fi_subdir)
 
 
## Create the CAS table
castbl = prep_data()
 
## Save the CAS table as a CSV file for each year
save_cas_table_as_csv_files(castbl)
 
## and the results
NOTE: Cloud Analytic Services made the file WARRANTY_CLAIMS_0117.sashdat available as table WARRANTY_CLAIMS in caslib CASUSER(Peter).
NOTE: Cloud Analytic Services saved the file csv_file_blogs/warranty_claims_2015.csv in caslib CASUSER(Peter).
NOTE: Cloud Analytic Services saved the file csv_file_blogs/warranty_claims_2016.csv in caslib CASUSER(Peter).
NOTE: Cloud Analytic Services saved the file csv_file_blogs/warranty_claims_2017.csv in caslib CASUSER(Peter).
NOTE: Cloud Analytic Services saved the file csv_file_blogs/warranty_claims_2018.csv in caslib CASUSER(Peter).
NOTE: Cloud Analytic Services saved the file csv_file_blogs/warranty_claims_2019.csv in caslib CASUSER(Peter).
NOTE: Cloud Analytic Services dropped table warranty_claims from caslib CASUSER(Peter.Styliadis).

The results show that five CSV files named warranty_claims_<year>.csv were created in the csv_file_blogs subdirectory of the Casuser caslib.

Next, I'll use the fileInfo action to view the new csv_file_blogs subdirectory in the Casuser caslib. In the fileInfo CAS action use the includeDirectories parameter to view subdirectories.

conn.fileInfo(includeDirectories = True, caslib = 'casuser')

Lastly, I'll view the available files in the csv_file_blogs subdirectory. To view files in a subdirectory in a caslib add the folder name in the path parameter.

conn.fileInfo(path = 'csv_file_blogs', caslib = 'casuser')

The results show that the subdirectory has five CSV files.

Load all of the CSV files as a single table

To easily load multiple CSV files as a single in-memory CAS table, use the table.loadTable CAS action. The only requirements when loading multiple CSV files are:

  • All of the CSV files must end with .csv.
  • Set the multiFile option to True in the importOptions parameter.
  • The CSV files must have the same number of columns and the columns must have the same data type.

Here, I'll use the loadTable action with the path parameter to specify the subdirectory csv_file_blogs and the caslib parameter to specify the Casuser caslib. The importOptions parameter uses the fileType option to determine the other parameters that apply. The CSV file type has a multiFile option. If you set the multiFile option to True, it will try to load all CSV files from the path into a single in-memory table. The casOut parameter simply specifies output CAS table information.

conn.loadTable(path="csv_file_blogs", caslib = 'casuser',  ## Specify the subdirectory name (csv_file_blogs) and the input caslib name
               importOptions = {                           ## Specify the import options
                   'fileType' : 'CSV',
                   'multiFile' : True
               },
               casOut = {                                  ## Specify output CAS table information
                   'name' : 'allCSVFiles',
                   'caslib' : 'casuser',
                   'replace' : True
               })
 
## and the results
NOTE: The file, '/cas/data/caslibs/casuserlibraries/peter/csv_file_blogs/warranty_claims_2015.csv' was used to create the CAS Table column names.
NOTE: The CSV file table load for table, 'allCSVFiles' produced 153217 rows from 5 files.
NOTE: Cloud Analytic Services made the file csv_file_blogs available as table ALLCSVFILES in caslib CASUSER(Peter).

The action concatenated each CSV file and loaded them as a single distributed CAS table named ALLCSVFILES.
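Conceptually this mirrors client-side concatenation in pandas, sketched below on toy stand-in files (the real multiFile load happens server-side on CAS):

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Toy stand-ins for the per-year CSV files
    for year, rows in [(2015, 2), (2016, 3)]:
        pd.DataFrame({"Model_Year": [year] * rows}).to_csv(
            Path(tmp) / f"warranty_claims_{year}.csv", index=False)

    # Stack every CSV in the folder into one DataFrame, analogous to
    # loadTable with multiFile=True (same columns, compatible types)
    files = sorted(Path(tmp).glob("*.csv"))
    all_csv = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
    print(len(all_csv), "rows from", len(files), "files")  # 5 rows from 2 files
```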

Next, I'll run the tableInfo action to view available in-memory tables in the Casuser caslib.

conn.tableInfo(caslib = 'casuser')

The action results show one CAS table is in memory.

Then I'll make a client-side reference to the distributed CAS table and run the head method from the SWAT package.

allcsvfilesTbl = conn.CASTable('allcsvfiles', caslib = 'casuser')
allcsvfilesTbl.head()

Finally, I'll run the SWAT value_counts method on the Model_Year column.

(allcsvfilesTbl   ## CAS table
 .Model_Year      ## CAS column
 .value_counts()  ## SWAT value_counts method
)
## and the results
2017.0    70479
2018.0    43975
2016.0    32707
2019.0     3510
2015.0     2546
dtype: int64

The results show that all five years of data were imported into the CAS table, one year per CSV file.

Add file name and path columns to the CAS table

The importOptions parameter has a variety of CSV options you can use to modify how to import the files. Two useful options are showFile and showPath. The showFile option includes a column that shows the CSV file name. The showPath option includes a column that shows the fully-qualified path to the CSV file.

conn.loadTable(path="csv_file_blogs", caslib = 'casuser',  ## Specify the subdirectory name (csv_file_blogs) and the input caslib name
               importOptions = {                           ## Specify the import options
                   'fileType' : 'CSV',
                   'multiFile' : True,                     
                   'showFile' : True,
                   'showPath' : True
               },
               casOut = {                                  ## Specify output CAS table information
                   'name' : 'allCSVFiles_path_info',
                   'caslib' : 'casuser',
                   'replace' : True
               })
 
## and the results
NOTE: The file, '/cas/data/caslibs/casuserlibraries/peter/csv_file_blogs/warranty_claims_2015.csv' was used to create the CAS Table column names.
NOTE: The CSV file table load for table, 'allCSVFiles_path_info' produced 153217 rows from 5 files.
NOTE: Cloud Analytic Services made the file csv_file_blogs available as table ALLCSVFILES_PATH_INFO in caslib CASUSER(Peter).

I'll run the tableInfo action to view available CAS tables.

conn.tableInfo(caslib = 'casuser')


The action shows that two CAS tables are now in memory.

I'll reference and preview the new CAS table.

allcsvfiles_path_infoTbl = conn.CASTable('allcsvfiles_path_info', caslib = 'casuser')
allcsvfiles_path_infoTbl.head()

The preview shows the new CAS table has a column named path and a column named fileName.

Lastly, I'll use the SWAT value_counts method to view how many rows came from each CSV file. I'll specify the CAS table reference, the column name, then the value_counts method.

(allcsvfiles_path_infoTbl
 .fileName
 .value_counts()
)
 
## and the results
warranty_claims_2017.csv    70479
warranty_claims_2018.csv    43975
warranty_claims_2016.csv    32707
warranty_claims_2019.csv     3510
warranty_claims_2015.csv     2546
dtype: int64

The results show the CSV files were concatenated into a single CAS table. We can see how many rows came from each file.
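A client-side pandas analogue of showFile and showPath is to tag each frame with its source file while concatenating (again a sketch on toy files):

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Toy per-year files
    for year in (2015, 2016):
        pd.DataFrame({"Model_Year": [year]}).to_csv(
            Path(tmp) / f"warranty_claims_{year}.csv", index=False)

    # assign() adds fileName and path columns, much like showFile/showPath
    frames = [pd.read_csv(f).assign(fileName=f.name, path=str(f))
              for f in sorted(Path(tmp).glob("*.csv"))]
    combined = pd.concat(frames, ignore_index=True)
    print(combined["fileName"].tolist())
    # ['warranty_claims_2015.csv', 'warranty_claims_2016.csv']
```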

Summary

The SWAT package blends the world of pandas and CAS to process your distributed data. In this example I focused on using the table.loadTable CAS action to concatenate multiple CSV files into a single distributed CAS table with a single action call.

Additional and related resources

Getting Started with Python Integration to SAS® Viya® - Part 19 - Load Multiple CSV Files was published on SAS Users.

April 20, 2023
 

Welcome to the continuation of my series Getting Started with Python Integration to SAS Viya. In this post I'll discuss how to update rows in a distributed CAS table.

Load and prepare data in the CAS server

I created a script to load and prepare data in the CAS server. This should work in your environment. The script:

  • connects to the CAS server
  • loads the RAND_RETAILDEMO.sashdat file from the Samples caslib into memory in the Casuser caslib and names the CAS table rand_retaildemo
  • lower cases all column names
  • keeps specific columns
  • creates a new column named age_dup that is a duplicate of the age column
  • previews the new CAS table
## Packages
import swat
import os
import pandas as pd
pd.set_option('display.max_columns', 50)
import numpy as np
 
## General CAS connection syntax
conn = swat.CAS(host, port, username, password) ## Add your connection information
 
## Load the RAND_RETAILDEMO.sashdat file into memory on the CAS server
conn.loadTable(path = 'RAND_RETAILDEMO.sashdat', caslib = 'samples',
               casout = {
                      'name' : 'rand_retaildemo',
                      'caslib' : 'casuser',
                      'replace' : True
               })
 
## Reference the CAS table
retailTbl = conn.CASTable('rand_retaildemo', caslib = 'casuser')
 
## Create a copy of the table with a new column
(retailTbl
 .eval("age_dup = age", inplace = False)          ## create a duplicate of the age column
 .copyTable(casout = {'name':'rand_retaildemo',
                      'caslib':'casuser',
                      'replace':True})
)
 
 
## Create a list of columns to rename 
newColNames = [{'name':col,'rename':col.lower()} for col in retailTbl.columns.to_list()]
 
## List of columns to keep
keepColumns = ['custid','bucket','age','age_dup','loyalty_card','brand_name','channeltype','class']
 
## Rename and keep columns
retailTbl.alterTable(columns = newColNames, 
                     keep = keepColumns)
 
## Preview the new CAS table
display(retailTbl.shape, 
        retailTbl.tableDetails(),
        retailTbl.tableInfo(caslib = 'casuser'),
        retailTbl.head())
 
## and the results
NOTE: Cloud Analytic Services made the file RAND_RETAILDEMO.sashdat available as table RAND_RETAILDEMO in caslib CASUSER(Peter).

The results show:

  • the file RAND_RETAILDEMO.sashdat was loaded into memory in the Casuser caslib as the CAS table RAND_RETAILDEMO
  • the CAS table RAND_RETAILDEMO has 930,046 rows and 8 columns
  • the CAS table is separated into 383 blocks
  • the Casuser caslib has one table loaded into memory (RAND_RETAILDEMO)
  • a five-row preview of the RAND_RETAILDEMO CAS table

Update rows in a table

The preview of the CAS table shows the brand_name, channeltype and class columns contain character values. I want to standardize the casing of the character data in each of those columns by:

  • upper casing all values in the brand_name column
  • lower casing all values in the channeltype column
  • proper casing all values in the class column

I'll use the table.update CAS action to update rows in a CAS table. In the update action, specify the set parameter with a list of dictionaries; each dictionary specifies a column to update and how to update it. In each dictionary, var specifies the column name and value specifies how to update it. The value key requires SAS functions or constants.

retailTbl.update(set = [
    {'var':'brand_name', 'value':'upcase(brand_name)'},
    {'var':'channeltype', 'value':'lowcase(channeltype)'},
    {'var':'class', 'value':'propcase(class)'}
])

The results show I updated all 930,046 rows with the update action. Next, I'll preview the updated CAS table.

retailTbl.head()

The preview shows I updated the case of the brand_name, channeltype and class columns.

View missing values

Next, I'll execute the distinct action on the age and age_dup columns to confirm they have missing values. The age_dup column is simply a copy of the age column.

retailTbl.distinct(inputs = ['age', 'age_dup'])

The results show the columns have 673,447 missing values.
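For comparison, the client-side pandas equivalent of that missing-value check is isna().sum(), sketched here on toy data (the distinct action computes its NMiss counts server-side):

```python
import numpy as np
import pandas as pd

# Toy frame with missing ages
df = pd.DataFrame({"age":     [35.0, np.nan, 52.0, np.nan],
                   "age_dup": [35.0, np.nan, 52.0, np.nan]})

# Missing count per column, like the NMiss column in the distinct output
print(df[["age", "age_dup"]].isna().sum())
```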

Update rows based on a condition

After finding missing values, I've determined I want to replace them with the mean age. I'll start by storing the mean age in the variable meanAge.

meanAge = retailTbl.age.mean().round(3)
meanAge
 
## and the results
43.577

Specifying the rows to update

One way to update values conditionally is by filtering the CAS table for the rows to update. You can do that by using the query method to filter the CAS table for all rows where age is missing, then adding the update action to update the age column with the meanAge variable value.

(retailTbl
 .query("age is null")
 .update(set = [
     {'var':'age', 'value':f'{meanAge}'}])
)

The update action updated 673,447 rows. This number matches the total number of missing values in the age column. Let's run the distinct action to confirm no missing values exist in age.

retailTbl.distinct(inputs = ['age', 'age_dup'])

The distinct action shows the age column does not have any missing values. I'll confirm the unique values in the age column by running the value_counts method.

(retailTbl
 .age
 .value_counts()
)

The value_counts method shows that 673,447 rows have the mean age value 43.577.

Update rows using conditional logic

Another way to update rows is using conditional logic. You can perform conditional logic with the SAS IFN function (IFC for character). The IFN function returns a numeric value based on whether an expression is true, false, or missing. Here I'll specify the IFN function and test if age_dup is missing. If age_dup is missing, return the meanAge value. Otherwise, return the original age_dup value.

(retailTbl
 .update(set = [
     {'var':'age_dup', 'value':f'ifn(age_dup = . , {meanAge}, age_dup)'}])
)

The update action updated all 930,046 rows. All rows are updated because the conditional logic evaluates every row and writes each value back. While this method works, the previous method is a bit more efficient for this scenario since only the necessary rows are updated.
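The two approaches parallel familiar pandas patterns: a filtered .loc assignment writes only to the matching rows, while np.where (like IFN) computes a value for every row. A toy sketch, with a stand-in mean:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age":     [35.0, np.nan, 52.0, np.nan],
                   "age_dup": [35.0, np.nan, 52.0, np.nan]})
mean_age = 43.577  # stand-in for the computed mean age

# Like query("age is null") + update: assign only where the mask is True
df.loc[df["age"].isna(), "age"] = mean_age

# Like ifn(age_dup = ., meanAge, age_dup): evaluates every row
df["age_dup"] = np.where(df["age_dup"].isna(), mean_age, df["age_dup"])

print(df["age"].tolist())      # [35.0, 43.577, 52.0, 43.577]
print(df["age_dup"].tolist())  # [35.0, 43.577, 52.0, 43.577]
```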

Let's confirm no missing values exist in the age_dup column.

retailTbl.distinct(inputs = ['age', 'age_dup'])

The distinct action shows the age_dup column does not have any missing values. I'll confirm the unique values in the age_dup column by running the value_counts method.

(retailTbl
 .age_dup
 .value_counts()
)

The value_counts method shows that 673,447 rows have the mean age value 43.577. These are the expected results.

Save the CAS table as a data source file

Lastly, I'll save the updated CAS table as a data source file using the table.save action. For more information on saving CAS tables check out my previous post Part 17 - Saving CAS tables.

retailTbl.save(name = 'retail_clean.parquet', caslib = 'casuser')
 
## and the results
NOTE: Cloud Analytic Services saved the file retail_clean.parquet in caslib CASUSER(Peter).

 

Summary

The SWAT package blends the world of pandas and CAS to process your distributed data. In this example I focused on using the Python SWAT package to update rows in a CAS table. Updating data in CAS tables is a bit different than updating DataFrames using pandas. Understanding how to update CAS tables enables you to avoid having to create a copy of a large in-memory CAS table, improving speed and conserving resources.

Additional and related resources

Getting Started with Python Integration to SAS® Viya® - Part 18 - Update rows in a table was published on SAS Users.

April 6, 2023
 

Welcome to the continuation of my series Getting Started with Python Integration to SAS Viya. In this post I'll discuss saving CAS tables to a caslib's data source as a file. This is similar to saving pandas DataFrames using to_ methods.

Load and preview the CAS table

First, I imported the necessary packages and connected my Python client to the distributed CAS server, naming my connection conn. Second, I loaded the WARRANTY_CLAIMS_0117.sashdat file from the Samples caslib into memory as a distributed CAS table. The Samples caslib should be available in your environment. Lastly, I referenced the CAS table in the variable castbl and previewed the table.

For more information about loading server-side files into memory check out Part 5 - Loading Server-Side Files into Memory.

## Packages
import swat
import os
import pandas as pd
import numpy as np
 
## Options
pd.set_option('display.max_columns', 50)
 
## Connect to CAS
conn = swat.CAS() ## enter CAS connection information
 
## Load the data into CAS
conn.loadTable(path='WARRANTY_CLAIMS_0117.sashdat', caslib='samples',
               casout={'name':'warranty_claims', 
                       'caslib':'casuser', 
                       'replace':True})
 
## Reference the CAS table in a variable
castbl = conn.CASTable('warranty_claims', caslib = 'casuser')
 
## Preview the CAS table
df = castbl.head()
ci = castbl.columnInfo()
display (df, ci)
 
# and the results
NOTE: Cloud Analytic Services made the file WARRANTY_CLAIMS_0117.sashdat available as table WARRANTY_CLAIMS in caslib CASUSER(Peter).

Results show five rows and the column metadata of the CAS table. Notice the column names are vague but the column labels are detailed.

Prepare the CAS table

After the table is loaded into memory, I'll prepare the CAS table by completing the following:

  • Replace column names with the column labels all lower cased and spaces replaced with underscores.
  • Drop unnecessary columns.
  • Create a new column named car_id that concatenates the make, product_model and platform columns.
  • Display the parameters of the CASTable object castbl, and preview five rows.

I won't go into detail of the code. I use a combination of familiar Python and SAS Viya techniques from Part 8 - Creating Calculated Columns and Part 11 - Rename Columns.

##
## RENAME AND DROP COLUMNS
## Use the current column labels as the column names
##
 
## Store the current column names and labels in a dataframe
df_col_names = castbl.columnInfo()['ColumnInfo'].loc[:,['Column','Label']]
 
## Create a list of dictionaries of how to rename each column using the column labels
renameColumns = []
for row in df_col_names.iterrows():
 
    ## Store current column name
    colName = row[1].values[0]
 
    ## Store current column label and clean up
    labelName = row[1].values[1].replace(' ','_').lower()
 
    ## Create a dictionary of the column to rename, and how to rename it
    columnDict = dict(name = colName, 
                      rename = labelName)
 
    ## Add a date format to columns if the column name contains the word date
    if columnDict['rename'].find('date') > 0:
        columnDict['format'] = 'DATE9.'  
 
    ## Create a list of dictionaries, one dictionary for each column
    renameColumns.append(columnDict)
 
## List of columns to drop using the updated columns names
dropColumns = ['defect_code','repairing_division','usage_value','campaign_type','customer_country','ship_year',
               'product_unit_assembly_date','primary_replaced_material_group_code','primary_labor_group_code',
               'selling_dealer','vehicle_class','ship_date', 'total_claim_count', 'primary_labor_code', 'defect_key', 'primary_replaced__material_id']
 
## Apply new column names and drop columns to the CAS table
castbl.alterTable(columns = renameColumns,
                  drop = dropColumns)
 
 
##
## CREATE A NEW COLUMN
##
 
## Create a new column in the table by combining make, model and platform
castbl.eval("car_id = strip(make)|| '-' || strip(product_model) || '-' || strip(platform)")
 
 
##
## PREVIEW THE CHANGES 
##
 
## View the CASTable object parameters and 5 rows
display(castbl, castbl.head())

The CASTable object stores parameters on how to create the new CAS table and the table preview shows the updated column names.

Create a new CAS table

Once the castbl CASTable object is ready,  I'll create a new CAS table named warranty_final in the Casuser caslib using the copyTable action (in Viya 3.5 use partition instead).

## Create a new CAS table for the final table
castbl.copyTable(casout = {'name':'warranty_final', 
                           'caslib':'casuser', 
                           'replace':True})

 

Since this will be the final table, I'll reorganize the columns to make it easier to work with. I'll reference the warranty_final CAS table in the variable finalTbl, then rearrange the columns and preview the table.

## Reference the new CAS table and reorganize columns
finalTbl = conn.CASTable('warranty_final', caslib = 'casuser')
 
## Specify the order of the columns
newColumnOrder = ['warranty_claim_id', 'car_id', 'make', 'product_model', 'model_year', 'platform', 'trim_level', 'engine_model','vehicle_assembly_plant',
'primary_labor_group', 'primary_labor_description', 'gross_claim_amount', 'gross_labor_amount', 'gross_material_amount', 'gross_other_amount',
'product_unit_id', 'repairing_state_province', 'repairing_region', 'repairing_country', 'repairing_dealer', 'latitude', 'longitude',
'claim_repair_start_date', 'claim_repair_end_date', 'claim_processed_date', 'claim_submitted_date','service_year_date']
 
finalTbl.alterTable(columnOrder = newColumnOrder)
 
## Preview the CAS table
finalTbl.head()

Save the CAS table as a file

Just like pandas DataFrames, CAS tables are in-memory. There are times you will want to save them back to disk. In CAS use the table.save CAS action to save the in-memory CAS table back to disk as a file (or back to a database). The save action is similar to saving a DataFrame using pandas to_ methods. For example, I'll save the warranty_final CAS table as a CSV file named warranty_final.csv in the Casuser caslib.

finalTbl.save(name = 'warranty_final.csv', caslib = 'casuser')
 
# and the results
NOTE: Cloud Analytic Services saved the file warranty_final.csv in caslib CASUSER(Peter).

The save action enables you to easily save the CAS table as other file formats. Here I'll save it as a PARQUET file.

finalTbl.save(name = 'warranty_final.parquet', caslib = 'casuser')
 
# and the results
NOTE: Cloud Analytic Services saved the file warranty_final.parquet in caslib CASUSER(Peter).

You can also save the table as a SASHDAT file. A SASHDAT file is an on-disk copy of an in-memory distributed CAS table. SASHDAT files also store additional SAS information like column labels and formats.

finalTbl.save(name = 'warranty_final.sashdat', caslib = 'casuser')
 
# and the results
NOTE: Cloud Analytic Services saved the file warranty_final.sashdat in caslib CASUSER(Peter).
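For comparison, the pandas to_ methods that the save action resembles look like this on the client. This is a sketch with toy data and a temporary directory; the parquet line is commented out because it requires pyarrow or fastparquet to be installed:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy data standing in for the warranty table
df = pd.DataFrame({'car_id': ['A-1', 'B-2'], 'gross_claim_amount': [120.5, 89.0]})

out = Path(tempfile.mkdtemp())

# pandas analogues of table.save: one to_ method per file format
df.to_csv(out / 'warranty_final.csv', index=False)
# df.to_parquet(out / 'warranty_final.parquet')   # requires pyarrow or fastparquet

print((out / 'warranty_final.csv').exists())
```

The key difference is that the pandas methods write from client memory, while the save action writes from the distributed CAS server directly to the caslib's data source.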

I'll run the fileInfo action to view available files in the Casuser caslib. I'll use the % wildcard character to find all files that begin with warranty_final. Then I'll manipulate the CASResults object to access the SASDataFrame and select the Permission, Name and Time columns. For more information on working with results from CAS actions, check out Part 2 - Working with CAS Actions and CASResults Objects.

(conn.fileInfo(caslib = 'casuser', path='warranty_final%')['FileInfo']
 .loc[:,['Permission','Name', 'Time']])

Summary

The SWAT package blends the world of pandas and CAS to process your distributed data in SAS Viya. In this example you learned how to save in-memory CAS tables back to disk using the save CAS action. The save action provides additional parameters that you can use to modify how you save your tables. Using the save CAS Action is similar to using a to_ method in pandas to save a DataFrame as a file.

Additional and related resources

 

Getting Started with Python Integration to SAS® Viya® - Part 17 - Saving CAS tables was published on SAS Users.

4月 032023
 

Welcome to the continuation of my series Getting Started with Python Integration to SAS Viya. In this post I'll discuss how to execute SQL with the Python SWAT package in the distributed CAS server.

Prepare and load data to the CAS server

I created a Python function named createDemoData to prepare and load data to the CAS server. The function takes your CAS connection object as the parameter. The function will:

  • load the RAND_RETAILDEMO.sashdat file from the Samples caslib into memory in the Casuser caslib. This is the main table.
  • create a DataFrame on the Python client and load it to the CAS server as the table LOYALTY_LOOKUP_TBL in the Casuser caslib. This is the lookup table.
  • run the tableInfo CAS action to confirm the CAS tables RAND_RETAILDEMO  and LOYALTY_LOOKUP_TBL  are in-memory in the Casuser caslib.
## Import packages
import swat
import os
import pandas as pd
import numpy as np
 
conn = ## Specify your CAS connection information
 
 
def createDemoData(connectionObject):
    '''Specify your CAS connection object as the parameter'''
 
    connectionObject.loadTable(path = 'RAND_RETAILDEMO.sashdat', caslib = 'samples',
                               casout = {
                                   'caslib':'casuser',
                                   'replace':True
                               })
 
    myData = {
        'loyalty_id':[0,1],
        'loyalty_value':['No','Yes'],
    }
 
    df = pd.DataFrame(myData)
    connectionObject.upload_frame(df, 
                                  casout = {
                                        'name':'loyalty_lookup_tbl',
                                        'caslib':'casuser',
                                        'replace':True
                                   })
 
    tableInfo = connectionObject.tableInfo(caslib = 'casuser')
    return tableInfo
 
 
## Execute function using CAS connection object
createDemoData(conn) 
 
 
# and the results
NOTE: Cloud Analytic Services made the file RAND_RETAILDEMO.sashdat available as table RAND_RETAILDEMO in caslib CASUSER(Peter).
NOTE: Cloud Analytic Services made the uploaded file available as table LOYALTY_LOOKUP_TBL in caslib CASUSER(Peter).
NOTE: The table LOYALTY_LOOKUP_TBL has been created in caslib CASUSER(Peter) from binary data uploaded to Cloud Analytic Services.

Load the FedSQL CAS action set

To use SQL with the Python SWAT package in the CAS server you will need to load the FedSQL CAS action set.

conn.loadActionSet('fedSQL')

To view more information about the FedSQL action set you can run the help method.

conn.help(actionset='fedSQL')

The FedSQL action set contains a single action named execDirect. We can use this action to submit SQL queries to the distributed CAS server.
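Because execDirect takes the query as a plain Python string, you can also build that string dynamically, for example with an f-string. The table name, caslib and row limit below are just illustrative values:

```python
# Build a FedSQL query string client-side before submitting it with execDirect
table = 'RAND_RETAILDEMO'
caslib = 'casuser'
n_rows = 5

query = f'''
    select *
        from {caslib}.{table}
        limit {n_rows};
'''
print(query)
```

You would then submit the string with conn.execDirect(query = query), exactly as in the examples that follow.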

Perform a simple query

Let's preview our CAS tables using a simple query.  First I'll view 5 rows from the RAND_RETAILDEMO CAS table from the Casuser caslib. I'll store my query as a string in the variable previewRetail. Within the string I'll add the SELECT, FROM and LIMIT clauses. In the FROM clause I'll use a two-level naming convention to specify the table. This specifies the caslib and CAS table name. Lastly, execute the execDirect action and add the query parameter with the value previewRetail (the query as a string).

previewRetail = '''
    select *
        from casuser.RAND_RETAILDEMO
        limit 5;
'''
conn.execDirect(query = previewRetail)

The query returns five rows from the RAND_RETAILDEMO CAS table from the Casuser caslib. Notice the loyalty_card column. It contains a 0 and 1. We will use that column as our join criteria to obtain the loyalty values later in the post.

I'll do the same thing to preview the LOYALTY_LOOKUP_TBL CAS table from the Casuser caslib.

previewLoyalty = '''
    select *
        from casuser.LOYALTY_LOOKUP_TBL
        limit 5;
'''
conn.execDirect(query = previewLoyalty)

This query returns two rows, one for each loyalty_id. We will use this column to join with the loyalty_card column from the previous table to obtain the loyalty_value.

Perform a join

Let's perform an inner join using the following:

  • the RAND_RETAILDEMO CAS table from the Casuser caslib using the loyalty_card column
  • the LOYALTY_LOOKUP_TBL CAS table from the Casuser caslib using the loyalty_id column
  • keep the CustID, age, Department, loyalty_card and loyalty_value columns
  • limit the join to five rows to confirm it works as expected
joinData = '''
    select retail.CustID,
           retail.age,
           retail.Department,
           retail.loyalty_card,
           loyalty.loyalty_value
        from casuser.RAND_RETAILDEMO as retail inner join
             casuser.LOYALTY_LOOKUP_TBL as loyalty
        on retail.loyalty_card = loyalty.loyalty_id
        limit 5;
'''
conn.execDirect(query = joinData)

The results show that the join was successful and returns a CASResults object to the Python client.
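The same join expressed in pandas on the client would be a merge. Here's a sketch with toy data; the values are hypothetical stand-ins, not the real retail table:

```python
import pandas as pd

# Toy stand-ins for the two CAS tables
retail = pd.DataFrame({'CustID': [1, 2, 3], 'loyalty_card': [0, 1, 1]})
loyalty = pd.DataFrame({'loyalty_id': [0, 1], 'loyalty_value': ['No', 'Yes']})

# Inner join on loyalty_card = loyalty_id, like the FedSQL query above
joined = retail.merge(loyalty, left_on='loyalty_card',
                      right_on='loyalty_id', how='inner')

print(joined[['CustID', 'loyalty_card', 'loyalty_value']])
```

The difference is where the work happens: merge runs on the Python client, while execDirect pushes the join to the distributed CAS server.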

Create a new CAS table using the join

Now that we know the join works, let's join the full table and create a new CAS table with the results. I'll create a new CAS table named LOYALTY_FINAL and place it in the Casuser caslib. I'll add the replace=True option to replace the CAS table if it already exists.

createTable = '''
    create table loyalty_final{options replace=True} as
    select retail.CustID,
           retail.age,
           retail.Department,
           retail.loyalty_card,
           loyalty.loyalty_value
        from casuser.RAND_RETAILDEMO as retail inner join
             casuser.LOYALTY_LOOKUP_TBL as loyalty
        on retail.loyalty_card = loyalty.loyalty_id;
'''
 
conn.execDirect(query = createTable)
 
# and the results
NOTE: Table LOYALTY_FINAL was created in caslib CASUSER(Peter) with 930046 rows returned.

The results return a note that the new table was created successfully. One thing to know when joining and creating tables with SQL in CAS is that the CAS tables are not required to be in the same caslib. Just make sure to use the two-level naming convention to specify the exact tables you want.

Next, I'll execute the tableInfo action to view available CAS tables.

conn.tableInfo(caslib = 'casuser')


The results show that there are three CAS tables available. The original two and the newly joined CAS table LOYALTY_FINAL. I'll preview the LOYALTY_FINAL CAS table.

loyaltyInfo = conn.CASTable('LOYALTY_FINAL',caslib = 'casuser')
loyaltyInfo.head()

Summary

The SWAT package blends the world of pandas and CAS to process your distributed data. In this example I focused on using Python to execute FedSQL in the CAS server. Being able to execute SQL with Python in the CAS server gives you another set of tools when working with your distributed data.

Additional and related resources

Getting Started with Python Integration to SAS® Viya® - Part 16 - Execute SQL was published on SAS Users.

3月 222023
 

Welcome to the continuation of my series Getting Started with Python Integration to SAS Viya. In this post I'll discuss how to count missing values in a CAS table using the Python SWAT package.

Load and prepare data

First, I connect my Python client to the distributed CAS server and name my connection conn. Next, I create a small pandas DataFrame with missing values for the demonstration. Then I load the DataFrame to the CAS server using the upload_frame method, name the CAS table MISSING_DATA and place it in the Casuser caslib. The upload_frame method returns a reference to the CAS table in the variable castbl.

For more information about uploading client-side data to the CAS server, check out my previous post.

conn = ## your CAS connection information
 
## Create a simple dataframe
df = pd.DataFrame([
            [np.nan, 2, 45, 0, 'A'],
            [3, 4, np.nan, 1,'A'],
            [np.nan, np.nan, 50, np.nan,'B'],
            [np.nan, 3, np.nan, 4,],
            [2, 2, np.nan, 0, 'A'],
            [3, 4, np.nan, 1,'A'],
            [np.nan, np.nan, 75, np.nan,'B'],
            [np.nan, 3, 60, 4,]
            ],
            columns=['col1','col2','col3','col4','col5'])
 
## Upload the dataframe to the CAS server as a CAS table
castbl = conn.upload_frame(df,
                           casout = {'name':'missing_data', 
                                     'caslib':'casuser', 
                                     'replace':True})
 
# and the results
NOTE: Cloud Analytic Services made the uploaded file available as table MISSING_DATA in caslib CASUSER(Peter).
NOTE: The table MISSING_DATA has been created in caslib CASUSER(Peter) from binary data uploaded to Cloud Analytic Services.

The results show that the MISSING_DATA CAS table has been loaded to the CAS server.

Lastly, I'll preview the distributed CAS table using the SWAT package head method.

castbl.head(10)

Count missing values in a CAS table

There are a variety of ways to count the number of missing values in a CAS table. Counting missing values in CAS tables with the SWAT package is not exactly the same as in pandas, but it's just as easy. Let's look at a few different methods.

Using the SWAT package nmiss method

For users experienced with pandas, you traditionally run the isna and sum methods to find the number of missing values in each column of a DataFrame. For CAS tables it's even easier. You can use the SWAT package nmiss method. The nmiss method returns the total number of missing values in each column.
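For reference, the traditional pandas pattern that nmiss replaces looks like this on the client, using a small DataFrame similar to the demo data above:

```python
import numpy as np
import pandas as pd

# Small DataFrame with missing values, similar to the demo table
df = pd.DataFrame({'col1': [np.nan, 3, np.nan], 'col2': [2, 4, np.nan]})

# Traditional pandas: flag missing cells, then sum per column
missing_counts = df.isna().sum()
print(missing_counts)
```

With a CAS table, nmiss collapses these two steps into one, and the counting happens in the distributed server rather than on the client.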

Here I'll specify the CAS table object castbl, then the SWAT nmiss method.

castbl.nmiss()

The distributed CAS server counts the number of missing values in each column and returns a Series to the Python client. Once you have the Series object you can use traditional pandas for additional processing. For example, I'll chain the pandas plot method after the SWAT nmiss method to plot the series. The summarization occurs in the distributed CAS server and the plotting occurs on the client.

(castbl                    ## CAS table
 .nmiss()                  ## SWAT method
 .plot(kind = 'bar'))      ## pandas plot

You can also use the nmiss method to find the number of missing values in specific columns. Simply specify the CASTable object, the columns, then the method.

colNames = ['col1','col5']
 
castbl[colNames].nmiss()

Using the Distinct CAS action

The distinct CAS action is one of my favorite actions. It not only finds the number of missing values, but it also gives you the number of distinct values in each column. Depending on the size of your data and the number of distinct values, this action can be more resource intensive, since counting distinct values in a column takes more work than counting missing values.

To use the distinct CAS action specify the CASTable object, castbl, then the distinct CAS action.

castbl.distinct()

The distinct action returns a CASResults object (dictionary) back to the Python client with the number of distinct and missing values.

In the distinct action you can use the inputs parameter to specify the columns you want to summarize.

castbl.distinct(inputs = colNames)

The results show the number of distinct and missing values in the specified columns. With the summarized data from the CAS server in a CASResults object, you can use your Python client to continue working with the data. For more information on working with CASResults objects, check out my previous post.
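A client-side pandas analogue of the distinct action's output can be built from nunique and isna. This is toy data for illustration; the real action runs in the distributed CAS server:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [np.nan, 2, 3, 3], 'col5': ['A', 'A', 'B', None]})

# NDistinct excludes missing values, matching nunique's default behavior
summary = pd.DataFrame({'NDistinct': df.nunique(), 'NMiss': df.isna().sum()})
print(summary)
```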

Using the summary CAS action

You can also use the summary CAS action to find the number of missing values for numeric columns. The summary action will also generate a variety of descriptive statistics such as the mean, variance, size, sum of squares and more.

castbl.summary()

The results show the descriptive statistics. The NMiss column contains the number of missing values for the numeric columns.

Within the summary action you can also specify the columns to analyze with the inputs parameter, and the summary statistics to generate with the subSet parameter.

castbl.summary(inputs = ['col1','col2'], 
               subSet = ['min','max','nmiss'])

Summary

The SWAT package blends the world of pandas and CAS to process your distributed data. In this example I focused on using specific SWAT methods and CAS actions to count the number of missing values in CAS table columns.

Additional and related resources

Getting Started with Python Integration to SAS® Viya® - Part 15 - Count Missing Values in a CAS Table was published on SAS Users.

2月 142023
 

Welcome to the continuation of my series Getting Started with Python Integration to SAS Viya. In this post I'll discuss how to bring a distributed CAS table back to your Python client as a DataFrame.

In this example, I'm using Python on my laptop (Python client) to connect to the CAS server. Bringing a distributed CAS table back to the Python client as a local DataFrame is typically appropriate for smaller data. If the data sets become larger, it is more efficient to process the data in the CAS server, or bring a smaller subset of data back to the client. The big question is, why would I want to bring a CAS table out of the CAS server's massively parallel processing (MPP) environment back to my client as a DataFrame?

 

The CAS server in SAS Viya is set up to connect to a variety of data sources throughout your organization like databases, folder paths, cloud data and more. Typically these connections are set up by your administrator, which makes it easy to access data. In this example, maybe the data you need is available to the CAS server and is small enough that you don't need the power of the CAS server's MPP environment. Maybe you simply want to pull a CAS table back to your client as a DataFrame and use a familiar Python package like numpy, pandas, seaborn, matplotlib, or scikit-learn for data exploration, preprocessing, visualization or machine learning.

Use caution when bringing data back to your Python client

With the to_frame SWAT method you can easily transfer a CAS table to your Python client as a DataFrame. When using the to_frame method YOU MUST USE CAUTION. It will attempt to pull all of the data down from the CAS server regardless of size. If the data is large, this can be time consuming or overwhelm your Python client depending on your available memory.

Load the demonstration data into memory

I've created a connection to my CAS server in the variable conn. I'll use my conn connection object and the SWAT upload_file method to load the cars.csv file from the Example Data Sets for the SAS® Viya® Platform website into memory on the CAS server. I'll add some formats and labels to the CAS table. This is a small table for demonstration purposes.

fileurl = 'https://support.sas.com/documentation/onlinedoc/viya/exampledatasets/cars.csv'
castbl = conn.upload_file(fileurl, 
                          casout = {'name':'cars_cas_table', 
                                    'caslib':'casuser', 
                                    'replace':True},
                          importoptions = {
                              'fileType':'csv',
                              'guessRows':100,
                              'vars': {
                                  'Invoice':{'format':'dollar16.2','label':'Invoice Price'},
                                  'MSRP':{'format':'dollar16.2', 'label':'Manufacturer Suggested Retail Price'},
                                  'Weight':{'format':'comma16.'},
                                  'EngineSize':{'label':'Engine Size (L)'},
                                  'Length':{'label':'Length (IN)'},
                                  'MPG_City':{'label':'MPG (City)'},
                                  'MPG_Highway':{'label':'MPG (Highway)'}
                              }
                          })
 
# and the results
NOTE: Cloud Analytic Services made the uploaded file available as table CARS_CAS_TABLE in caslib CASUSER(Peter).
NOTE: The table CARS_CAS_TABLE has been created in caslib CASUSER(Peter) from binary data uploaded to Cloud Analytic Services.

Next, I'll execute the tableInfo action to confirm the table was loaded into memory.

conn.tableInfo(caslib = 'casuser')


Lastly I'll execute the columnInfo action to view the column information of the CAS table.

castbl.columnInfo()

 

The results above show the CARS_CAS_TABLE has 15 columns. Some of the columns contain SAS labels and SAS formats.

Pull the entire CAS table to the client as a DataFrame

We saw earlier that the CARS_CAS_TABLE is a small in-memory table on the CAS server. Let's pretend the data was connected to some cloud data storage and was loaded into memory. I noticed that the data is small, and I want to use familiar Python packages to process it.

The first thing I can do is use the to_frame SWAT method on my castbl object to convert the CAS table to a DataFrame. I'll store the result in df and then view its type.

df = castbl.to_frame()
display(type(df))
 
# and the results
swat.dataframe.SASDataFrame

 

The results show I now have a SASDataFrame. A SASDataFrame lives on the Python client and is a subclass of pandas.DataFrame. Therefore, anything you can do with a pandas.DataFrame will also work with a SASDataFrame. The only difference is that a SASDataFrame object contains extra metadata from the CAS table. For more information on CASTable vs DataFrame vs SASDataFrame, check out the SWAT documentation.
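As a rough illustration of the idea (not the real swat implementation), a pandas.DataFrame subclass can carry extra metadata while still behaving like an ordinary DataFrame:

```python
import pandas as pd

# Minimal sketch: a DataFrame subclass with an extra metadata attribute,
# loosely mimicking how SASDataFrame adds column info on top of pandas
class MetaDataFrame(pd.DataFrame):
    _metadata = ['colinfo']

df = MetaDataFrame({'EngineSize': [3.5, 2.0]})
df.colinfo = {'EngineSize': {'label': 'Engine Size (L)'}}

# Everything that works on a pandas DataFrame still works
print(isinstance(df, pd.DataFrame), df['EngineSize'].mean())
```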

You can view the extra metadata of a SASDataFrame with the colinfo attribute.

df.colinfo

The results show the SASDataFrame contains a variety of information about the columns like the data type, name and width. If the column contains a SAS label or format, that is also stored.

Once you have the SASDataFrame on your Python client, you can use pandas. For example, here I'll execute the pandas plot method to create a scatter plot of EngineSize by MPG_City.

df.plot.scatter(x = 'EngineSize', y = 'MPG_City', title = "Engine Size by MPG City");

The results show a simple scatter plot using pandas.

Apply column labels to a SASDataFrame

Sometimes you might want to use column labels, as they are more descriptive. Traditional pandas DataFrames don't have the concept of labels, but CAS tables do. CAS enables you to add more descriptive labels to columns. Since a SASDataFrame stores information from the CAS table, you can use the SWAT apply_labels method to apply labels to the SASDataFrame. For more information on adding labels to a CAS table, check out my previous post Part 11 - Rename Columns.

df.apply_labels()

Notice the apply_labels method applies the SAS column labels (Manufacturer Suggested Retail Price, Invoice Price, Engine Size (L), Length (IN), MPG (City) and MPG (Highway)) to the SASDataFrame.

Pull a sample of the CAS table to the client as a DataFrame

What if the CAS table is big, but you want to work with a sample of the data using a specific Python package? Well, you can use the sample_pct parameter in the to_frame SWAT method to pull a smaller subset of data to your Python client.

Here I'll specify I want ten percent of the data from the CAS table. I'll also add the sample_seed parameter to pull the same random subset of data. Then I'll execute the shape method to count the number of rows and columns in the new SASDataFrame.

df = castbl.to_frame(sample_pct = .1, sample_seed = 99)
df.shape
 
# and the results
(43, 15)

The results show that only 43 rows of data were returned from the CAS table as a SASDataFrame on the Python client.
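Server-side sampling with sample_pct and sample_seed parallels the pandas sample method on the client. A sketch with toy data:

```python
import pandas as pd

# Client-side analogue: take a reproducible 10% sample of a DataFrame
df = pd.DataFrame({'x': range(100)})
sampled = df.sample(frac=0.1, random_state=99)

print(sampled.shape)
```

The important difference is that to_frame samples in the CAS server before transferring, so only the sampled rows ever cross the network to the client.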

Apply SAS formats to the SASDataFrame

Lastly, what if you want to apply the SAS formats of a CAS table to your SASDataFrame? You can do that using the format parameter in the SWAT to_frame method, as seen in the code below. I'll also display five rows of the SASDataFrame and the data types of the columns.

df_formats = castbl.to_frame(format=True)
 
display(df_formats.head(), df_formats.dtypes)

The results show the DOLLAR format was applied to the MSRP and Invoice columns, and the COMMA format was applied to the Weight column.

The dtypes attribute shows that when using the format parameter in the to_frame method, all columns are returned to your Python client as objects, even if they do not contain a format.
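If you need numeric values back after pulling formatted data, a client-side cleanup with pandas works. This is a sketch with hypothetical DOLLAR-formatted strings like those a formatted column might contain:

```python
import pandas as pd

# Hypothetical formatted strings like those returned with format=True
df_formats = pd.DataFrame({'MSRP': ['$36,945.00', '$23,820.00']})

# Strip the DOLLAR formatting characters, then convert back to numbers
df_formats['MSRP_num'] = pd.to_numeric(
    df_formats['MSRP'].str.replace(r'[$,]', '', regex=True)
)
print(df_formats['MSRP_num'].tolist())
```

Of course, if you know up front that you need numerics, it is simpler to call to_frame without the format parameter.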

Summary

The SWAT package blends the world of Pandas and CAS. It enables you to use the CAS server's massively parallel processing engine for data exploration, preprocessing and analytics. It also enables you to easily transfer your distributed CAS tables back to your Python client as SASDataFrames for additional processing using other Python packages.

The one thing you must remember when using the to_frame method is that transferring large data can be time consuming or take up all of your Python client's resources. USE IT WITH CAUTION.

Additional and related resources

Getting Started with Python Integration to SAS® Viya® - Part 14 - CAS Table to DataFrame was published on SAS Users.

2月 072023
 

Welcome to the continuation of my series Getting Started with Python Integration to SAS Viya. In previous posts, I discussed how to connect to the CAS server, how to execute CAS actions, and how your data is organized on the CAS server. In this post I'll discuss loading client-side CSV files into CAS.

Loading data from a client-side file is appropriate for smaller data. If the data sets become larger, it is more efficient to use a server-side load with the loadTable action to access and load the data. For more information, check out my previous post on loading server-side files into memory.

In this example, I'm using Python on my laptop to connect to the CAS server. Client-side files are files on my laptop (the client). In these examples I'll load the heart.csv file from the SAS® Viya® Example Data Sets webpage to the CAS server. There are multiple ways to load client-side CSV files into CAS. I'll show you how to do it using the familiar Pandas API in the SWAT package with the read_csv method. I'll also show you how to use the upload_file and upload_frame SWAT methods.

Using the Pandas API in the SWAT package - read_csv

First, I'll begin with the familiar read_csv method from the SWAT package and store the results in castbl. The SWAT read_csv method calls the Pandas read_csv method, creates a client-side DataFrame, then uploads the DataFrame to the CAS server as an in-memory table. The main difference is the casout parameter. The casout parameter is specific to the SWAT package. It enables you to specify the output CAS table information like the name of the new distributed table and the in-memory location. Here, I'll name the CAS table heart_read_csv and place it in the Casuser caslib. I'll also specify the replace parameter to replace the CAS table if it already exists.

castbl = conn.read_csv(r'https://support.sas.com/documentation/onlinedoc/viya/exampledatasets/heart.csv',
                      casout = {'name':'heart_read_csv', 
                                'caslib':'casuser', 
                                'replace':True})
 
# and the results
NOTE: Cloud Analytic Services made the uploaded file available as table HEART_READ_CSV in caslib CASUSER(Peter).
NOTE: The table HEART_READ_CSV has been created in caslib CASUSER(Peter) from binary data uploaded to Cloud Analytic Services.

Next, I'll execute the tableInfo CAS action to view available CAS tables in the Casuser caslib.

conn.tableInfo(caslib = 'casuser')

The results above show that the table was loaded to the CAS server.

Let's view the type and value of the castbl object.

display(type(castbl), castbl)
 
# and the results
swat.cas.table.CASTable
CASTable('HEART_READ_CSV', caslib='CASUSER(Peter)')

The results show that castbl is a CASTable object and simply references the CAS table on the CAS server.

Lastly, I'll preview  the CAS table with the SWAT head method.

castbl.head()

The SWAT head method returns 5 rows from the CAS table to the client.

The read_csv method enables you to use all of the available Pandas parsers through the SWAT package to upload data to the distributed CAS server. The read_csv method parses the data on the client into a DataFrame, and then uploads the DataFrame to the CAS server.

Using the upload_file method in the SWAT package

Instead of using the Pandas API through the SWAT package, you can also use the upload_file SWAT method. The upload_file method transfers the file to the CAS server, where all parsing is done on the server. The upload_file method is not as robust as the read_csv method, but it can be a bit faster than client-side parsing. The upload_file method can also be used to upload other file types to the CAS server, not just CSV files.

Here, I'll load the same CSV file to the CAS server as in the previous example. This time I'll use upload_file method and name the CAS table heart_upload_file. I'll store the results in castbl2.

castbl2 = conn.upload_file(r'https://support.sas.com/documentation/onlinedoc/viya/exampledatasets/heart.csv', 
                           casout = {'name':'heart_upload_file', 
                                     'caslib':'casuser',
                                     'replace':True})     
 
# and the results
NOTE: Cloud Analytic Services made the uploaded file available as table HEART_UPLOAD_FILE in caslib CASUSER(Peter).
NOTE: The table HEART_UPLOAD_FILE has been created in caslib CASUSER(Peter) from binary data uploaded to Cloud Analytic Services.

The results show that the CSV file was uploaded to CAS successfully.

I'll view the available CAS table using the tableInfo action.

conn.tableInfo(caslib = 'casuser')


The results show that now I have two CAS tables in memory.

Using the SWAT upload_frame method

Another useful SWAT method is the upload_frame method. I like using this method if I'm preparing a DataFrame on my client using Pandas, then need to transfer the Pandas DataFrame to the CAS server for additional processing or to use another SAS Viya application like SAS Visual Analytics.

For example, here is some traditional Pandas code to read the same CSV file as before and then prepare it using traditional Pandas. The code is renaming columns, creating calculated columns and dropping a column.

## Read the data into a DataFrame
df_raw = pd.read_csv(r'https://support.sas.com/documentation/onlinedoc/viya/exampledatasets/heart.csv')
 
## Prepare the DataFrame
df = (df_raw
      .rename(columns = lambda colName: colName.upper())
      .assign(
         STATUS = lambda _df: _df.STATUS.str.upper(),
         DEATHCAUSE = lambda _df: _df.DEATHCAUSE.fillna('Still Alive').str.lower()
        )
     .drop('AGEATSTART', axis=1)
)
df.head()

Now that I have the final DataFrame, I can simply upload it to the CAS server using the upload_frame method. Again, the casout parameter specifies output CAS table information.

castbl3 = conn.upload_frame(df, casout = {'name':'heart_upload_frame', 
                                          'caslib':'casuser', 
                                          'replace':True})
 
# and the results
NOTE: Cloud Analytic Services made the uploaded file available as table HEART_UPLOAD_FRAME in caslib CASUSER(Peter).
NOTE: The table HEART_UPLOAD_FRAME has been created in caslib CASUSER(Peter) from binary data uploaded to Cloud Analytic Services.

The results show that the Pandas DataFrame was successfully uploaded to the CAS server.

Lastly, I'll view the available CAS tables.

conn.tableInfo(caslib = 'casuser')

Summary

In this post I discussed using the read_csv, upload_file and upload_frame methods for loading client-side CSV files into CAS. Depending on the size of your data and your parsing needs, you may consider one method over the other.

Loading data from the client side into memory onto the CAS server will be slower than loading server-side files into memory. Remember, server-side files are data sources that the CAS server has direct access to like a network path or database. Client-side files are available on your Python client.  Client-side data loading is intended for smaller data sets. For more information check out the Client-Side Data Files and Sources section in the SWAT documentation page.

Additional and related resources

Getting started with Python integration to SAS® Viya® - Part 13 - Loading a Client-Side CSV File into CAS was published on SAS Users.

1月 172023
 

Welcome to the continuation of my series Getting Started with Python Integration to SAS Viya. In previous posts, I discussed how to connect to the CAS server, how to work with CAS actions and CASResults objects, and how to summarize columns. Now it's time to focus on how to get the count of unique values in a CAS table column.

Load and prepare data

First, I connected my Python client to the distributed CAS server and named my connection conn. Then I created a function to load and prepare my CAS table. The custom function loads the WARRANTY_CLAIMS_0117.sashdat file from the Samples caslib into memory, renames the columns using the column labels and drops unnecessary columns. This simplifies the table for the demonstration.

The Samples caslib should be available in your SAS Viya environment and contains sample tables. For more information on how to rename columns in a CAS table, see Part 11 - Rename Columns.

## Packages
import swat
import pandas as pd
 
## Options
pd.set_option('display.max_columns', 50)
 
## Connect to CAS
conn = ## your connection information
 
def prep_data():
    ## Load the data into CAS
    conn.loadTable(path='WARRANTY_CLAIMS_0117.sashdat', caslib='samples',
                   casout={'name':'warranty_claims', 'caslib':'casuser'})
 
    ## Reference the CAS table in an object
    castbl = conn.CASTable('warranty_claims', caslib = 'casuser')
 
    ## Store the column names and labels in a dataframe
    df_col_names = castbl.columnInfo()['ColumnInfo'].loc[:,['Column','Label']]
 
    ## Create a list of dictionaries of how to rename each column using the column labels
    renameColumns = []
    for row in df_col_names.iterrows():
        colName = row[1].values[0]
        labelName = row[1].values[1].replace(' ','_')
        renameColumns.append(dict(name=colName, rename=labelName))
 
    ## List of columns to keep in the CAS table
    keepColumns = {'Campaign_Type', 'Platform','Trim_Level','Make','Model_Year','Engine_Model',
                   'Vehicle_Assembly_Plant','Claim_Repair_Start_Date', 'Claim_Repair_End_Date'}
 
    ## Rename and drop columns to make the table easier to use
    castbl.alterTable(columns = renameColumns, keep = keepColumns)
 
    return castbl
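As a side note, the rename-list construction inside prep_data can be tried out with plain pandas, no CAS connection required. The two-column DataFrame below is a hypothetical stand-in for the columnInfo output:

```python
import pandas as pd

# Hypothetical stand-in for the ColumnInfo result: column names and labels.
df_col_names = pd.DataFrame({
    'Column': ['campaign_type', 'platform'],
    'Label':  ['Campaign Type', 'Platform']
})

# Same pattern as in prep_data: one name/rename dictionary per column,
# with spaces in the label replaced by underscores.
renameColumns = []
for row in df_col_names.iterrows():
    colName = row[1].values[0]
    labelName = row[1].values[1].replace(' ', '_')
    renameColumns.append(dict(name=colName, rename=labelName))

print(renameColumns)
```

Each dictionary pairs the current column name with its new name, which is exactly the shape the alterTable action expects in its columns parameter.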

Next, I'll execute the user-defined function, store the CAS table object in the variable tbl, and view its type.

tbl = prep_data()
type(tbl)
 
# and the results
NOTE: Cloud Analytic Services made the file WARRANTY_CLAIMS_0117.sashdat available as table WARRANTY_CLAIMS in caslib CASUSER(Peter).
swat.cas.table.CASTable

The results show that the WARRANTY_CLAIMS_0117.sashdat is available in the CAS server, and tbl is a CASTable object.

Lastly, I'll preview the distributed CAS table using the SWAT package head method.

tbl.head()

The results show a preview of the WARRANTY_CLAIMS CAS table. The table provides data on warranty claims for car repairs. The data in this example is small for training purposes. Processing data in the CAS server's massively parallel processing environment is typically reserved for larger data.

Using the Pandas API in the SWAT package - value_counts method

I'll begin by using the Pandas API in the SWAT package, which provides the value_counts method. The value_counts method works like its Pandas counterpart. For example, I'll obtain the count of unique values in the Engine_Model CAS table column. I'll store the results in vc, then display the type and value of vc.

vc = (tbl               ## CAS table
      .Engine_Model     ## CAS table column
      .value_counts()   ## SWAT value_counts method
     )
 
## Display the type and value
display(type(vc),vc)

The SWAT value_counts method summarizes the data in the distributed CAS server and returns a Pandas Series to the Python client. Once you have the Pandas Series on the client, you can work with it as you normally would. For example, I'll plot the Series using the Pandas plot method.

vc.plot(kind = 'bar', figsize=(8,6));
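Because the SWAT value_counts mirrors its pandas namesake, you can sketch the shape of the Series it returns with plain pandas alone; the engine-model codes below are hypothetical stand-ins for the CAS column:

```python
import pandas as pd

# Hypothetical engine-model codes standing in for the CAS table column.
engine_model = pd.Series(['PE', 'PX', 'PE', 'QT', 'PE', 'PX'])

# value_counts returns the counts of unique values, sorted descending.
vc = engine_model.value_counts()
print(vc)
```

The index holds the unique values and the data holds their counts, which is why the result plots directly with the Pandas plot method.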

In this example, I used the Pandas API in the SWAT package to summarize data on the CAS server's massively parallel processing environment to return smaller, summarized results to the Python client. Once the summarized results are on the client, I'll work with them using other Python packages like Pandas.

Using the freq CAS action

Instead of using the Pandas API in the SWAT package, you can achieve similar results using native CAS actions. In SWAT, CAS actions are simply specified as methods. One action that provides the count of unique values is the simple.freq CAS action.

For example, I can find the count of unique values for multiple columns within the freq action. Here, I'll specify the Engine_Model, Model_Year and Campaign_Type columns in the inputs parameter. Then, I'll use the Frequency key on the results of the action to obtain the SASDataFrame stored in the dictionary returned to the Python client. Remember, CAS actions always return a dictionary, or CASResults object, to the Python client. You must use familiar dictionary manipulation techniques to work with the results of an action. For more information on working with results of CAS actions, check out Part 2 - Working with CAS Actions and CASResults Objects.

## Columns to analyze
colNames = ['Engine_Model', 'Model_Year', 'Campaign_Type']
 
## Execute the freq CAS action and store the SASDataFrame
freq_df = tbl.freq(inputs = colNames)['Frequency']
 
## Display the type and DataFrame
display(type(freq_df), freq_df)

Again, the action processes the data in the distributed CAS server and returns results to the Python client. The results show the freq action counts the unique values of each column and stores the results in a single SASDataFrame. Once you have the SASDataFrame on the client, you can work with it like you would a Pandas DataFrame.

For example, I'll loop over each analysis column, query the SASDataFrame for the specific column name, and then plot the count of unique values of each column using the familiar Pandas package.

for column in colNames:
    (freq_df
     .query('Column == @column')
     .loc[:,['CharVar','Frequency']]
     .set_index('CharVar')
     .sort_values(by = 'Frequency', ascending=False)
     .plot(kind='bar', figsize=(8,6), title = f'The {column} Column')
    )

The loop produces a visualization of the count of unique values for each analysis column. This was all done using familiar Pandas code on the client side. Remember, the distributed CAS server did all of the processing and summarization, then returned smaller summarized results to the Python client.

Using the freqTab CAS action

Lastly, you can use the freqTab.freqTab CAS action to construct frequency and crosstabulation tables. The freqTab action provides a variety of additional features and information. The action is not loaded by default, so I'll begin by loading the action set.

conn.loadActionSet('freqTab')

Then I'll use the freqTab action in the freqTab action set to count the unique values for the Model_Year and Engine_Model columns, and also count the unique values of Engine_Model by Model_Year.

tbl.freqTab(tabulate = [
                'Model_Year',
                'Engine_Model',
                {'vars':['Engine_Model','Model_Year']}
            ]
    )

The results above show that the freqTab action returns a dictionary with a variety of information. The first SASDataFrame holds level information, the second shows the number of observations used, and the remaining SASDataFrames hold the two one-way frequency tables for Model_Year and Engine_Model, plus the crosstabulation of Engine_Model by Model_Year (which also includes totals).

With the results on the Python client, you can begin accessing and manipulating the SASDataFrames as needed.
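For instance, since a CASResults object behaves like an ordered dictionary of DataFrames, pulling out one table is ordinary dictionary work. The sketch below uses a plain dict with hypothetical key names and values in place of the real freqTab output:

```python
import pandas as pd

# Hypothetical stand-in for a CASResults object: an ordinary dict of
# DataFrames with made-up key names and values.
results = {
    'Levels': pd.DataFrame({'Variable': ['Model_Year', 'Engine_Model'],
                            'Levels': [5, 7]}),
    'OneWayFreqs': pd.DataFrame({'Model_Year': ['2015', '2016'],
                                 'Frequency': [100, 120]})
}

# List the available keys, then pull one table and use it as a DataFrame.
keys = list(results)
one_way = results['OneWayFreqs']
print(keys, one_way.Frequency.sum())
```

The real key names depend on the action and its parameters, so in practice you would inspect the returned keys first rather than hard-coding them.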

Summary

The SWAT package blends the world of Pandas and CAS. You can use many of the familiar Pandas methods within the SWAT package, like value_counts, or the flexible, highly optimized CAS actions, like simple.freq and freqTab.freqTab, to obtain counts of unique values in the massively parallel processing CAS engine. For more examples of the freq and freqTab CAS actions, check out my four-part CAS action series (part 1, part 2, part 3 and part 4). That series executes CAS actions using the native CAS language; however, with some small changes to the syntax you can execute the same actions using Python.

Additional and related resources

Getting started with Python integration to SAS® Viya® - Part 12 - Count of Unique Values was published on SAS Users.

December 22, 2022
 

The addition of the PYTHON procedure and Python editor in SAS Viya enables users to execute Python code in SAS Studio. This new capability in SAS Viya adds another tool to SAS's existing collection. With this addition I thought, how can I utilize this newfound power?

In this example, I'll keep it simple. I want to create a Microsoft Excel report using a combination of SAS, Python and SQL. I'll use data stored in a SAS library; however, the library could reference data stored anywhere, like a path, a database or the cloud. I'll write a single program that prepares the data, creates the Excel workbook, and fills it with output from SAS, Python and SQL.

All code used in this post is located on GitHub.

Set folder path and file name

To begin, I'll create a macro variable to specify the output folder path and Microsoft Excel workbook name.

%let path=/*Enter your output folder path*/;
%let xlFileName = myExcelReport.xlsx;

Prepare data

Next, I'll prepare the data using the SAS DATA step. I'll use the available sashelp.cars table, create a new column named MPG_Avg, and drop unnecessary columns. Instead of the DATA step, you could use Python or SQL to prepare the data; whatever tool works best for you.

data work.cars;
    set sashelp.cars;
    MPG_Avg=mean(MPG_City, MPG_Highway);
    drop Wheelbase Weight Length;
run;
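For comparison, the same preparation can be sketched in pandas; the two-row table below is a hypothetical stand-in for sashelp.cars:

```python
import pandas as pd

# Hypothetical two-row stand-in for sashelp.cars.
cars = pd.DataFrame({
    'Make':        ['Acura', 'Audi'],
    'Origin':      ['Asia', 'Europe'],
    'MPG_City':    [17, 18],
    'MPG_Highway': [23, 25],
    'Wheelbase':   [106, 104],
    'Weight':      [4451, 3880],
    'Length':      [189, 197]
})

# MEAN(MPG_City, MPG_Highway) and DROP, expressed with pandas.
cars = (cars
        .assign(MPG_Avg=lambda df: df[['MPG_City', 'MPG_Highway']].mean(axis=1))
        .drop(columns=['Wheelbase', 'Weight', 'Length']))

print(cars)
```

The assign call mirrors the DATA step's MEAN function applied row by row, and drop mirrors the DROP statement.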

Create the Microsoft Excel workbook

After the data is ready, I'll use the ODS EXCEL statement to create the Excel spreadsheet. The following ODS options are used:

  • FILE - specifies the file path and name
  • STYLE - modifies the appearance of the SAS output
  • EMBEDDED_TITLES - specifies whether titles should appear in the worksheet
  • SHEET_INTERVAL - enables manual control of when to create a new worksheet
ods excel file="&path./&xlFileName"
          style=ExcelMidnight
          options(embedded_titles="on");

Worksheet 1

Print the data using SAS

With the ODS EXCEL destination open, I'll name the first worksheet Data and manually specify when a new sheet is created. Next, I'll use the PRINT procedure to print the detailed data to Excel. The PRINT procedure writes the entire SAS data set, with the associated formats and styles, to the worksheet.

* Sheet 1 - Print the data using SAS *;
ods excel options(sheet_name='Data' sheet_interval='none');
title height=16pt color=white "Detailed Car Data";
proc print data=work.cars noobs;
run;

Worksheet 2

Create violin plots using Python

Next, I want to create violin plots on a new worksheet named Origin_MPG. These can be created in SAS, but I personally found the matplotlib package in Python a bit easier to use. With the PYTHON procedure, I can include the Python code within the SAS program (or reference a .py file) to create the visualization. Then I'll use the SAS.pyplot method to save and render the visualization. Since the pyplot callback renders the image in the results tab, it exports the image to the Excel workbook by default.

First I'll use ODS EXCEL to create the new worksheet and the TITLE statement to add a title to the Excel worksheet.

ods excel options(sheet_name='Origin_MPG' sheet_interval='now');
title justify=left height=16pt color=white "Analyzing MPG by Each Car Origin";

Then I'll execute the PYTHON procedure to execute my Python code.

* Create violin plots using Python *;
proc python;
submit;
 
##
## Import packages and options
##
 
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
outpath = SAS.symget('path')
 
##
## Data prep for the visualization
##
 
## Load the SAS table as a DataFrame
df = (SAS
      .sd2df('work.cars')                 ## SAS callback method to load the SAS data set as a DataFrame
      .loc[:,['Origin','MPG_Avg']]        ## Keep the necessary columns
)
 
 
## Create a series of MPG_Avg for each distinct origin for the violin plots
listOfUniqueOrigins = df.Origin.unique().tolist()
 
mpg_by_origin = {}
for origin in listOfUniqueOrigins:
    mpg_by_origin[origin] = df.query(f'Origin == @origin ').MPG_Avg
 
 
##
## Create the violin plots
##
 
## Violin plot
fig, ax = plt.subplots(figsize = (8,6))
ax.violinplot(mpg_by_origin.values(), showmedians=True)
 
## Plot appearance
ax.set_title('Miles per Gallon (MPG) by Origin')
rename_x_axis = {'position': [1,2,3], 'labels':listOfUniqueOrigins}
ax.set_xticks(rename_x_axis['position'])
ax.set_xticklabels(rename_x_axis['labels']);
 
## Save and render the image file
SAS.pyplot(plt, filename='violinPlot',filepath=outpath)
 
endsubmit;
quit;
title;

SQL Aggregation

SQL is an extremely common and useful language for data analysts and scientists. I find using SQL for aggregation easy, so I will create a simple aggregation and add it below the visualization on the same worksheet in the Excel report.

* SQL Aggregation *;
title justify=left "Average MPG by Car Origin";
proc sql;
select Origin, round(mean(MPG_Avg)) as AverageMPG
	from work.cars
	group by Origin
	order by AverageMPG desc;
quit;
title;
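The same aggregation can be sketched with a pandas groupby; the rows below are a hypothetical stand-in for work.cars:

```python
import pandas as pd

# Hypothetical rows standing in for work.cars, MPG_Avg already computed.
cars = pd.DataFrame({
    'Origin':  ['Asia', 'Asia', 'USA', 'USA', 'Europe'],
    'MPG_Avg': [24.0, 26.0, 20.0, 22.0, 23.0]
})

# Equivalent of: SELECT Origin, ROUND(MEAN(MPG_Avg)) AS AverageMPG
#                GROUP BY Origin ORDER BY AverageMPG DESC
avg_mpg = (cars
           .groupby('Origin', as_index=False)['MPG_Avg'].mean()
           .rename(columns={'MPG_Avg': 'AverageMPG'})
           .round({'AverageMPG': 0})
           .sort_values('AverageMPG', ascending=False, ignore_index=True))

print(avg_mpg)
```

Each PROC SQL clause maps to one method in the chain: mean for the aggregate, round for ROUND, and sort_values for ORDER BY ... DESC.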

Add text

At the end of the same worksheet I'll add some simple text using the ODSTEXT procedure to give some information about the data.

proc odstext;
   heading 'NOTES';
   p 'Using the SASHELP.CARS data. The following car Origins were analyzed:';
   list ;
      item 'Asia';
      item 'Europe';
      item 'USA';
   end;    
   p 'Created by Peter S';
quit;

Close the Excel workbook

Lastly, I'll close the ODS EXCEL destination since I am done writing out to Excel.

ods excel close;

Results

That's it! Now I'll execute the entire program and view the Excel workbook.

Summary

With the capabilities of SAS and the new ability to execute Python code in SAS Studio, teams have a variety of tools in SAS Viya for their analytic needs.

Additional resources

PYTHON Procedure documentation
SAS opens its code editor interface to Python users
Using PROC PYTHON to augment your SAS programs
ODS Excel Statement

Creating a Microsoft Excel report using SAS, Python and SQL! was published on SAS Users.