SAS programmers

March 22, 2023
 

Welcome to the continuation of my series Getting Started with Python Integration to SAS Viya. In this post I'll discuss how to count missing values in a CAS table using the Python SWAT package.

Load and prepare data

First, I connect my Python client to the distributed CAS server and name my connection conn. Next, I create a small pandas DataFrame with missing values for the demonstration. Then I load the DataFrame to the CAS server using the upload_frame method, name the CAS table MISSING_DATA, and place it in the Casuser caslib. The upload_frame method returns a reference to the CAS table in the variable castbl.

For more information about uploading client-side data to the CAS server, check out my previous post.

## Import packages
import swat
import numpy as np
import pandas as pd
 
conn = ## your CAS connection information
 
## Create a simple dataframe
df = pd.DataFrame([
            [np.nan, 2, 45, 0, 'A'],
            [3, 4, np.nan, 1,'A'],
            [np.nan, np.nan, 50, np.nan,'B'],
            [np.nan, 3, np.nan, 4,],
            [2, 2, np.nan, 0, 'A'],
            [3, 4, np.nan, 1,'A'],
            [np.nan, np.nan, 75, np.nan,'B'],
            [np.nan, 3, 60, 4,]
            ],
            columns=['col1','col2','col3','col4','col5'])
 
## Upload the dataframe to the CAS server as a CAS table
castbl = conn.upload_frame(df,
                           casout = {'name':'missing_data', 
                                     'caslib':'casuser', 
                                     'replace':True})
 
# and the results
NOTE: Cloud Analytic Services made the uploaded file available as table MISSING_DATA in caslib CASUSER(Peter).
NOTE: The table MISSING_DATA has been created in caslib CASUSER(Peter) from binary data uploaded to Cloud Analytic Services.

The results show that the MISSING_DATA CAS table has been loaded to the CAS server.

Lastly, I'll preview the distributed CAS table using the SWAT package head method.

castbl.head(10)

Count missing values in a CAS table

There are a variety of ways to count the number of missing values in a CAS table. Counting missing values in CAS tables with the SWAT package is not exactly the same as in pandas. However, it's just as easy. Let's look at a few different methods.

Using the SWAT package nmiss method

For users experienced with pandas, you traditionally run the isna and sum methods to find the number of missing values in each column of a DataFrame. For CAS tables it's even easier. You can use the SWAT package nmiss method. The nmiss method returns the total number of missing values in each column.
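
For comparison, here is the traditional client-side pandas pattern run against the original df DataFrame (a quick sketch; it executes entirely on the Python client, not in CAS):

## Client-side pandas equivalent on the local DataFrame
df.isna().sum()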

Here I'll specify the CAS table object castbl, then the SWAT nmiss method.

castbl.nmiss()

The distributed CAS server counts the number of missing values in each column and returns a Series to the Python client. Once you have the Series object you can use traditional pandas for additional processing. For example, I'll chain the pandas plot method after the SWAT nmiss method to plot the series. The summarization occurs in the distributed CAS server and the plotting occurs on the client.

(castbl                    ## CAS table
 .nmiss()                  ## SWAT method
 .plot(kind = 'bar'))      ## pandas plot

You can also use the nmiss method to find the number of missing values in specific columns. Simply specify the CASTable object, the columns, then the method.

colNames = ['col1','col5']
 
castbl[colNames].nmiss()

Using the Distinct CAS action

The distinct CAS action is one of my favorite actions. It not only finds the number of missing values, it also gives you the number of distinct values in each column. Keep in mind that, depending on the size of your data, finding the number of distinct values in each column can be more resource intensive than simply counting missing values.

To use the distinct CAS action specify the CASTable object, castbl, then the distinct CAS action.

castbl.distinct()

The distinct action returns a CASResults object (dictionary) back to the Python client with the number of distinct and missing values.
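
Because the returned object behaves like a Python dictionary, you can extract the underlying SASDataFrame by its key. A minimal sketch, assuming the action stores its result table under the Distinct key:

## Store the action results and extract the SASDataFrame by its key
distinct_df = castbl.distinct()['Distinct']
display(type(distinct_df), distinct_df)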

In the distinct action you can use the inputs parameter to specify the columns you want to summarize.

castbl.distinct(inputs = colNames)

The results show the number of distinct and missing values in the specified columns. Now, with the summarized data from the CAS server in a CASResults object, you can use your Python client to continue working with the data. For more information on working with CASResults objects, check out my previous post.

Using the summary CAS action

You can also use the summary CAS action to find the number of missing values for numeric columns. The summary action will also generate a variety of descriptive statistics such as the mean, variance, size, sum of squares and more.

castbl.summary()

The results show the descriptive statistics. The NMiss column contains the number of missing values for the numeric columns.
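
Since the summary action also returns a CASResults object, you can isolate just the missing-value counts on the client. A minimal sketch, assuming the result table is stored under the Summary key with Column and NMiss columns as shown above:

## Extract the missing value counts from the Summary SASDataFrame
summary_df = castbl.summary()['Summary']
summary_df[['Column','NMiss']]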

Within the summary action you can also specify the columns to analyze with the inputs parameter, and the summary statistics to generate with the subSet parameter.

castbl.summary(inputs = ['col1','col2'], 
               subSet = ['min','max','nmiss'])

Summary

The SWAT package blends the world of pandas and CAS to process your distributed data. In this example I focused on using specific SWAT methods and CAS actions to count the number of missing values in CAS table columns.

Additional and related resources

Getting Started with Python Integration to SAS® Viya® - Part 15 - Count Missing Values in a CAS Table was published on SAS Users.

February 14, 2023
 

Welcome to the continuation of my series Getting Started with Python Integration to SAS Viya. In this post I'll discuss how to bring a distributed CAS table back to your Python client as a DataFrame.

In this example, I'm using Python on my laptop (Python client) to connect to the CAS server. Bringing a distributed CAS table back to the Python client as a local DataFrame is typically appropriate for smaller data. If the data sets become larger, it is more efficient to process the data in the CAS server, or bring a smaller subset of data back to the client. The big question is, why would I want to bring a CAS table out of the CAS server's massively parallel processing (MPP) environment back to my client as a DataFrame?

 

The CAS server in SAS Viya is set up to connect to a variety of data sources throughout your organization like databases, folder paths, cloud data and more. Typically these connections are set up by your administrator, which makes it easy to access data. In this example, maybe the data you need is available to the CAS server and is small enough that you don't need the power of the CAS server's MPP environment. Maybe you simply want to pull a CAS table back to your client as a DataFrame and use a familiar Python package like numpy, pandas, seaborn, matplotlib, or scikit-learn for data exploration, preprocessing, visualization or machine learning.

Use caution when bringing data back to your Python client

With the to_frame SWAT method you can easily transfer a CAS table to your Python client as a DataFrame. When using the to_frame method YOU MUST USE CAUTION. It will attempt to pull all of the data down from the CAS server regardless of size. If the data is large, this can be time consuming or overwhelm your Python client depending on your available memory.
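
One defensive pattern is to check the size of the CAS table before transferring it. Here is a minimal sketch; the 100,000-row threshold is an arbitrary assumption you should tune to your client's available memory:

## Check the CAS table dimensions (computed on the CAS server) before transferring
nRows, nCols = castbl.shape
 
if nRows <= 100000:
    df = castbl.to_frame()
else:
    print(f'{nRows:,} rows may be too large to transfer; consider summarizing in CAS first.')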

Load the demonstration data into memory

I've created a connection to my CAS server in the variable conn. I'll use my conn connection object and the SWAT upload_file method to load the cars.csv file from the Example Data Sets for the SAS® Viya® Platform website into memory on the CAS server. I'll add some formats and labels to the CAS table. This is a small table for demonstration purposes.

fileurl = 'https://support.sas.com/documentation/onlinedoc/viya/exampledatasets/cars.csv'
castbl = conn.upload_file(fileurl, 
                          casout = {'name':'cars_cas_table', 
                                    'caslib':'casuser', 
                                    'replace':True},
                          importoptions = {
                              'fileType':'csv',
                              'guessRows':100,
                              'vars': {
                                  'Invoice':{'format':'dollar16.2','label':'Invoice Price'},
                                  'MSRP':{'format':'dollar16.2', 'label':'Manufacturer Suggested Retail Price'},
                                  'Weight':{'format':'comma16.'},
                                  'EngineSize':{'label':'Engine Size (L)'},
                                  'Length':{'label':'Length (IN)'},
                                  'MPG_City':{'label':'MPG (City)'},
                                  'MPG_Highway':{'label':'MPG (Highway)'}
                              }
                          })
 
# and the results
NOTE: Cloud Analytic Services made the uploaded file available as table CARS_CAS_TABLE in caslib CASUSER(Peter).
NOTE: The table CARS_CAS_TABLE has been created in caslib CASUSER(Peter) from binary data uploaded to Cloud Analytic Services.

Next, I'll execute the tableInfo action to confirm the table was loaded into memory.

conn.tableInfo(caslib = 'casuser')


Lastly I'll execute the columnInfo action to view the column information of the CAS table.

castbl.columnInfo()

 

The results above show the CARS_CAS_TABLE has 15 columns. Some of the columns contain SAS labels and SAS formats.

Pull the entire CAS table to the client as a DataFrame

We saw earlier that the CARS_CAS_TABLE is a small in-memory table on the CAS server. Let's pretend the data was connected to some cloud data storage and was loaded into memory. I noticed that the data is small, and I want to use familiar Python packages to process it.

The first thing I can do is use the to_frame SWAT method on my castbl object to convert the CAS table to a DataFrame. I'll store the result in df and then view its type.

df = castbl.to_frame()
display(type(df))
 
# and the results
swat.dataframe.SASDataFrame

 

The results show I now have a SASDataFrame. A SASDataFrame lives on the Python client and is a subclass of pandas.DataFrame. Therefore, anything you can do with a pandas.DataFrame will also work with a SASDataFrame. The only difference is that a SASDataFrame object contains extra metadata from the CAS table. For more information on CASTable vs DataFrame vs SASDataFrame check out the SWAT documentation.
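
You can verify the subclass relationship directly with a quick check like this sketch:

import pandas as pd
 
## A SASDataFrame is also a pandas DataFrame
isinstance(df, pd.DataFrame)
 
# and the results
True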

You can view the extra metadata of a SASDataFrame with the colinfo attribute.

df.colinfo

The results show the SASDataFrame contains a variety of information about the columns like the data type, name and width. If the column contains a SAS label or format, that is also stored.

Once you have the SASDataFrame on your Python client, you can use pandas. For example, here I'll execute the pandas plot method to create a scatter plot of EngineSize by MPG_City.

df.plot.scatter(x = 'EngineSize', y = 'MPG_City', title = 'Engine Size by MPG City');

The results show a simple scatter plot using pandas.

Apply column labels to a SASDataFrame

Sometimes you might want to use column labels, as they are more descriptive. Traditional pandas DataFrames don't have the concept of labels, but CAS tables do. CAS enables you to add more descriptive labels to columns. Since a SASDataFrame stores information from the CAS table, you can use the SWAT apply_labels method to apply labels to the SASDataFrame. For more information on adding labels to a CAS table, check out my previous post on renaming columns.

df.apply_labels()

Notice the apply_labels method applies the SAS column labels (Manufacturer Suggested Retail Price, Invoice Price, Engine Size (L), Length (IN), MPG (City) and MPG (Highway)) to the SASDataFrame.

Pull a sample of the CAS table to the client as a DataFrame

What if the CAS table is big, but you want to work with a sample of the data using a specific Python package? Well, you can use the sample_pct parameter in the to_frame SWAT method to pull a smaller subset of data to your Python client.

Here I'll specify I want ten percent of the data from the CAS table. I'll also add the sample_seed parameter to pull the same random subset of data. Then I'll check the shape attribute to count the number of rows and columns in the new SASDataFrame.

df = castbl.to_frame(sample_pct = .1, sample_seed = 99)
df.shape
 
# and the results
(43, 15)

The results show that only 43 rows of data were returned from the CAS table as a SASDataFrame on the Python client.

Apply SAS formats to the SASDataFrame

Lastly, what if you want to apply the SAS formats of a CAS table to your SASDataFrame? You can do that using the format parameter in the SWAT to_frame method, as seen in the code below. I'll also display five rows of the SASDataFrame and the data types of the columns.

df_formats = castbl.to_frame(format=True)
 
display(df_formats.head(), df_formats.dtypes)

The results show the DOLLAR format was applied to the MSRP and Invoice columns, and the COMMA format was applied to the Weight column.

The dtypes attribute shows that when using the format parameter in the to_frame method, all columns are returned to your Python client as objects, even if they do not contain a format.
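
If you need numeric values back after a formatted transfer, one option is to clean the strings client-side with pandas. A sketch, assuming the DOLLAR and COMMA formats only introduced dollar signs and commas:

## Strip formatting characters and restore a numeric dtype (client-side)
import pandas as pd
 
df_formats['MSRP'] = pd.to_numeric(df_formats['MSRP'].str.replace('[$,]', '', regex=True))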

Summary

The SWAT package blends the world of Pandas and CAS. It enables you to use the CAS server's massively parallel processing engine for data exploration, preprocessing and analytics. It also enables you to easily transfer your distributed CAS tables back to your Python client as SASDataFrames for additional processing using other Python packages.

The one thing you must remember when using the to_frame method is that transferring large data can be time consuming or take up all of your Python client's resources. USE IT WITH CAUTION.

Additional and related resources

Getting Started with Python Integration to SAS® Viya® - Part 14 - CAS Table to DataFrame was published on SAS Users.

February 7, 2023
 

Welcome to the continuation of my series Getting Started with Python Integration to SAS Viya. In previous posts, I discussed how to connect to the CAS server, how to execute CAS actions, and how your data is organized on the CAS server. In this post I'll discuss loading client-side CSV files into CAS.

Loading data from a client-side file is appropriate for smaller data. If the data sets become larger, it is more efficient to use a server-side load with the loadTable action to access and load the data. For more information, check out my previous post on loading server-side files into memory.

In this example, I'm using Python on my laptop to connect to the CAS server. Client-side files consist of files on my laptop (client). In these examples I'll load the heart.csv file from the SAS® Viya® Example Data Sets webpage to the CAS server. There are multiple ways to load client-side CSV files into CAS. I'll show you how to do it using the familiar Pandas API in the SWAT package with the read_csv method. I'll also show you how to use the upload_file and upload_frame SWAT methods.

Using the Pandas API in the SWAT package - read_csv

First, I'll begin with the familiar read_csv method from the SWAT package and store the results in castbl. The SWAT read_csv method calls the Pandas read_csv method, creates a client-side DataFrame, then uploads the DataFrame to the CAS server as an in-memory table. The main difference is the casout parameter. The casout parameter is specific to the SWAT package. It enables you to specify the output CAS table information like the name of the new distributed table and the in-memory location. Here, I'll name the CAS table heart_read_csv and place it in the Casuser caslib. I'll also specify the replace parameter to replace the CAS table if it already exists.

castbl = conn.read_csv(r'https://support.sas.com/documentation/onlinedoc/viya/exampledatasets/heart.csv',
                      casout = {'name':'heart_read_csv', 
                                'caslib':'casuser', 
                                'replace':True})
 
# and the results
NOTE: Cloud Analytic Services made the uploaded file available as table HEART_READ_CSV in caslib CASUSER(Peter).
NOTE: The table HEART_READ_CSV has been created in caslib CASUSER(Peter) from binary data uploaded to Cloud Analytic Services.

Next, I'll execute the tableInfo CAS action to view available CAS tables in the Casuser caslib.

conn.tableInfo(caslib = 'casuser')

The results above show that the table was loaded to the CAS server.

Let's view the type and value of the castbl object.

display(type(castbl), castbl)
 
# and the results
swat.cas.table.CASTable
CASTable('HEART_READ_CSV', caslib='CASUSER(Peter)')

The results show that castbl is a CASTable object and simply references the CAS table on the CAS server.

Lastly, I'll preview the CAS table with the SWAT head method.

castbl.head()

The SWAT head method returns 5 rows from the CAS table to the client.

The read_csv method enables you to use all of the available Pandas parsers through the SWAT package to upload data to the distributed CAS server. The read_csv method parses the data on the client into a DataFrame, and then uploads the DataFrame to the CAS server.
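
That means familiar pandas parsing parameters generally pass straight through. For example, this sketch limits the columns and rows parsed on the client before the upload (the parameter choices here are purely illustrative):

castbl_subset = conn.read_csv(r'https://support.sas.com/documentation/onlinedoc/viya/exampledatasets/heart.csv',
                              usecols = ['Status','Sex','Weight'],   ## pandas parsing parameter
                              nrows = 500,                           ## pandas parsing parameter
                              casout = {'name':'heart_subset',
                                        'caslib':'casuser',
                                        'replace':True})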

Using the upload_file method in the SWAT package

Instead of using the Pandas API through the SWAT package, you can also use the upload_file SWAT method. The upload_file method transfers the file to the CAS server, where all parsing is done on the server. The upload_file method is not as robust as the read_csv method, but it can be a bit faster than client-side parsing. The upload_file method can also upload other file types to the CAS server, not just CSV files.

Here, I'll load the same CSV file to the CAS server as in the previous example. This time I'll use upload_file method and name the CAS table heart_upload_file. I'll store the results in castbl2.

castbl2 = conn.upload_file(r'https://support.sas.com/documentation/onlinedoc/viya/exampledatasets/heart.csv', 
                           casout = {'name':'heart_upload_file', 
                                     'caslib':'casuser',
                                     'replace':True})     
 
# and the results
NOTE: Cloud Analytic Services made the uploaded file available as table HEART_UPLOAD_FILE in caslib CASUSER(Peter).
NOTE: The table HEART_UPLOAD_FILE has been created in caslib CASUSER(Peter) from binary data uploaded to Cloud Analytic Services.

The results show that the CSV file was uploaded to CAS successfully.

I'll view the available CAS table using the tableInfo action.

conn.tableInfo(caslib = 'casuser')


The results show that now I have two CAS tables in memory.

Using the SWAT upload_frame method

Another useful SWAT method is the upload_frame method. I like using this method if I'm preparing a DataFrame on my client using Pandas, then need to transfer the Pandas DataFrame to the CAS server for additional processing or to use another SAS Viya application like SAS Visual Analytics.

For example, here is some traditional Pandas code to read the same CSV file as before and then prepare it using traditional Pandas. The code is renaming columns, creating calculated columns and dropping a column.

## Import the Pandas package
import pandas as pd
## Read the data into a DataFrame
df_raw = pd.read_csv(r'https://support.sas.com/documentation/onlinedoc/viya/exampledatasets/heart.csv')
 
## Prepare the DataFrame
df = (df_raw
      .rename(columns = lambda colName: colName.upper())
      .assign(
         STATUS = lambda _df: _df.STATUS.str.upper(),
         DEATHCAUSE = lambda _df: _df.DEATHCAUSE.fillna('Still Alive').str.lower()
        )
     .drop('AGEATSTART', axis=1)
)
df.head()

Now that I have the final DataFrame, I can simply upload it to the CAS server using the upload_frame method. Again, the casout parameter specifies output CAS table information.

castbl3 = conn.upload_frame(df, casout = {'name':'heart_upload_frame', 
                                          'caslib':'casuser', 
                                          'replace':True})
 
# and the results
NOTE: Cloud Analytic Services made the uploaded file available as table HEART_UPLOAD_FRAME in caslib CASUSER(Peter).
NOTE: The table HEART_UPLOAD_FRAME has been created in caslib CASUSER(Peter) from binary data uploaded to Cloud Analytic Services.

The results show that the Pandas DataFrame was successfully uploaded to the CAS server.

Lastly, I'll view the available CAS tables one more time with the tableInfo action, as shown below.
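
conn.tableInfo(caslib = 'casuser')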

Summary

In this post I discussed using the read_csv, upload_file and upload_frame methods for loading client-side CSV files into CAS. Depending on the size of your data and your parsing needs, you may consider one method over the other.

Loading client-side data into memory on the CAS server will be slower than loading server-side files into memory. Remember, server-side files are data sources that the CAS server has direct access to, like a network path or database. Client-side files are available on your Python client. Client-side data loading is intended for smaller data sets. For more information check out the Client-Side Data Files and Sources section in the SWAT documentation.

Additional and related resources

Getting started with Python integration to SAS® Viya® - Part 13 - Loading a Client-Side CSV File into CAS was published on SAS Users.

January 17, 2023
 

Welcome to the continuation of my series Getting Started with Python Integration to SAS Viya. In previous posts, I discussed how to connect to the CAS server, working with CAS actions and CASResults objects, and how to summarize columns. Now it's time to focus on how to get the count of unique values in a CAS table column.

Load and prepare data

First, I connected my Python client to the distributed CAS server and named my connection conn. Then I created a function to load and prepare my CAS table. The custom function loads the WARRANTY_CLAIMS_0117.sashdat file from the Samples caslib into memory, renames the columns using the column labels and drops unnecessary columns. This simplifies the table for the demonstration.

The Samples caslib should be available in your SAS Viya environment and contains sample tables. For more information on how to rename columns in a CAS table, see Part 11 - Rename Columns.

## Packages
import swat
import pandas as pd
 
## Options
pd.set_option('display.max_columns', 50)
 
## Connect to CAS
conn = ## your connection information
 
def prep_data():
    ## Load the data into CAS
    conn.loadTable(path='WARRANTY_CLAIMS_0117.sashdat', caslib='samples',
                   casout={'name':'warranty_claims', 'caslib':'casuser'})
 
    ## Reference the CAS table in an object
    castbl = conn.CASTable('warranty_claims', caslib = 'casuser')
 
    ## Store the column names and labels in a dataframe
    df_col_names = castbl.columnInfo()['ColumnInfo'].loc[:,['Column','Label']]
 
    ## Create a list of dictionaries of how to rename each column using the column labels
    renameColumns = []
    for row in df_col_names.iterrows():
        colName = row[1].values[0]
        labelName = row[1].values[1].replace(' ','_')
        renameColumns.append(dict(name=colName, rename=labelName))
 
    ## List of columns to keep in the CAS table
    keepColumns = {'Campaign_Type', 'Platform','Trim_Level','Make','Model_Year','Engine_Model',
                   'Vehicle_Assembly_Plant','Claim_Repair_Start_Date', 'Claim_Repair_End_Date'}
 
    ## Rename and drop columns to make the table easier to use
    castbl.alterTable(columns = renameColumns, keep = keepColumns)
 
    return castbl

Next, I'll execute the user defined function, store the CAS table object in the variable tbl, and view its type.

tbl = prep_data()
type(tbl)
 
# and the results
NOTE: Cloud Analytic Services made the file WARRANTY_CLAIMS_0117.sashdat available as table WARRANTY_CLAIMS in caslib CASUSER(Peter).
swat.cas.table.CASTable

The results show that the WARRANTY_CLAIMS_0117.sashdat is available in the CAS server, and tbl is a CASTable object.

Lastly, I'll preview the distributed CAS table using the SWAT package head method.

tbl.head()

The results show a preview of the WARRANTY_CLAIMS CAS table. The table provides data on warranty claims for car repairs. The data in this example is small for training purposes. Processing data in the CAS server's massively parallel processing environment is typically reserved for larger data.

Using the Pandas API in the SWAT package - value_counts method

I'll begin by using the Pandas API in the SWAT package, which provides the value_counts method. The value_counts method works like its Pandas counterpart. For example, I'll obtain the count of unique values in the Engine_Model CAS table column. I'll store the results in vc, then display the type and value of vc.

vc = (tbl               ## CAS table
      .Engine_Model     ## CAS table column
      .value_counts()   ## SWAT value_counts method
     )
 
## Display the type and value
display(type(vc),vc)

The SWAT value_counts method summarizes the data in the distributed CAS server and returns a Pandas Series to the Python client. Once you have the Pandas Series on the client, you can work with it as you normally would. For example, I'll plot the Series using the Pandas plot method.

vc.plot(kind = 'bar', figsize=(8,6));

In this example, I used the Pandas API in the SWAT package to summarize data in the CAS server's massively parallel processing environment and return smaller, summarized results to the Python client. Once the summarized results are on the client, I can work with them using other Python packages like Pandas.

Using the freq CAS action

Instead of using the Pandas API in the SWAT package you can achieve similar results using native CAS actions. In SWAT, CAS actions are simply specified as a method. One action that provides the count of unique values is the simple.freq CAS action.

For example, I can find the count of unique values for multiple columns within the freq action. Here, I'll specify the Engine_Model, Model_Year and Campaign_Type columns in the inputs parameter. Then, I'll call the Frequency key after the action to obtain the SASDataFrame stored in the dictionary returned to the Python client. Remember, CAS actions always return a dictionary, or CASResults object, to the Python client. You must use familiar dictionary manipulation techniques to work with the results of an action. For more information on working with results of CAS actions, check out Part 2 - Working with CAS Actions and CASResults Objects.

## Columns to analyze
colNames = ['Engine_Model', 'Model_Year', 'Campaign_Type']
 
## Execute the freq CAS action and store the SASDataFrame
freq_df = tbl.freq(inputs = colNames)['Frequency']
 
## Display the type and DataFrame
display(type(freq_df), freq_df)

Again, the action processes the data in the distributed CAS server and returns results to the Python client. The results show the freq action counts the unique values of each column and stores the results in a single SASDataFrame. Once you have the SASDataFrame on the client, you can work with it like you would a Pandas DataFrame.

For example, I'll loop over each analysis column, query the SASDataFrame for the specific column name, and then plot the count of unique values of each column using the familiar Pandas package.

for column in colNames:
    (freq_df
     .query('Column == @column')
     .loc[:,['CharVar','Frequency']]
     .set_index('CharVar')
     .sort_values(by = 'Frequency', ascending=False)
     .plot(kind='bar', figsize=(8,6), title = f'The {column} Column')
    )

The loop produces a visualization of the count of unique values for each analysis column. This was all done using familiar Pandas code on the client side. Remember, the distributed CAS server did all of the processing and summarization, then returned smaller summarized results to the Python client.

Using the freqTab CAS action

Lastly, you can use the freqTab.freqTab CAS action to construct frequency and crosstabulation tables. The freqTab action provides a variety of additional features and information. The action is not loaded by default, so I'll begin by loading the action set.

conn.loadActionSet('freqTab')

Then I'll use the freqTab action in the freqTab action set to count the unique values for the Model_Year and Engine_Model columns, and also count the unique values of Engine_Model by Model_Year.

tbl.freqTab(tabulate = [
                'Model_Year',
                'Engine_Model',
                {'vars':['Engine_Model','Model_Year']}
            ]
    )

The results above show the freqTab action returns a dictionary with a variety of information. The first SASDataFrame contains level information, the second SASDataFrame shows the number of observations used, and the remaining SASDataFrames show the two one-way frequency tables for Model_Year and Engine_Model, and the crosstabulation of Engine_Model by Model_Year (which also includes totals).

With the results on the Python client, you can begin accessing and manipulating the SASDataFrames as needed.
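
For example, you can store the CASResults object and list its keys before selecting a specific SASDataFrame; a minimal sketch:

## Store the CASResults object, then list its keys
results = tbl.freqTab(tabulate = [
                'Model_Year',
                'Engine_Model',
                {'vars':['Engine_Model','Model_Year']}
            ])
 
list(results.keys())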

Summary

The SWAT package blends the world of Pandas and CAS. You can use many of the familiar Pandas methods within the SWAT package like value_counts, or the flexible, highly optimized CAS actions like simple.freq and freqTab.freqTab to obtain counts of unique values in the massively parallel processing CAS engine. For more examples on the freq or freqTab CAS actions, check out my CAS action four part series (part 1, part 2, part 3 and part 4). The four part series executes CAS actions using the native CAS language. However, with some small changes to the syntax you can execute the same actions using Python.

Additional and related resources

Getting started with Python integration to SAS® Viya® - Part 12 - Count of Unique Values was published on SAS Users.

December 22, 2022
 

The addition of the PYTHON procedure and Python editor in SAS Viya enables users to execute Python code in SAS Studio. This new capability in SAS Viya adds another tool to SAS's existing collection. With this addition I thought, how can I utilize this newfound power?

In this example, I'll keep it simple. I want to create a Microsoft Excel report using a combination of SAS, Python and SQL. I'll use data that's stored in a SAS library; however, the library could be using data stored anywhere, like a path, database or in the cloud. I'll write a program that executes the steps outlined in the sections below.

All code used in this post is located on GitHub, here.

Set folder path and file name

To begin, I'll create a macro variable to specify the output folder path and Microsoft Excel workbook name.

%let path=/*Enter your output folder path*/;
%let xlFileName = myExcelReport.xlsx;

Prepare data

Further, I'll prepare the data using the SAS DATA step. I'll use the available sashelp.cars table, create a new column named MPG_Avg, and drop unnecessary columns. Instead of using the DATA step you can use Python or SQL to prepare the data. Whatever tool works best for you.

data work.cars;
    set sashelp.cars;
    MPG_Avg=mean(MPG_City, MPG_Highway);
    drop Wheelbase Weight Length;
run;

Create the Microsoft Excel workbook

After the data is ready, I'll use the ODS EXCEL statement to create the Excel spreadsheet. The following ODS options are used:

  • FILE - specifies the file path and name.
  • STYLE - modifies the appearance of the SAS output.
  • EMBEDDED_TITLES - specifies whether titles should appear in the worksheet.
  • SHEET_INTERVAL - enables manual control of when to create a new worksheet.
ods excel file="&path./&xlFileName" 
		  style=ExcelMidnight   
		  options(embedded_titles="on");

Worksheet 1

Print the data using SAS

With the ODS EXCEL destination open I'll name the first worksheet Data, and manually specify when a new sheet is created. Next, I'll use the PRINT procedure to print the detailed data to Excel. The PRINT procedure will print the entire SAS data set with the associated formats and styles to Excel.

* Sheet 1 - Print the data using SAS *;
ods excel options(sheet_name='Data' sheet_interval='none');
title height=16pt color=white "Detailed Car Data";
proc print data=work.cars noobs;
run;

Worksheet 2

Create violin plots using Python

Next, I want to create violin plots on a new worksheet named Origin_MPG. Now, these can be created in SAS, but I personally found the matplotlib package in Python a bit easier to use. With the PYTHON procedure, I can include the Python code within the SAS program (or you can reference a .py file) to create the visualization. Then I'll use the SAS.pyplot method to save and render the visualization. Since the pyplot callback renders the image in the results tab, it exports the image to the Excel workbook by default.

First I'll use ODS EXCEL to create the new worksheet and the TITLE statement to add a title to the Excel worksheet.

ods excel options(sheet_name='Origin_MPG' sheet_interval='now');
title justify=left height=16pt color=white "Analyzing MPG by Each Car Origin";

Then I'll execute the PYTHON procedure to execute my Python code.

* Create violin plots using Python *;
proc python;
submit;
 
##
## Import packages and options
##
 
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
outpath = SAS.symget('path')
 
##
## Data prep for the visualization
##
 
## Load the SAS table as a DataFrame
df = (SAS
      .sd2df('work.cars')                 ## SAS callback method to load the SAS data set as a DataFrame
      .loc[:,['Origin','MPG_Avg']]        ## Keep the necessary columns
)
 
 
## Create a series of MPG_Avg for each distinct origin for the violin plots
listOfUniqueOrigins = df.Origin.unique().tolist()
 
mpg_by_origin = {}
for origin in listOfUniqueOrigins:
    mpg_by_origin[origin] = df.query(f'Origin == @origin ').MPG_Avg
 
 
##
## Create the violin plots
##
 
## Violin plot
fig, ax = plt.subplots(figsize = (8,6))
ax.violinplot(mpg_by_origin.values(), showmedians=True)
 
## Plot appearance
ax.set_title('Miles per Gallon (MPG) by Origin')
rename_x_axis = {'position': [1,2,3], 'labels':listOfUniqueOrigins}
ax.set_xticks(rename_x_axis['position'])
ax.set_xticklabels(rename_x_axis['labels']);
 
## Save and render the image file
SAS.pyplot(plt, filename='violinPlot',filepath=outpath)
 
endsubmit;
quit;
title;

SQL Aggregation

SQL is an extremely common and useful language for data analysts and scientists. I find using SQL for aggregation easy, so I will create a simple aggregation and add it below the visualization on the same worksheet in the Excel report.

* SQL Aggregation *;
title justify=left "Average MPG by Car Makes";
proc sql;
select Origin, round(mean(MPG_Avg)) as AverageMPG
	from work.cars
	group by Origin
	order by AverageMPG desc;
quit;
title;

Add text

At the end of the same worksheet I'll add some simple text using the ODSTEXT procedure to give some information about the data.

proc odstext;
   heading 'NOTES';
   p 'Using the SASHELP.CARS data. The following car Origins were analyzed:';
   list ;
      item 'Asia';
      item 'Europe';
      item 'USA';
   end;    
   p 'Created by Peter S';
quit;

Close the Excel workbook

Lastly, I'll close the ODS EXCEL destination since I am done writing out to Excel.

ods excel close;

Results

That's it! Now I'll execute the entire program and view the Excel workbook.

Summary

With the capabilities of SAS and the new ability to execute Python code in SAS Studio, teams have a variety of tools in SAS Viya for their analytic needs.

Additional resources

PYTHON Procedure documentation
SAS opens its code editor interface to Python users
Using PROC PYTHON to augment your SAS programs
ODS Excel Statement

Creating a Microsoft Excel report using SAS, Python and SQL! was published on SAS Users.

October 18, 2022
 

Chameleons are examples of adaptability to environments, for they change their colors to better blend into their surroundings.

The SAS applications development process typically includes the following three phases of the software development life cycle (SDLC): Development (Dev), Testing (Test) and Production (Prod).

In order to protect the Production (or Operation) environment from unforeseen disruptions, these stages are usually implemented in their own separate environments. Sometimes, Development and Testing environments are combined into a single Development environment, which just treats Testing as an integral part of the Development stage.

(For brevity of the further narrative, we will use the two-stage process/environment paradigm, which can be easily expanded as needed.)

During the software development life cycle, when our programming modules are designed and built (developed), tested, validated and have proved to reliably produce the desired results, we promote (transition) them from Dev to the Prod environment.

The software promotions are often managed by change control procedures adopted by your organization.

Software promotion limitations

However, no matter how strict the change control procedures are or how scrupulously we conduct testing and promotion from Dev to Prod, we must recognize the limitations of this paradigm. The alternative would be a fallacy of dreams in “wishful thinking” territory.

Let’s look into two highly desirable, but rarely attainable, “dreams”.

Dream 1

If software is validated and reliably works in Dev, it is guaranteed to work just fine in Prod, and therefore no testing in Prod is required.

Reality 1

Although thorough testing in Dev will keep failures in Prod to a minimum, it cannot completely eliminate them or guarantee your code's success when running in Prod. Even if a promoted module does not require a single change and you just copy it from Dev to Prod, there are still many possibilities that it will not work properly in Prod. Consider potential environmental differences such as data sources (structures and access), file system permissions and SAS metadata groups and permissions.

You will need to account for these environmental differences and still conduct some testing in Prod to make sure that the environment itself does not break your code/software.

Dream 2

When software is developed, tested and validated in Dev, promoting it to Prod is just a matter of copying it from Dev to Prod without any changes/adjustments.

Reality 2

However, in actuality, environmental differences may again require adjusting your code for the new environment. For example, data sources and data targets may have different library names or library references, or different directories or file names. If your application automatically sends out an email to a distribution list of your application users, you will most likely have different distribution lists in Dev vs. Prod.

Of course, you may capture all these differences into a configuration file, which will keep all the differences between Dev and Prod environments out of your code. However, in this case, your application becomes data-driven (with the configuration file being a data driver) and, as such, the configuration file itself becomes part of your software. Obviously, these configuration files simply cannot be copied from Dev to Prod during the promotion since they are inherently different.

Therefore, you cannot say that for promotion from Dev to Prod you can just copy your software. At least that is not true for that configuration file.

This reality makes the promotion process quite sensitive and susceptible to errors. Besides, we must maintain two versions of the application (Dev and Prod) and take care not to mix them up or overwrite one with the other.

Adaptive programming

Adaptability is the ability to adjust to different environments, much like a chameleon changes its color to better blend into its surroundings.

If we can develop a program that automatically adjusts itself to run in either the Development or Production environment, it would mean that we have found a solution making Dream 2 a reality.

You can copy such a program between the environments back and forth, and you don't need to change a thing in it, because it is identical in both Dev and Prod.

You can even store a single version of this adaptive program in a centralized shared location and access it to run by either the Development or Production server. This holds true even though these environments may use different data sources, different data targets, different email distribution lists…

The key for such an implementation is the program’s self-awareness when it knows which environment it is running in, and based on that knowledge adapts to the environment by adjusting its configuration variables.

But these are just words of make-believe promises. Let’s get to the proof.

Identifying the environment your SAS program runs in

Suppose you have SAS software installed in the following two environments:

  • Development, where the SAS application server runs on computer DEVSRV.YOURDOMAIN.COM
  • Production, where the SAS application server runs on computer PRODSRV.YOURDOMAIN.COM

If you have SAS® Enterprise BI Server or SAS® BI Server installed you may have your metadata servers either on the same device/computer/machine/server as the application servers or on separate machines. Let’s say these are separate machines: DEVMETASRV.YOURDOMAIN.COM and PRODMETASRV.YOURDOMAIN.COM.

When we run a program in a particular environment (server), we usually know in which environment we run this program. However, in order to build an adaptive program we need to make sure the program itself knows which environment it is running in.

The key here is the SYSHOSTNAME automatic macro variable that the SAS System conveniently makes available within any SAS session. It contains the host name of the computer that is running the SAS program. In our adaptive program, &SYSHOSTNAME macro variable reference will be equal to either DEVSRV or PRODSRV depending on which environment our program runs in.
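
A quick way to confirm what your server reports is to write the macro variable to the SAS log; a one-line sketch:

%put &=syshostname;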

Making a SAS program adapt to the Development or Production environment

Now, when our program knows its running environment, we can make it adjust to that environment.

Let’s consider a simplified scenario where our program needs to do the following different things depending on the running environment.

Adaptability use case

1. Read a source data table SALES from a metadata-bound library:

  • In DEV: library named ‘SQL CORP DEV’;
  • In PROD: library named ‘SQL CORP’.

2. Produce the target Excel file, report.xlsx, and save it to a specified directory on a non-SAS machine:

  • In DEV: \\DEVOUTSRV\outpath
  • In PROD: \\PRODOUTSRV\outpath

3. Send out an email to a distribution list (contact group) of users when the report is ready:

  • In DEV: developers only;
  • In PROD: business users and developers.

Adaptive code implementation

The following SAS code illustrates this adaptive technique:

/* ----- Configuration section ----- */
 
%if "&syshostname"="DEVSRV" %then
%do; /* Development server */
   %let metasrv = DEVMETASRV;
   %let lname = SQL CORP DEV;
   %let outsrv = DEVOUTSRV;
   %let tolist = "developers@yourdomain.com";
   options symbolgen mprint mlogic fullstimer;
%end;
%else
%if "&syshostname"="PRODSRV" %then
%do; /* Production server */
   %let metasrv = PRODMETASRV;
   %let lname = SQL CORP;
   %let outsrv = PRODOUTSRV;
   %let tolist = "businessusers@yourdomain.com" "developers@yourdomain.com";
   options nosymbolgen nomprint nomlogic nofullstimer;
 
   /* adjust email distribution list based on test_trigger_YYYYMMDD.txt file existence */
   %let yyyymmdd = %sysfunc(date(),yymmddn8.);
   %if %sysfunc(fileexist(&projpath\test_trigger_&yyyymmdd..txt)) %then
      %let tolist = "testers@yourdomain.com";
%end;
%else
%do; /* Unauthorized server */
   %put ERROR: This program is not designed to run on &syshostname server;
   %put ERROR: SAS session has been terminated.;
   endsas;
%end;
 
options 
   metaserver="&metasrv" metaport=8561 metarepository=Foundation metaprotocol=bridge
   metauser='service-account-ID' metapass="encrypted-sas-password";
 
libname SRCLIB meta library="&lname";
 
%let outpath = \\&outsrv\outpath;
%let outname = Report.xlsx;
 
/* ----- End of Configuration section ----- */
 
/* Produce Excel report */
ods excel file="&outpath\&outname";
proc print data=SRCLIB.SOMETABLE;
   where Product='Loan';
run;
ods excel close;
 
/* Send out email */
filename fm email to=(&tolist) from='sender@mydomain.com' subject='Your subject';
 
data _null_;
   file fm;
   put '*** THIS IS AUTOMATICALLY GENERATED EMAIL ***' //
       "Loan report job completed successfully on &syshostname server." /
       "The following Excel file has been generated: &outpath\&outname" //
       'Sincerely,' / 'Your automation team';
run;

Code highlights

As you can see, in the configuration section we conditionally set various macro variables (including &tolist distribution list) and global options depending on where this code runs, in Dev or Prod, determined by the &SYSHOSTNAME macro variable. For unauthorized &SYSHOSTNAME values, we write a relevant message to the SAS log and terminate the SAS session.

Then we establish a connection to the metadata server and assign the source data library SRCLIB as well as the output directory (&outpath) and output file name (&outname).

Based on these conditionally defined macro variables and libref, in the remaining sections we produce the Excel report and send out the email notification.

You can wrap this configuration section into a macro and store it as a separate file; then the main program would just invoke this macro. In either case, the code achieves total environmental awareness and naturally, logically adapts to the environment in which we designed it to function.

This adaptive coding approach ensures 100% equality between Dev and Prod code versions. In fact, the two versions of the code are identical. They are self-contained, self-reliant, and do not require any adjustments during promotion throughout the development life cycle. You can copy them from one environment to another (and vice versa) without fear of accidentally wiping out and replacing the right file with the wrong one.

Can we say that this implementation makes Dream 2 come true?

Testing in the Production environment

Notice a little section dubbed /* adjust email distribution list based on test_trigger_YYYYMMDD.txt file existence */ within the Production Server logic. It allows for triggering a test run, in this case by limiting the scope of email distribution in Prod. Just plant / create a file named test_trigger_YYYYMMDD.txt in a specified location &projpath and run your code on a desired test date YYYYMMDD.  You can delete this file afterwards if a full-scale run is scheduled for the same date or otherwise keep it for future reference (it becomes “harmless” and irrelevant for any other dates). You can use the same trigger file tactic to modify other parts of your code as needed in both Prod and Dev.

Even though this feature does not make Dream 1 come true, it does alleviate its Reality.

Questions? Thoughts? Comments?

Do you find this blog post useful? Do you have questions, concerns, suggestions, or comments? Please share with us below in the Comments section.

Additional Resources

Adaptive SAS programming for the Software Development Life Cycle was published on SAS Users.

October 14, 2022
 

Welcome to the continuation of my series Getting Started with Python Integration to SAS Viya. In previous posts, I discussed how to connect to the CAS server, how to execute CAS actions, and how to summarize columns. Now it's time to focus on how to rename columns in CAS tables.

Load and explore data

In SAS Viya we can use the Samples caslib for demonstration data. I'll load the data into the distributed CAS server using my CAS connection object conn, followed by the loadTable action. The loadTable action loads a server-side file into memory. Then I'll reference the CAS table in the tbl object and view the column information using the columnInfo CAS action. Remember, the data in this example is small for training purposes. Processing data in the CAS server's massively parallel processing environment is typically reserved for larger data.

conn.loadTable(path = 'WATER_CLUSTER.sashdat', caslib = 'samples',
                            casOut = dict(caslib = 'casuser'))
 
tbl = conn.CASTable('water_cluster', caslib='casuser')
 
tbl.columnInfo()

Rename CAS table columns

To rename a column in a CAS table you will need to use the alterTable CAS action. The alterTable action can rename columns, but it can also rename tables, add labels to tables, drop or keep columns, and modify column attributes like column labels. This post will only focus on how use the rename feature of the action.

As of SWAT package version 1.11.0 there is no Pandas equivalent to rename columns. Techniques like the rename method or explicitly assigning a list of new column names to the CASTable object are not available.

Rename a single column

For example, what if you want to rename the Year column to Year_Value? To rename columns with the alterTable action you specify the columns parameter with a list of dictionaries. Each dictionary specifies a column to modify. In this example, specify the columns parameter and add a list with a single dictionary. The dictionary requires the name key to specify the column name and the rename key to specify the new column name. Lastly, execute the columnInfo action to view the updated column names.

tbl.alterTable(columns = [
                {'name':'Year', 'rename':'Year_Value'}
              ])
tbl.columnInfo()

 

Notice in the results the Year column has been renamed to Year_Value.

Rename multiple columns

To rename multiple columns you simply add additional dictionaries to the list. For example, if you want to rename the Month column to Month_Value, and the Day column to Day_Value, add two dictionaries to the list. Here I'll create the list of dictionaries in the variable renameColumns, then use it in the alterTable action.

## Columns to rename
renameColumns = [
    {'name': 'Month', 'rename':'Month_Value'},
    {'name': 'Day', 'rename':'Day_Value'}
]
 
## Rename the CAS table columns
tbl.alterTable(columns = renameColumns)
 
## View the column information of the CAS table
tbl.columnInfo()

In the results notice that the Month column has been renamed to Month_Value, and the Day column has been renamed to Day_Value.

Dynamically rename columns

What if you want to dynamically rename columns? For example, what if you want to remove all underscores in the column names and lowercase all characters? You could manually create a dictionary for each column to rename, but that's a lot of manual work. Instead, you can use a Python list comprehension to create a list of dictionaries with the columns to rename. I'll store the results of the list comprehension in the variable newColumnNames.

newColumnNames = [{'name':colName, 'rename':colName.replace("_","").lower()} for colName in tbl.columns]
display(newColumnNames)
# and the results
[{'name': 'Year_Value', 'rename': 'yearvalue'},
 {'name': 'Month', 'rename': 'month'},
 {'name': 'Day', 'rename': 'day'},
 {'name': 'Date', 'rename': 'date'},
 {'name': 'Serial', 'rename': 'serial'},
 {'name': 'Property', 'rename': 'property'},
 {'name': 'Address', 'rename': 'address'},
 {'name': 'City', 'rename': 'city'},
 {'name': 'Zip', 'rename': 'zip'},
 {'name': 'Lat', 'rename': 'lat'},
 {'name': 'Long', 'rename': 'long'},
 {'name': 'Property_type', 'rename': 'propertytype'},
 {'name': 'Meter_Location', 'rename': 'meterlocation'},
 {'name': 'Clli', 'rename': 'clli'},
 {'name': 'DMA', 'rename': 'dma'},
 {'name': 'Weekday', 'rename': 'weekday'},
 {'name': 'Weekend', 'rename': 'weekend'},
 {'name': 'Daily_W_C_M3', 'rename': 'dailywcm3'},
 {'name': 'Week', 'rename': 'week'},
 {'name': 'US Holiday', 'rename': 'us holiday'},
 {'name': 'CLUSTER', 'rename': 'cluster'}]

Notice it creates a list of dictionaries, one for each column to rename. The list comprehension removes all underscores if they exist in a column name, then lowercases all the characters in the column name.

Now that the list of dictionaries is created, add it to the alterTable action's columns parameter.

"tbl.alterTable(columns = newColumnNames)
tbl.columnInfo()

Summary

The SWAT package blends the world of Pandas and SAS, making available many familiar Pandas techniques to work with your distributed CAS tables. However, familiar Pandas techniques to rename columns are currently not available in the SWAT package. Instead, you can use the alterTable action to easily rename columns.

Additional and related resources

Getting Started with Python Integration to SAS® Viya® - Part 11 - Rename Columns was published on SAS Users.

October 6, 2022
 

Group and aggregate CAS tables

Welcome to the continuation of my series Getting Started with Python Integration to SAS Viya. In previous posts, I discussed how to connect to the CAS serverhow to execute CAS actions, and how to summarize columns. Now it's time to focus on how to group and aggregate CAS tables.

Load and explore data

First, let's load some data. In SAS Viya we can use the Samples caslib for demonstration data. I'll load the data into the distributed CAS server using my CAS connection object conn, followed by the loadTable action. The loadTable action loads a server-side file into memory. Then I'll reference the CAS table in the tbl object and preview 5 rows using the SWAT head method. Remember, the data in this example is small for training purposes. Processing data in the CAS server's massively parallel processing environment is typically reserved for larger data.

conn.loadTable(path = 'WATER_CLUSTER.sashdat', caslib = 'samples',
                            casOut = dict(caslib = 'casuser'))
 
tbl = conn.CASTable('water_cluster', caslib='casuser')
 
tbl.head()

 

The water_cluster CAS table contains daily water usage for a variety of properties. The Daily_W_C_M3 column displays the water used per location in cubic meters (m3), the Serial column identifies each property, and the Weekend column identifies if the reading occurred on the weekend or not.

Using the SWAT groupby method

The SWAT package contains the groupby method. It is defined to match the pandas.DataFrame.groupby() method. For example, I'll specify the CAS table object tbl, then the groupby method and specify the Serial column, and then enclose it with the type function to view the object.

type(tbl.groupby('Serial')) 
# and the results
swat.cas.table.CASTableGroupBy

Notice the results return a CASTableGroupBy object. This works similarly to the pandas DataFrameGroupBy object. This enables you to add aggregation methods to aggregate by groups. For example, what if you want to see the total water consumption for each Serial value?

First, specify the CASTable object tbl, followed by the groupby method to aggregate by each Serial group. Then specify the column to aggregate, Daily_W_C_M3, and then add the sum aggregation method. Here the results are stored in the variable df_serial and displayed.

df_serial = (tbl                    ## CAS table reference
             .groupby('Serial')     ## Group the CAS table
             .Daily_W_C_M3          ## Specify the CAS table column to aggregate
             .sum()                 ## Specify the aggregation
)
 
display(df_serial)

partial results

The code executes in the distributed CAS server and returns a Series object to the client. Once you have a Series on the client you can work with it as you normally would in Pandas.

Let's answer another question: is water consumption higher on the weekend or on weekdays? Using the same technique from above, let's view the mean water consumption on weekends and weekdays. I'll begin by specifying the CASTable object followed by the groupby method to group the Weekend column. The Weekend column indicates if it's a weekend, 1, or a weekday, 0. Then I'll specify the CAS column Daily_W_C_M3 and the mean method. Lastly, after the CAS server processes the data in parallel, it returns a Series to the client. On the client-side Series object I'll chain the rename method to rename the values 0 and 1 to Weekday and Weekend respectively.

(tbl                      ## CAS table reference          
 .groupby('Weekend')      ## Group the CAS table
 .Daily_W_C_M3            ## Specify the CAS table column to aggregate
 .mean()                  ## Specify the aggregation
 .rename({0:'Weekday',    ## Rename the values in the Series object returned from the CAS server on the client
          1:'Weekend'})
)

The results show that water consumption seems to be higher on weekends.

Using the CASTable groupby parameter

You can also achieve the same results using the CASTable object with the groupby parameter. For example, what if I wanted to answer the same question as the previous example about water consumption on weekends vs weekdays?

First, let's display the parameters of the tbl object using the params attribute.

tbl.params
# and the results
{'name': 'water_cluster', 'caslib': 'casuser'}

The results show the CASTable object has the name and caslib parameters. The name and caslib parameters simply reference the CAS table.

Next, add the groupby parameter to the tbl CASTable object with the column or columns to group by. Here I'll specify the Weekend column.

tbl.groupby = 'Weekend'
tbl
# and the results
CASTable('water_cluster', caslib='casuser', groupby='Weekend')

Notice that the CASTable object now contains the groupby parameter with the value Weekend. I'll confirm that the groupby parameter was added to the CASTable object by checking the parameters of the object again.

tbl.params
# and the results
{'name': 'water_cluster', 'caslib': 'casuser', 'groupby': 'Weekend'}

The results show the groupby parameter was added to the CASTable object. Now I can simply use the CASTable object to aggregate by the group. Specify the CASTable object followed by the CAS table column to aggregate. Then add the mean aggregation method to aggregate the data by groups on the CAS server. Lastly, the rename method renames the values in the Series object returned to the client.

(tbl                      ## CAS table reference with the groupby parameter
 .Daily_W_C_M3            ## Specify the CAS table column to aggregate
 .mean()                  ## Specify the aggregation
 .rename({0:'Weekday',    ## Rename the values in the Series object returned from the CAS server on the client
          1:'Weekend'})
)


Notice the results are the same.
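Keep in mind that the groupby parameter stays on the tbl object and will affect every subsequent action you run through it. Here is a minimal sketch of removing it when you are done, treating the params attribute as an ordinary Python dictionary (one of several ways to do this):

## Remove the groupby parameter from the CASTable object
tbl.params.pop('groupby', None)

tbl.params
# and the results
{'name': 'water_cluster', 'caslib': 'casuser'}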

Summary

The SWAT package blends the worlds of pandas and SAS, making many familiar pandas techniques available for working with your distributed CAS tables. This enables you to utilize the massively parallel processing power of the CAS server in SAS Viya using Python. Here we learned about using the familiar pandas groupby method in the SWAT package and the CASTable groupby parameter to aggregate CAS tables by groups.

Additional and related resources

Getting Started with Python Integration to SAS® Viya® - Part 10 - Group and Aggregate CAS Tables was published on SAS Users.

September 20, 2022
 

SAS automation using Windows batch scripts

Let’s consider the following ubiquitous scenario. You have created a SAS program that you want to run automatically on schedule on a SAS server under the Microsoft Windows operating system.

If you have SAS® Enterprise BI Server or SAS® BI Server and Platform Suite for SAS, you can do the scheduling using the Schedule Manager plug-in of SAS Management Console.

But even if you only have SAS Foundation installed on a Windows server (or any Windows machine), you can automate/schedule your SAS program to run in batch using Windows Task Scheduler. In order to do that you would need a Windows batch file.

What are Windows batch files?

A Windows batch file (or batch script) is a text file consisting of a series of Windows commands executed by the command-line interpreter. It can be created using a plain text editor such as Notepad or Notepad++ and saved as a text file with the .bat file name extension. (Do not use a word processor like Microsoft Word for batch script writing!)

Historically, batch files originated in DOS (Disk Operating System), but modern versions of Windows continue to support them. Batch files are executed when you double-click them in Windows File Explorer, or when you type their name in a Command Prompt window and press Enter.

Even though Windows batch files are not as powerful as PowerShell scripts or Unix/Linux scripts, they are still quite versatile and useful for automating processes that run SAS in batch mode.

Besides simply running a series of OS commands one after another, batch scripts can be more sophisticated. For example, they can use environment variables (which are similar to SAS macro variables) and manipulate their values with substring and substitution operations. They can capture the exit code from one command (or SAS program) and then, depending on its value, conditionally execute a command or a group of commands (which might include another SAS program or another batch script). They can even do looping, as sketched below.

They can submit a SAS program to run in batch mode and pass it a parameter string.

They can control where a SAS program saves its log, and generate a dynamic log file name by adding a date/time stamp suffix to it.
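As a quick illustration of the looping capability, here is a minimal sketch that submits every .sas program in a folder in batch mode (the paths reuse the examples below and are hypothetical):

set sas=D:\SAS94\sashome\SASFoundation\9.4\sas.exe

:: loop over all .sas files in the project folder and run each one in batch
for %%f in (C:\Projects\SAS_to_Excel\*.sas) do (
   %sas% -sysin %%f -nosplash -nologo -icon
)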

Windows batch script examples

Let’s explore a couple of examples.

Example 1. Simple script running a SAS program in batch mode

:: SAS executable and project locations
set sas=D:\SAS94\sashome\SASFoundation\9.4\sas.exe
set proj=C:\Projects\SAS_to_Excel
set name=mysasprogram
set pgm=%proj%\%name%.sas
set log=%proj%\logs\%name%.log

:: run the SAS program in batch mode
%sas% -sysin %pgm% -log %log% -nosplash -nologo -icon

Here, we define several environment variables (sas, proj, name, pgm, log) and reference these variables by surrounding their names with %-signs as %variable-name% (analogous to SAS macro variables, which are defined by %let mvar-name = mvar-value; and referenced as &mvar-name).

This script initiates a SAS session in batch mode that executes your SAS program mysasprogram.sas and writes the SAS log to the mysasprogram.log file.
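Assuming you save the script under a name such as run_mysasprogram.bat (a hypothetical name), you can kick it off by double-clicking it in File Explorer or from a Command Prompt:

C:\Projects\SAS_to_Excel> run_mysasprogram.bat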

Example 2. Running a SAS program in batch and date/time stamping the SAS log

Batch scripts are often used to run a SAS program repeatedly at different times. In order to preserve the SAS log for each run, we can assign a unique name to each log file by suffixing it with a date/time. For example, instead of saving the SAS log with the same name mysasprogram.log each time, we can dynamically generate unique names such as mysasprogram_YYYYMMDD_HHMMSS.log, where YYYYMMDD_HHMMSS indicates the date (YYYYMMDD) and time (HHMMSS) of the run. This effectively keeps all SAS logs and indicates when (date and time) each log file was created. Here is a Windows batch script that does it:

:: generate suffix dt=YYYYMMDD_HHMMSS; pad HH with a leading 0
:: (the substring offsets assume a US-style date format such as "Tue 09/20/2022")
set z=%time: =0%
set dt=%date:~-4%%date:~4,2%%date:~7,2%_%z:~0,2%%z:~3,2%%z:~6,2%

set sas=D:\SAS94\sashome\SASFoundation\9.4\sas.exe
set proj=C:\Projects\SAS_to_Excel
set name=mysasprogram
set pgm=%proj%\%name%.sas
set log=%proj%\logs\%name%_%dt%.log
%sas% -sysin %pgm% -log %log% -nosplash -nologo -icon
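Note that the %date% substring parsing above depends on your Windows regional settings. If you need a timestamp that does not depend on the locale, one common sketch uses WMIC's fixed-format LocalDateTime value (this assumes the WMIC utility is available; it ships with most Windows versions, though it is deprecated in the newest ones):

:: locale-independent YYYYMMDD_HHMMSS using WMIC's fixed-format LocalDateTime
for /f %%a in ('wmic os get localdatetime ^| find "."') do set ldt=%%a
set dt=%ldt:~0,8%_%ldt:~8,6%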

Windows batch scripts with conditional execution

Let’s enhance our script by adding the following functionality:

  • Capture the exit code from mysasprogram.sas (exit code 0 means there are no ERRORs or WARNINGs).
  • If the exit code is not equal to 0, conditionally execute another SAS program, my_error_email.sas, which sends an email to designated recipients informing them that mysasprogram.sas failed (a successful-execution email can be sent from mysasprogram.sas itself).

One would expect that this can be achieved by adding the following scripting code to Example 2 above:

:: capture exit code from sas
set exitcode=%ERRORLEVEL%

:: generate email if ERROR and/or WARNING
if not %exitcode% == 0 (
   set ename=my_error_email
   set epgm=%proj%\Programs\%ename%.sas
   set elog=%proj%\SASLogs\%ename%_%dt%.log
   %sas% -sysin %epgm% -log %elog% -nosplash -nologo -icon -sysparm %log%
)

However, you might be in for a big surprise (I was!) when you discover that the my_error_email.sas program runs regardless of whether exitcode is equal or not equal to 0. How is that possible?!

It turns out that Windows script environment variable references in the form %variable-name% do not resolve at execution time the way SAS macro variable references (&mvar-name) or Unix/Linux script variable references ($variable-name) do. They resolve during the initial script parsing, before the run-time exitcode is evaluated. As a result, all the commands within the parentheses of the IF command (including the SAS session kickoff) are resolved and executed unconditionally.
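To see this parse-time behavior in isolation, consider this minimal standalone illustration (a hypothetical script):

set x=before
if "%x%" == "before" (
   set x=after
   echo %x%
)
:: the ECHO prints "before": %x% inside the parentheses was already expanded
:: when the entire IF block was parsed, before SET x=after executed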

Historically, environment variable references in the form %variable-name% have always been resolved during script parsing. When the scripting language was extended with multi-command blocks for conditional execution (the IF command) and looping (the FOR command), the developers kept the original behavior of %variable-name% references intact for backward compatibility, but added a new form of environment variable reference, !variable-name!, surrounding variable names with exclamation marks. They called it "delayed expansion," and it fundamentally changes the behavior of variable references, causing them to resolve (expand) at execution time rather than at parse time.

The following scripting command enables delayed expansion:

SetLocal EnableDelayedExpansion

We can place this SetLocal command right before the IF section and replace the variable references within it with !variable-name!. Alternatively, for consistency, we can place SetLocal EnableDelayedExpansion at the beginning of the script and replace all environment variable references with !variable-name!. In the latter case, all our variables will be resolved at execution time. Here is the final script:

SetLocal EnableDelayedExpansion

:: generate suffix dt=YYYYMMDD_HHMMSS; pad HH with a leading 0
set z=!time: =0!
set dt=!date:~-4!!date:~4,2!!date:~7,2!_!z:~0,2!!z:~3,2!!z:~6,2!

set sas=D:\SAS94\sashome\SASFoundation\9.4\sas.exe
set proj=C:\Projects\SAS_to_Excel
set name=mysasprogram
set pgm=!proj!\!name!.sas
set log=!proj!\logs\!name!_!dt!.log
!sas! -sysin !pgm! -log !log! -nosplash -nologo -icon

:: capture exit code from sas
set exitcode=!ERRORLEVEL!

:: generate email if ERROR and/or WARNING
if not !exitcode! == 0 (
   set ename=my_error_email
   set epgm=!proj!\Programs\!ename!.sas
   set elog=!proj!\SASLogs\!ename!_!dt!.log
   !sas! -sysin !epgm! -log !elog! -nosplash -nologo -icon -sysparm !log!
)

Notice how we pass the log name of the failed mysasprogram.sas to the my_error_email.sas program:

-sysparm !log!

This log name can be captured in the my_error_email.sas program by using the SYSPARM automatic macro variable:

%let failed_log = &sysparm;

Then it can be used either to attach that log file to the automatically generated email, or at least to provide its path and name in the email body.
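For example, here is a minimal sketch of what my_error_email.sas might look like, using the SAS FILENAME EMAIL access method (the recipient address is hypothetical, and this assumes your site's email system options are already configured):

%let failed_log = &sysparm;

/* hypothetical recipient; attach the failed program's log */
filename outmail email
   to      = "oncall@example.com"
   subject = "SAS batch job failed - log attached"
   attach  = "&failed_log";

data _null_;
   file outmail;
   put "The SAS program failed. See the attached log:";
   put "&failed_log";
run;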

Job scheduling

With the Windows batch script file in place, you can easily schedule and run your SAS program in batch mode, even on a machine that has only SAS Foundation installed, using Microsoft Windows Task Scheduler. To do that, you specify your script's fully qualified name in Windows Task Scheduler (Create Task → Actions → New → Program/script field) as shown below:

Creating a new Task in Windows Task Scheduler

Then you would need to specify (add) new Trigger(s) that ultimately define the scheduling rules:

Setting up a new Trigger to define scheduling rules

That’s all. You can now sleep well while your job runs at 3:00 am.

Questions? Thoughts? Comments?

Do you have questions, concerns, comments or use other ways of automating SAS jobs? Please share with us below in the Comments section.

Additional Resources

Automating SAS processes using Windows batch files was published on SAS Users.