Data Management

4月 052023
 

In part 1 of this series, we examined our data before building any models. Among the discoveries was a column that seemed to contain a SAS date value. Here, we will discuss what exactly is meant by a 'SAS date', how to format it correctly, and how to create a new column using it.

What is a SAS Date?

In SAS, a date value is the count of days between January 1, 1960 and any specified day. Dates before that are reflected as negative numbers; after it's represented by positive numbers - for instance, 0 for January 1st, 1960; -365 for January 1st 1959 and 17532 for January 1, 2008.

Additionally, there's another way to represent dates in SAS - datetime values which signify how many seconds have elapsed from 1960 till an hour/minute/second on your chosen date.
For more information about SAS Date, Time and Datetime Values explore the

Being a seasoned SAS User made me aware that these values had the potential to be interpreted as date values when loaded. To further investigate, I applied a format and checked if they were indeed translated into relevant dates for this data set.

HomeEquity = conn.upload_file("https://support.sas.com/documentation/onlinedoc/viya/exampledatasets/home_equity.csv",   
                     casOut={"name":"HomeEquity",
                             "caslib":"casuser",
                             "replace":True},
                     importoptions={"vars":{"APPDATE":{"format":"mmddyy10."}}
                     })  
conn.table.fetch(table='HomeEquity', to=5, format=True)

To view the formatting related to this column, simply append "format=True" to our fetch statement.

With the data being from the past 5 years, these dates seem to be accurate.

The column information data now indicates that MMDDYY is its designated format.

HomeEquity.table.columnInfo()

Using the powerful SAS date format enables us to display data into various formats. For example, when reading the data with a DATE9. format, APPDATE will be presented as such. This is an invaluable benefit that makes working with dates much simpler and more efficient.

The format associated with APPDATE is also updated to reflect the DATE format.

Exploring Date Values

By leveraging SAS date values, we can quickly and easily create output based on the year, month or any other relevant categorization via straightforward Pandas syntax.

Utilize the dt and value_counts options of Pandas to conveniently obtain the number of rows per year.

HomeEquity.APPDATE.dt.year.value_counts()

HomeEquity.shape

(5950,18)

This data currently has 5,950 rows and 18 columns.

By utilizing the Pandas query string, we can filter our data by date.

filtered_tbl = HomeEquity.query("'2019-03-01'd <= APPDATE <= '2019-09-01'd")

This selects a group of rows from our data that have an APPDATE between March 1, 2019 and September 1, 2019.

filtered_tbl.shape

(610,18)

By trimming down the original 5,950 rows and 18 columns to 610 rows and 18 columns, we now have a dataset that is specifically tailored for creating a plot featuring our desired date value.

plt.figure().set_figwidth(15)
filtered_tbl.plot(x="APPDATE", y="LOAN")
plt.show()

Create New Column from a Date Value

Dates are incredibly important, yet for our predictive modeling goals, we cannot simply use dates as inputs. Instead, we generate features or columns from the given date information - such as elapsed time or period between events. For instance, in regards to our dataset; the amount of days since application may prove to be an insightful metric!

If we had additional date values, we could introduce inputs such as the number of days since the last payment and the length of time that any payments were overdue. This type of data would be valuable for understanding customer behaviors related to payments.

As APPDATE is a SAS date value, expressed as the number of days since January 1, 1960, it allows us to use mathematical calculations or SAS numerical functions.

Using Basic Math Formulas

To figure out the days since an application date, subtract the two. In this example, let's calculate how many days have passed since an application for a loan using today's date - we can take advantage of SAS' TODAY() function to generate the appropriate SAS value for now.

HomeEquity.eval('DaysSince = TODAY()-APPDATE')
HomeEquity[['appdate','DaysSince']].summary(subSet=["N","NMISS","MEAN","MIN","MAX"])

To accurately calculate the amount of time elapsed from a particular date, we can use the formatted date shown below.

HomeEquity.eval("DaysSince2 = '31DEC2022'd - APPDATE")

Using SAS Functions

To calculate the number of days between two dates, use SAS's DATDIF function in this format: DATDIF(startdate, enddate, basis).

By specifying a 'basis', you can control how days are counted. Utilizing this powerful date manipulation tool will ensure an accurate and easy-to-understand representation of elapsed time.

HomeEquity.eval("DaysSince3 = DATDIF(APPDATE, '31DEC2022'd, 'Actual')")

With the Python head method, we can take a peek at our three freshly created columns.

HE_Dates= HomeEquity[["APPDATE", "DaysSince", "DaysSince2","DaysSince3"]]
HE_Dates.head()

DaysSince was determined using the TODAY() function - March 15, 2023 at the time of writing this post - and for comparison purposes, DaysSince2 and DaysSince3 were both computed using December 31, 2022 for the end date.

The Wrap-Up: Working with Dates

In conclusion, working with dates in SAS is easy and efficient. Using the DYY format allows for different formats to be used and is an important consideration when reading data into SAS. Additionally, using Pandas syntax along with mathematical formulas or SAS functions make it possible to create new features from date values. This can be especially useful for creating elapsed time or time since an event. By taking advantage of the unique features of SAS dates, users can create meaningful insights from their data.
In the next post, we will learn how impute missing values in our data.

Getting Started with Python Integration to SAS Viya for Predictive Modeling - Working with Dates was published on SAS Users.

3月 202023
 

Welcome to my series on getting started with Python integration to SAS Viya for predictive modeling.

  1. Exploring Data - Learn how to explore the data before fitting a model
  2. Working with Dates - Learn how to format a SAS Date and calculate a new column
  3. Imputing Missing Values - Learn how to replace missing values
  4. Creating Training and Validation Data Sets - Learn a way to split your data into a training and validation data set to be used for modeling
  5. Fitting a Linear Regression - Learn how to fit a linear regression and use your model to score new data
  6. Fitting a Logistic Regression - Learn how to fit a logistic regression and use your model to score new data
  7. Fitting a Decision Tree - Learn how to fit a decision tree and use your decision tree model to score new data
  8. Comparing Logistic Regression and Decision Tree - Which of our models is better at predicting our outcome? Learn how to compare models using misclassification, area under the curve (ROC) charts, and lift charts with validation data.
  9. Fitting a Random Forest - Learn how to fit a random forest and use your model to score new data
  10. Fitting a Gradient Boosting Model - Learn how to fit a gradient boosting model and use your model to score new data
  11. Autotuning Your Model - Learn how to improve your models by utilizing the built in autotuning feature in SAS Viya
  12. Creating new Features Automatically - Learn how to utilize SAS to automatically create your features for your models
  13. Creating Machine Learning Pipelines Automatically - Learn how use AI withing SAS Viya to create machine learning pipelines that start with the data, create new features, fit several models, and select the best model overall.

Getting Started with Python Integration to SAS Viya for Predictive Modeling - Index was published on SAS Users.

3月 202023
 

Welcome to the first post in my series Getting Started with Python Integration to SAS Viya for Predictive Modeling.

I'm going to dive right into the content assuming you have minimal knowledge on SAS Cloud Analytic Services (CAS), CAS Actions and Python. For some background on these subjects, refer to the following:

Now let's get started and learn how to explore the data before fitting a model. Before fitting any models, it is essential to inspect the data and explore questions such as: How accurate is the data? Are there missing values? Is there a possibility of incorrect or faulty information in some columns? Answering these queries can help you ensure that your model works correctly.

Load Data

For this Blog series I will use the home_equity.csv file from the SAS Viya example data library. For additional information on loading data into CAS see Loading a Client-Side CSV File into CAS or Loading Server-Side Files into Memory.

After making a connection to CAS (for more information on connecting to CAS see Making a Connection) load the Home Equity data into memory on CAS. Here the word conn is our connection object to CAS.

HomeEquity = conn.upload_file("https://support.sas.com/documentation/onlinedoc/viya/exampledatasets/home_equity.csv",   
                     casOut={"name":"HomeEquity",
                             "caslib":"casuser",
                             "replace":True})

The data is now not only loaded onto the CAS server, but also into CAS memory, and ready to be explored.

HomeEquity is our CAS table object and will be how I refer to and access the data in the code below.

Before I start building any predictive model, there are always certain questions I make sure to ask regarding my data.

  • How many rows and columns?
  • What type of columns (numeric or categorical)?
  • What do the values look like?
  • Are there any issues I can see visually (think dirty data)?
  • Are there any missing values?

Python (pandas) and CAS Actions provide us with an effective way to execute these tasks. Of course, there are multiple approaches available depending on your preferences. In this tutorial, I will be utilizing CAS Actions primarily because the requests are processed quickly and efficiently on a dedicated server when dealing with huge data sets.

How Many Columns in the Data

To answer the question on the number of columns in the data, let’s use the table action set and the recordCount action.

conn.table.recordCount(table='HomeEquity')

This data has 5,960 rows.

Traditional Python methods are available as well. If you want both the number of columns and number of rows, use the shape method.

display(HomeEquity.shape)

HomeEquity has not only 5,960 rows but also 18 columns.

How to Look at Column Information

Let's delve deeper and investigate which columns our data contains and what types of data each holds. Use the table action set and the columnInfo action.

HomeEquity.table.columnInfo()

Of the 18 columns, we have 6 that are categorical, represented by varchar and the other 12 are numerical represented by double. Some of the latter may be categorical as well, but we will have to do further exploration to determine that. Let’s look.

Review the Data

Let's take a closer look at the values of the columns. By utilizing both the table action set and fetch action, we can glance at the initial five rows with ease.

conn.table.fetch(table='HomeEquity', to=5)

When I look at BAD, I notice its values are 1 and 0. However, REASON and JOB have only one value each. Are there any additional values for these three columns?

I also notice a portion of data is missing, represented as NaN (in numerical form) or an empty field for character values.

It appears that APPDATE is a SAS Date, and we will dive into how to work with Dates in the next post.

Descriptive Statistics

Using descriptive statistics will help us see more of the information in the data: the number of values for categorical, the number of missing for all the columns, and the mean, minimum, and maximum for numerical columns.

Using three different actions (distinct, freq, and summary) from the simple action set, let’s explore the data values further.

Distinct action

The distinct action gives us both the number of unique values for each column and the number of missing values.

conn.simple.distinct(table = HomeEquity)

We have 5 columns with less than 10 unique values, these are potential categorical inputs (often referred to as nominal) for our model. For categorical inputs, we want to minimize the number of levels because more levels mean more complexity for our models.

We also have several columns with missing values. Part 3 of this series will address how to work with these columns.

Freq action

The freq action gives us the frequency of each of the unique values for the individual columns. In this example the inputs = option is used to only look at the frequency for the categorical columns with seven or fewer levels. To keep it simple we will use REGION instead of DIVISION to represent a geographic element in our model.

As you can imagine using freq on all columns would produce a lot of output. 😊

conn.simple.freq(table = 'HomeEquity',
    inputs = ["BAD","REASON","JOB","REGION"])

This output shows that we have missing values for REASON and JOB, but not BAD and REGION. This data also reveals that the columns aren't sparse, as there is a decent proportion of rows for each value.

Summary action

The summary action only creates summary descriptive statistics for numerical columns because the mean of a categorical doesn’t make sense 😊.

In this example, all the summary statistics are calculated and include the minimum (MIN), maximum (MAX), number of rows (N), number missing (NMISS), mean, sum, standard deviation (STD), standard error (STDERR), variance (VAR), Uncorrected Sum of Squares (USS), Corrected Sum of Squares (CSS), Coefficient of variation (CV), t-statistic (TValue), the p-value for statistic (ProbT), Skewness, and Kurtosis.

conn.simple.summary(table='HomeEquity')

With the subSet= option, we quickly get a clearer picture of the descriptive statistics of interest.

conn.simple.summary(table='HomeEquity',
                    subSet=["N","NMISS","MEAN","MIN","MAX"])

By examining the descriptive statistics, we can detect any dirty data present. Thankfully, there appears to be none here! We'll also take note of missing values in each column which is something that will get discussed during Part 3 of this series.

Advanced statistics such as Skewness and Kurtosis may come into play when deciding which columns or transformations to use for inputs, but they won't be utilizing them for this series.

The Wrap-Up: Exploring Data

In conclusion, exploring data is the first and most important step before building a predictive model. Using the SAS CAS action sets for simple descriptive statistics helps us quickly identify missing values and potential categorical inputs for our model. This post focused on how to use the Fetch, Distinct, Freq, and Summary actions from the Simple Action Set. In the next post, we will learn how to work with dates in the data.

 

 

 

Getting Started with Python Integration to SAS Viya for Predictive Modeling - Exploring Data was published on SAS Users.

5月 172022
 

Here at SAS, we understand the importance of having access to cutting-edge professional resources. That’s why, for more than 40 years, we’ve provided individuals in programming, data management and analytics fields with low-cost and no-cost materials that promote success in their educational and professional journeys. And today, as the demand for employees with advanced skill sets and global certifications grows, we get it – having the ability to easily access the tools you need to succeed is more important than ever.

We’ve got you covered.

As part of our ongoing commitment to helping individuals enhance their skills, further their careers and increase their chance of success in the field, we’re now offering SAS Certification Practice Exams for free. Yes, free.

Over the years, candidates who have taken advantage of our practice exams have found them to be a valuable, effective tool for gauging their preparedness for SAS Certification Exams. When combined with other SAS training resources – like webinars, content guides, training courses and web tutorials – these free exams greatly increase candidates’ chances of success. Exams are currently available in:

    • Programming
    • Advanced Analytics
    • Data Management
    • Visual Analytics
    • Administration

Representative of the live exams, our online practice exams go through the same rigorous development process and are designed to give candidates an idea of what they should expect in the actual test questions. SAS practice exams also provide the rationale behind correct and incorrect answers, giving participants even more insight and opportunity for exam success.

And the numbers speak for themselves.*

    • Those who passed the practice exam had a 17.5% higher pass rate on the live exam than those who did not take or pass the practice version.
    • Those who took the practice exam – regardless of score – had an 8.15% higher pass rate on the live exam than those who opted not to take it.

Not to mention taking a practice test just might make the difference between passing the SAS Certification Exam on the first try or having to retake it. We’ve found that many who don’t pass the Certification Exam miss the mark by only a few questions, which they could have avoided with a bit more preparation.

Keep in mind, in order to make the best use of this resource, we recommend taking a practice exam as your final method of study in order to test your preparedness before diving into the actual exam. If you are curious about the types of questions typically on the certification exams, we have sample questions available for you to review.

So, what are you waiting for? A free resource to prepare you for a globally-recognized certification and make your resume stand out from the rest? That’s a no-brainer.

For more information about free SAS Certification Practice Exams, visit: https://www.sas.com/en_us/certification/resources/sas-practice-exams.html

*Based on data from 20,000 practice exams taken since 2020.

Access for success: SAS Certification practice exams now offered for free was published on SAS Users.

2月 112022
 

Ever had to work with data that had crazy column names, custom formats you didn’t have access to, or text columns sized way too big for the actual data they contained? Annoying, isn’t it? Well, you’re not alone and, as it turns out, it’s not that hard to fix. Let’s take a peek at renaming, reformatting, relabeling and resizing columns in your data using PROC SQL and PROC DATASETS. PROC SQL could handle most of these tasks, but you can’t rename a column in SQL without creating a new copy of the table. PROC DATASETS can do it all, but the syntax can be trickier. When we resize a column, we're going to have to re-write the data anyway, so I'll use PROC SQL for resizing, and PROC DATASETS for everything else. For example, we have this existing table:

Year Make Model MSRP Invoice Engine Size (L)
2004 Acura MDX $36,945 $33,337 3.5
2004 Audi A4 1.8T 4dr $25,940 $23,508 1.8
2004 Buick Rainier $37,895 $34,357 4.2

 

with this descriptor portion:

 

Column Name Column Label Column Type Column Length Column Format
Year num 8
Make char 13
Model char 40
MSRP num 8 DOLLAR8.
Invoice num 8 DOLLAR8.
EngineSize Engine Size (L) num 8

 
I want to modify the table to use European formats for the numeric values, to rename the EngineSize column as Size leaving its label unchanged, and to add descriptive labels for MSRP and Invoice.
 
The general syntax for this in PROC DATASETS is:
 
PROC DATASETS LIBRARY=;
MODIFY sas-dataset <(options)>;
FORMAT variable-1 <format-1> <variable-2 <format-2> ...>;
LABEL variable-1=<'label-1' | ' '> <variable-2=<'label-2' | ' '> ...>;
RENAME variable-1=new-name-1 <variable-2=new-name-2 ...>;

After running this code:

proc datasets library=work nolist nodetails;
   modify cars;
      format MSRP eurox8. Invoice eurox8. EngineSize commax5.1;
      label MSRP="Sticker Price" Invoice="Wholesale Price" ;
      rename EngineSize=Size;
run; quit;

 
the data now looks like this:
 

Year Make Model Sticker Price Wholesale Price Engine Size
(L)
2004 Acura MDX €36.945 €33.337 3,5
2004 Audi A4 1.8T 4dr €25.940 €23.508 1,8
2004 Buick Rainier €37.895 €34.357 4,2

 

with this descriptor portion:

 

Column Name Column Label Column Type Column Length Column Format
Year num 8
Make char 13
Model char 40
MSRP Sticker Price num 8 EUROX8.
Invoice Wholesale Price num 8 EUROX8.
Size Engine Size (L) num 8 COMMAX5.1

 
In the report, the longest value in the Make column is 5 characters long but the table metadata shows a column width of 13 characters. I can shorten the Make column to 5 characters without truncating the values, making my table size smaller on disk and in memory.

PROC SQL with an ALTER TABLE statement and a MODIFY clause will change a table’s descriptor portion (metadata) for one or more columns. Changing the length of a column will automatically rewrite the data portion of the table, too. The syntax looks like this:

ALTER TABLE table-name MODIFY column-definition-1 <, column-definition-2, ..., column-definition-n>;

After running this code:

proc sql;
alter table work.cars
   modify Make char(5)
;
quit;

The data is still correct:

Year Make Model Sticker Price Wholesale Price Engine Size
(L)
2004 Acura MDX €36.945 €33.337 3,5
2004 Audi A4 1.8T 4dr €25.940 €23.508 1,8
2004 Buick Rainier €37.895 €34.357 4,2

 
and the column size is now smaller:

Column Name Column Label Column Type Column Length Column Format
Year num 8
Make char 5
Model char 40
MSRP Sticker Price num 8 EUROX8.
Invoice Wholesale Price num 8 EUROX8.
Size Engine Size (L) num 8 COMMAX5.1

 
Now, unless you want to write this type of code by hand every time you need it, you’ll want macros to do the work for you. And macros we’ve got! Check out my SAS tutorial on YouTube titled “Resize, Rename and Reformat Data with SAS Macro” for more details on how this works. The video is also embedded below. Links below the embedded video lead to the data and programs used in the video and to my personal macro trove on GitHub, including a macro that quickly strips labels, formats and informats from a data set.

Until next time, may the SAS be with you!
 
Mark

 

DOWNLOAD NOW | VIDEO DATA PACKAGE ON GITHUB
GET MACROS NOW | MY MACRO STASH ON GITHUB

Jedi SAS tricks: resizing, renaming and reformatting your data was published on SAS Users.

2月 112022
 

Ever had to work with data that had crazy column names, custom formats you didn’t have access to, or text columns sized way too big for the actual data they contained? Annoying, isn’t it? Well, you’re not alone and, as it turns out, it’s not that hard to fix. Let’s take a peek at renaming, reformatting, relabeling and resizing columns in your data using PROC SQL and PROC DATASETS. PROC SQL could handle most of these tasks, but you can’t rename a column in SQL without creating a new copy of the table. PROC DATASETS can do it all, but the syntax can be trickier. When we resize a column, we're going to have to re-write the data anyway, so I'll use PROC SQL for resizing, and PROC DATASETS for everything else. For example, we have this existing table:

Year Make Model MSRP Invoice Engine Size (L)
2004 Acura MDX $36,945 $33,337 3.5
2004 Audi A4 1.8T 4dr $25,940 $23,508 1.8
2004 Buick Rainier $37,895 $34,357 4.2

 

with this descriptor portion:

 

Column Name Column Label Column Type Column Length Column Format
Year num 8
Make char 13
Model char 40
MSRP num 8 DOLLAR8.
Invoice num 8 DOLLAR8.
EngineSize Engine Size (L) num 8

 
I want to modify the table to use European formats for the numeric values, to rename the EngineSize column as Size leaving its label unchanged, and to add descriptive labels for MSRP and Invoice.
 
The general syntax for this in PROC DATASETS is:
 
PROC DATASETS LIBRARY=;
MODIFY sas-dataset <(options)>;
FORMAT variable-1 <format-1> <variable-2 <format-2> ...>;
LABEL variable-1=<'label-1' | ' '> <variable-2=<'label-2' | ' '> ...>;
RENAME variable-1=new-name-1 <variable-2=new-name-2 ...>;

After running this code:

proc datasets library=work nolist nodetails;
   modify cars;
      format MSRP eurox8. Invoice eurox8. EngineSize commax5.1;
      label MSRP="Sticker Price" Invoice="Wholesale Price" ;
      rename EngineSize=Size;
run; quit;

 
the data now looks like this:
 

Year Make Model Sticker Price Wholesale Price Engine Size
(L)
2004 Acura MDX €36.945 €33.337 3,5
2004 Audi A4 1.8T 4dr €25.940 €23.508 1,8
2004 Buick Rainier €37.895 €34.357 4,2

 

with this descriptor portion:

 

Column Name Column Label Column Type Column Length Column Format
Year num 8
Make char 13
Model char 40
MSRP Sticker Price num 8 EUROX8.
Invoice Wholesale Price num 8 EUROX8.
Size Engine Size (L) num 8 COMMAX5.1

 
In the report, the longest value in the Make column is 5 characters long but the table metadata shows a column width of 13 characters. I can shorten the Make column to 5 characters without truncating the values, making my table size smaller on disk and in memory.

PROC SQL with an ALTER TABLE statement and a MODIFY clause will change a table’s descriptor portion (metadata) for one or more columns. Changing the length of a column will automatically rewrite the data portion of the table, too. The syntax looks like this:

ALTER TABLE table-name MODIFY column-definition-1 <, column-definition-2, ..., column-definition-n>;

After running this code:

proc sql;
alter table work.cars
   modify Make char(5)
;
quit;

The data is still correct:

Year Make Model Sticker Price Wholesale Price Engine Size
(L)
2004 Acura MDX €36.945 €33.337 3,5
2004 Audi A4 1.8T 4dr €25.940 €23.508 1,8
2004 Buick Rainier €37.895 €34.357 4,2

 
and the column size is now smaller:

Column Name Column Label Column Type Column Length Column Format
Year num 8
Make char 5
Model char 40
MSRP Sticker Price num 8 EUROX8.
Invoice Wholesale Price num 8 EUROX8.
Size Engine Size (L) num 8 COMMAX5.1

 
Now, unless you want to write this type of code by hand every time you need it, you’ll want macros to do the work for you. And macros we’ve got! Check out my SAS tutorial on YouTube titled “Resize, Rename and Reformat Data with SAS Macro” for more details on how this works. The video is also embedded below. Links below the embedded video lead to the data and programs used in the video and to my personal macro trove on GitHub, including a macro that quickly strips labels, formats and informats from a data set.

Until next time, may the SAS be with you!
 
Mark

 

DOWNLOAD NOW | VIDEO DATA PACKAGE ON GITHUB
GET MACROS NOW | MY MACRO STASH ON GITHUB

Jedi SAS tricks: resizing, renaming and reformatting your data was published on SAS Users.

6月 152021
 

Dealing with big dataIn this fast-paced data age, when the sheer volume of data (generated, collected, and waiting to be processed and analyzed) grows at a breathtaking rate, the speed of data processing becomes critically important. In many cases, if data is not processed within an allotted time frame, we lose all its value as it becomes obsolete and ultimately irrelevant. That is why computing power becomes of the essence.

However, computing power itself does not guarantee timely processing. How we use that power makes all the difference. Way too often good old sequential processing just does not cut it anymore and different computing methods are required. One  such method is parallel processing.

In my previous post Using shell scripts for massively parallel processing I demonstrated a script-centered technique of running in parallel multiple independent SAS processes in SAS environments lacking SAS/CONNECT.

In this post, we will take a shot at a slightly different task and solution. Instead of having several totally independent processes, now we have some common “pre-processing” part, then we run several independent processes in parallel, and then we combine the results of parallel processing in the “post-processing” portion of our program.

Problem: monthly data ingestion use case

For simplification, we are going to use a scenario similar to one in the previous blog post:

Each month, shortly after the end of the previous month we needed to ingest a number of CSV files pertinent to transactions during the previous month and produce daily SAS data tables for each day of the previous month. Only now, we will go a step further: combining all those daily tables into a monthly table.

Solution: combining sequential and parallel processing

The solution is comprised of the three major components:

  • Shell script running the main SAS program.
  • Main SAS program, consisting of three parts: pre-parallel processing, parallel processing, and post-parallel processing.
  • Single thread SAS program responsible for a single day data ingestion.

1. Shell script running main SAS program

Below shell script mainprog.sh runs the main SAS program mainprog.sas:

#!/bin/sh
 
# HOW TO CALL:
# nohup sh /path/mainprog.sh YYYYMM &
 
now=$(date +%Y.%m.%d_%H.%M.%S)
 
# getting YYYYMM as a parameter in script call
ym=$1
 
pgmname=/path/mainprog.sas
logname=/path/saslogs/mainprog_$now.log
sas $pgmname -log $logname -set inDate $ym -set logname $logname

The script is run a background mode as indicated by the ampersand at the end of its invocation command:

nohup sh /path/mainprog.sh YYYYMM &

We pass a parameter YYYYMM (e.g. 202106) indicating year and month of our request.

When we call SAS program mainprog.sas within the script we indicate the name of the SAS log file to be created (-log $logname) and also pass on inDate parameter (-set inDate $ym, which has the same value YYYYMM as parameter specified in the script calling command), and logname parameter (-set logname $logname). As you will see further, we are going to use these two parameters within mainprog.sas program.

2. Main SAS program

Here is an abridged version of the mainprog.sas program:

/* ======= pre-processing ======= */
 
/* parameters passed from shell script */
%let inDate = %sysget(inDate);
%let logname = %sysget(logname);
 
/* year and month */
%let yyyy = %substr(inDate,1,4);
%let mm = %substr(inDate,5,2);
 
/* output data library */
libname SASDL '/data/target';
 
/* number of days in month mm of year yyyy */
%let days = %sysfunc(day(%sysfunc(mdy(&mm+1,1,&yyyy))-1));
 
/* ======= parallel processing ======= */
%macro loop;
   %local threadprog looplogdir logdt workpath tasklist i z threadlog cmd;
   %let threadprog = /path/thread.sas;
   %let looplogdir = %substr(&logname,1,%length(&logname)-4)_logs;
   x "mkdir &looplogdir"; *<- directory for loop logs;
   %let logdt = %substr(&logname,%length(&logname)-22,19);
   %let workpath = %sysfunc(pathname(WORK));
   %let tasklist=;
   %do i=1 %to &days;
      %let z = %sysfunc(putn(&i,z2.));
      %let threadlog = &looplogdir/thread_&z._&logdt..log;
      %let tasklist = &tasklist DAY&i;
      %let cmd = sas &threadprog -log &threadlog -set i &i -set workpath &workpath -set inDate &inDate;
      systask command "&cmd" taskname=DAY&i;
   %end;
 
   waitfor _all_ &tasklist;
 
%mend loop;
%loop
 
/* ======= post-processing ======= */
 
/* combine daily tables into one monthly table */
data SASDL.TARGET_&inDate;
   set WORK.TARGET_&inDate._1 - WORK.TARGET_&inDate._&days;
run;

The key highlights of this program are:

  • We capture values of the parameters passed to the program (inDate and logname).
  • Based on these parameters, assign source directory and target data library SASDL.
  • Calculate number of days in a specific month defined by year and month.
  • Create a directory to hold SAS logs of all parallel threads; the directory name is matching the log name of the mainprog.sas.
  • Capture the WORK library location of the main SAS session running mainprog.sas as:

    %let workpath = %sysfunc(pathname(WORK));We use that location in the thread sessions to pass back to the main session data produced by the thread sessions.

  • Macro %do-loop generates a series of SYSTASK statements to spawn additional SAS sessions in the background mode, each ingesting data for a single day of a month:

    systask command "&cmd" taskname=DAY&i;The SYSTASK statement enables you to execute host-specific commands from within your SAS session or application. Unlike the X statement, the SYSTASK statement runs these commands as asynchronous tasks, which means that these tasks execute independently of all other tasks that are currently running. Asynchronous tasks run in the background, so you can perform additional tasks (including launching other asynchronous tasks) while the asynchronous task is still running.

    Restriction: SYSTASK statement is not supported on the CAS server.

  • Also, we generate a cumulative list of all tasknames assigned to each thread sessions:

    %let tasklist = &tasklist DAY&i;

  • Outside the macro %do-loop we use WAITFOR statement which suspends execution of the main SAS session until the specified tasks finish executing. Since we created a list of all daily thread sessions (&tasklist), this will synchronize all our parallel threads and continue mainprog.sas session only when all threads finished executing.
  • At the end of the main SAS session we concatenate all our daily data tables that have been created by parallel threads in the location of the WORK library of the main SAS session.

Using SAS macro loop to generate a series of SYSTASK statements for parallel processing is not the only method available. Alternatively, you can achieve this within a data step using CALL EXECUTE. In this case, each data step iteration will generate a single global SYSTASK statement and push it out of the data step boundaries where they will be sequentially executed (just like in the case of macro implementation). Since option NOWAIT is the default for SYSTASK statements, despite all of them being launched sequentially, their corresponding OS commands will be still running in parallel.

3. Single thread SAS program

Here is an abridged version of the thread.sas program:

/* inDate parameter */
%let inDate = %sysget(inDate);
 
/* parent program's WORK library */
%let workpath = %sysget(workpath);
libname MAINWORK "&workpath";
 
/* thread number */
%let i = %sysget(i);
 
/* year and month */
%let yyyy = %substr(inDate,1,4);
%let mm = %substr(inDate,5,2);
 
/* source data directory */
%let srcdir = /datapath/&yyyy/&mm;
 
/* create varlist macro variable to list all input variable names */
proc sql noprint;
   select name into :varlist separated by ' ' from SASHELP.VCOLUMN
   where libname='PARMSDL' and memname='DATA_TEMPLATE';
quit;
 
/* create fileref inf for the source file */
filename inf "&srcdir/source_data_&inDate._day&i..cvs";
 
/* create daily output data set */
data MAINWORK.TARGET_&inDate._&i; 
   if 0 then set PARMSDL.DATA_TEMPLATE;
   infile inf missover dsd encoding='UTF-8' firstobs=2 obs=max;
   input &varlist;
run;

This program ingests a single .csv file corresponding to the &i-th day of &inDate (year and month) and creates a SAS data table MAINWORK.TARGET_&inDate._&i. To be available in the main SAS session the MAINWORK library is defined here in the same physical location as the WORK library of the main parental SAS session.

We also use a pre-created SAS data template PARMSDL.DATA_TEMPLATE - a zero-observations data set that contains descriptions of all the variables and their attributes.

Additional resources

Thoughts? Comments?

Do you find this post useful? Do you have processes that may benefit from parallelization? Please share with us below.

Using SYSTASK and SAS macro loops for massively parallel processing was published on SAS Users.

5月 112021
 

It’s safe to say that SAS Global Forum is a conference designed for users, by users. As your conference chair, I am excited by this year’s top-notch user sessions. More than 150 sessions are available, many by SAS users just like you. Wherever you work or whatever you do, you’ll find sessions relevant to your industry or job role. New to SAS? Been using SAS forever and want to learn something new? Managing SAS users? We have you covered. Search for sessions by industry or topic, then add those sessions to your agenda and personal calendar.

Creating a customizable agenda and experience

Besides two full days of amazing sessions, networking opportunities and more, many user sessions will be available on the SAS Users YouTube channel on May 20, 2021 at 10:00am ET. After you register, build your agenda and attend the sessions that most interest you when the conference begins. Once you’ve viewed a session, you can chat with the presenter. Don’t know where to start? Sample agendas are available in the Help Desk.

For the first time, proceedings will live on SAS Support Communities. Presenters have been busy adding their papers to the community. Everything is there, including full paper content, video presentations, and code on GitHub. It all premiers on “Day 3” of the conference, May 20. Have a question about the paper or code? You’ll be able to post a question on the community and ask the presenter.

Want training or help with your code?

Code Doctors are back this year. Check out the agenda for the specific times they’re available and make your appointment, so you’ll be sure to catch them and get their diagnosis of code errors. If you’re looking for training, you’ll be quite happy. Training is also back this year and it’s free! SAS instructor-led demos will be available on May 20, along with the user presentations on the SAS Users YouTube channel.

Chat with attendees and SAS

It is hard to replicate the buzz of a live conference, but we’ve tried our best to make you feel like you’re walking the conference floor. And we know networking is always an important component to any conference. We’ve made it possible for you to network with colleagues and SAS employees. Simply make your profile visible (by clicking on your photo) to connect with others, and you can schedule a meeting right from the attendee page. That’s almost easier than tracking down someone during the in-person event.

We know the exhibit hall is also a big draw for many attendees. This year’s Innovation Hub (formerly known as The Quad) has industry-focused booths and technology booths, where you can interact in real-time with SAS experts. There will also be a SAS Lounge where you can learn more about various SAS services and platforms such as SAS Support Communities and SAS Analytics Explorers.

Get started now

I’ve highlighted a lot in this blog post, but I encourage you to view this 7-minute Innovation Hub video. It goes in depth on the Hub and all its features.

This year there is no reason not to register for SAS Global Forum…and attend as few or as many sessions as you want. Why? Because the conference is FREE!

Where else can you get such quality SAS content and learning opportunities? Nowhere, which is why I encourage you to register today. See you soon!

SAS Global Forum: Your experience, your way was published on SAS Users.

4月 202021
 

I can’t believe it’s true, but SAS Global Forum is just over a month away. I have some exciting news to share with you, so let’s start with the theme for this year:

New Day. New Answers. Inspired by Curiosity.

What a fitting theme for this year! Technology continues to evolve, so each new day is a chance to seek new answers to what can sometimes feel like impossible challenges. Our curiosity as humans drives us to seek out better ways to do things. And I hope your curiosity will drive you to register for this year’s SAS Global Forum.

We are excited to offer a global event across three regions. If you’re in the Americas, the conference is May 18-20. In Asia Pacific? Then we’ll see you May 19-20. And we didn’t forget about Europe. Your dates are May 25-26. We hope these region-specific dates and the virtual nature of the conference means more SAS users than ever will join us for an inspiring event. Curious about the exciting agenda? It’s all on the website, so check it out.

Keynotes speakers that you’ll talk about for months to come

Want to be inspired to chase your “impossible” dreams? Or hear more about the future of AI? How about learning about work-life balance and your mental health? We have you covered. SAS executives are gearing up to host an exciting lineup of extremely smart, engaging and thought-provoking keynote speakers like Adam Grant, Ayesha Khanna and Hakeem Oluseyi.

And who knows, we might have a few more surprises up our sleeve. You’ll just have to register and attend to find out.

Papers and proceedings: simplified and easy to find

Have you joined the SAS Global Forum online community? You should, because that’s where you’ll find all the discussion around the conference…before, during and after. It’s also where you’ll find a link to the 2021 proceedings, when they become available. Authors are busy preparing their presentations now and they are hard at work staging their proceedings in the community. Join the community so you can connect with other attendees and know when the proceedings become available.

Stay tuned for even more details

SAS Global Forum is the place where creativity meets curiosity, and amazing analytics happens! I encourage you to regularly check the conference website, as we’re continually adding new sessions and events. You don’t want to miss this year’s conference, so don’t forget to register for SAS Global Forum. See you soon!

Registration is open for a truly inspiring SAS Global Forum 2021 was published on SAS Users.

2月 252021
 

The people, the energy, the quality of the content, the demos, the networking opportunities…whew, all of these things combine to make SAS Global Forum great every year. And that is no exception this year.

Preparations are in full swing for an unforgettable conference. I hope you’ve seen the notifications that we set the date, actually multiple dates around the world so that you can enjoy the content in your region and in your time zone. No one needs to set their alarm for 1:00am to attend the conference!

Go ahead and save the date(s)…you don’t want to miss this event!

Content, content, content

We are working hard to replicate the energy and excitement of a live conference in the virtual world. But we know content is king, so we have some amazing speakers and content lined up to make the conference relevant for you. There will be more than 150 breakout sessions for business leaders and SAS users, plus the demos will allow you to see firsthand the innovative solutions from SAS, and the people who make them. I, for one, am looking forward to attending live sessions that will allow attendees the opportunity to ask presenters questions and have them respond in real time.

Our keynote speakers, while still under wraps for now, will have you on the edge of your seats (or couches…no judgement here!).

Networking and entertainment

You read that correctly. We will have live entertainment that'll have you glued to the screen. And you’ll be able to network with SAS experts and peers alike. But you don’t have to wait until the conference begins to network, the SAS Global Forum virtual community is up and running. Join the group to start engaging with other attendees, and maybe take a guess or two at who the live entertainment might be.

A big thank you

We are working hard to bring you the best conference possible, but this isn’t a one-woman show. It takes a team, so I would like to introduce and thank the conference teams for 2021. The Content Advisory Team ensures the Users Program sessions meet the needs of our diverse global audience. The Content Delivery Team ensures that conference presenters and authors have the tools and resources needed to provide high-quality presentations and papers. And, finally, the SAS Advisers help us in a multitude of ways. Thank you all for your time and effort so far!

Registration opens in April, so stay tuned for that announcement. I look forward to “seeing” you all in May.

What makes SAS Global Forum great? was published on SAS Users.