data preparation

6月 152022
 

Have you ever heard the phrase “beggars can’t be choosers”? Basically, it means that if you ask for something, be grateful for what you get, especially if you don’t have the means to acquire it yourself. This phrase can be widely applicable to most areas of our lives, but when it comes to preparing data in SAS Viya, beggars can be choosers.

SAS Viya has many tools for preparing your data: you can write code in SAS Studio, create plans in SAS Data Studio, or prepare data within reports in SAS Visual Analytics. Because each of these tools has its own features and functions, you may find yourself wondering, which tool should I use?

The answer, like the answer to many questions, is it depends. What are you trying to do? What skills do you have? How quickly does the data need to be ready to use? How many people will be using the data and how?

The tool you choose will boil down to two factors: (1) simplicity and (2) control. For example, let’s say you need to create a calculated column quickly. The easiest way to create this column is within SAS Visual Analytics. You can create the new column without having to switch tools. SAS Visual Analytics is great for preparing data with little effort (no code needed) and limited control (limited options available).

On the other hand, let’s say you need to perform a more complex action like transposing or appending data sets, or standardizing values. SAS Visual Analytics, while easy to use, doesn’t have the features available for this level of data preparation. Instead, you can create a plan in SAS Data Studio to prepare your data. SAS Data Studio is great for preparing data with moderate effort (little to no code needed) and moderate control (more options available).

If you want to have full control, you can write code in SAS Studio to prepare your data. SAS Studio is great for preparing your data with more effort (may require some coding) and full control (unlimited options).

So, you can really think about it as a tradeoff between simplicity and control.  To perform more complex data preparation tasks, you need more control.

I like to think about it in the context of drawing. Let’s say I want to draw an elephant. Before I start to draw, I need to think about how much time I have available (simplicity) and the level of detail I want to include in my drawing (control).

If I’m playing Pictionary, a charades-inspired word-guessing game where you draw a picture so your teammate can identify a specific word, I want to draw the simplest picture possible so my teammate can guess the word as quickly as possible. I don’t need to create the best-looking picture in the world; I just need my teammate to guess correctly in the shortest amount of time. SAS Visual Analytics is the Pictionary version of my elephant.

On the other hand, if I want to sketch a cute picture of an elephant for my niece’s birthday card, I may want to spend a bit more time on the drawing and add some additional details like eyes and a mouth. SAS Data Studio is the birthday card sketch version of my elephant.

If I’m interested in creating a real-life approximation of an elephant, however, I may want to spend a lot more time and add a lot more detail like shading. SAS Studio is the real-life drawing version of my elephant.

Note: The SAS Visual Analytics version took me about 5 seconds to draw, the SAS Data Studio version took me about 5 minutes, and the SAS Studio version took me an hour.

Now you’re probably wondering, when do I use each tool?

SAS Visual Analytics

Use SAS Visual Analytics in the following scenarios:

  • Modifying properties: modify properties of data items (names, formats, classifications, aggregations).
  • Filters: create subsets of the data for the report (data source filters) or specific objects (basic filters, advanced filters, post-aggregate filters, common filters). Note: Post-aggregate filters, basic filters, and common filters can only be created in SAS Visual Analytics.
  • Calculated items: create a simple calculation, like number of days to delivery (calculated item) or percent of total (aggregated measure) or custom groupings (custom categories). Note: Due to their dynamic nature, aggregated measures can only be created in SAS Visual Analytics.
  • Hierarchies and geography data. Note: Hierarchies and geography data items can only be created in SAS Visual Analytics.
  • A simple join between two tables based on equality condition (for example, T1.Date = T2.OrderDate).
  • Mapping data sources: select a value in an object that uses Table A and filter the associated values in an object that uses Table B (mapping) and the common column between Table A and Table B is used by both objects. Note: Mapping data sources can only be performed in SAS Visual Analytics.
  • Aggregated data sources: create an aggregated version of the table.
  • Data views for ad-hoc analysis: changes made to data in SAS Visual Analytics are available only for the specific report. To apply data changes to other reports that use the same data source, you can create and share a data view, but you cannot use that prepared data outside of SAS Visual Analytics.

SAS Data Studio

Use SAS Data Studio in the following scenarios:

  • Improve data quality with casing, parsing, extraction, identification analysis, generating matchcodes, standardizing, or matching and clustering.
  • To discover and address unknown problems in your data using the Suggestions.
  • Append tables or transpose.
  • Create a join between two or more tables (maximum is 32) based on equality condition (for example, T1.Date = T2.OrderDate) or when the common column between the data sources is calculated (for example, datepart(T1.Date) = T2.OrderDate).
  • Write custom DATA step or CASL code to perform any required data preparation action.
  • To create tables used for many reports, by many analysts, and in different Viya tools. Note: Jobs can be created from plans and scheduled to run at specific times to ensure the data is up-to-date.

Note: You can also modify the properties of data items, create subsets of the data, and create calculated items in SAS Data Studio.

SAS Studio

Use SAS Studio in the following scenarios:

  • Programming with SAS Data Quality: improve data quality programmatically with casing, parsing, extraction, identification analysis, generating matchcodes, standardizing, or matching and clustering.
  • Create a non-equi join (for example T1.Date > T2.EndDate) between two or more tables or when the common column between the data sources is calculated (for example, datepart(T1.Date) = T2.OrderDate).

Note: You can use the SQL procedure or CASL code to perform the join.

Note: The DATA step can be used to combine tables where the name of the column is the same in each table.

  • Programming with the DATA step: perform complex conditional logic, use arrays, or process data iteratively using the DATA step.
  • Programming with SQL: perform any action that requires the use of SQL code (creating subqueries, using set operators, inserting rows into tables, updating values in a table, using DICTIONARY tables, and more).
  • Using SAS procedures: use any SAS procedure not associated with a transform in SAS Data Studio (for example, FORMAT, REPORT, FREQ, UNIVARIATE, and more).
  • Programming with macros: generate dynamic code based on data or user input using the SAS macro language.
  • Create tables used for many reports, by many analysts, and in different Viya tools.

Note: Jobs can be created from SAS code and scheduled to run at specific times to ensure the data is up-to-date.

Note: You can also modify properties of data items, create subsets of the data, create calculated items, create custom groupings, create aggregated tables, append and transpose data, and write custom DATA step or CASL code in SAS Studio.

Summary

The following table summarizes the data preparation actions you can perform in each tool.

Action SAS Visual Analytics SAS Data Studio SAS Studio
Modify properties of data items (names, formats, classifications, aggregations)
 
Filter the data
Filter specific objects in a report
 
Create simple calculations
Create custom groupings
Create dynamic aggregated measures
Create hierarchies
Create geography data items
 
Map data sets to apply interactions between two objects that use different data sources
Perform simple joins between two tables based on equality condition
Perform joins between more than two tables (up to 32) based on equality condition
Perform joins when the common column needs to be calculated
Perform non-equi joins
 
Create an aggregated table
Append tables
Transpose tables
 
Improve data quality with transforms
Improve data quality programmatically
Determine data issues using Suggestions
 
Write DATA step code
Write CASL code
Write SQL code
Use SAS procedures
Use SAS macro language
 
Perform ad-hoc analysis
Create tables used for many reports
Schedule actions to run on a repeatable basis

 

For more information about how to use SAS Visual Analytics to prepare your data and create advanced reports, check out my book Interactive Reports in SAS Visual Analytics: Advanced Features and Customization.

Beggars can’t be choosers…Unless they’re using SAS Viya was published on SAS Users.

5月 242022
 

Higher education institutions are some of the most data-rich entities in the world. Postsecondary leaders need high-quality, consistent and accurate insights to make the best decisions for their institution, students and constituents. Data governance is a topic that may seem technical in nature and perhaps important to only the IT [...]

Exploring 4 components of sound data governance in higher education was published on SAS Voices by Brock Matthews

4月 202022
 

Welcome back to my SAS Users blog series CAS Action! - a series on fundamentals. This post builds upon CAS-Action! Create Columns in CAS Tables - Part 1 by showing how to add formats and modify the length of computed columns.

I'll start by building off the following code where I create three new columns in the PRODUCTS table:

proc cas;
    source createColumns;
        Total_Price = Price * Quantity;
        Product_fix = upcase(Product);
        if Return = "" then Return_fix = "No"; 
           else Return_fix = "Yes";
    endsource;
 
    productsTbl = {name = 'products', 
                   caslib = 'casuser',
                   computedVarsProgram = createColumns
    }; 
 
    table.fetch / table = productsTbl;
quit;

And the results:

Notice in the results the Total_Price column does not have a format, and the Return_fix column truncates the value Yes. Let's fix these issues.

Add Formats to a Computed Column

First, I'll add a format to the Total_Price computed column. To add formats to a computed column use the computedVars sub-parameter. ComputedVars enables you to specify column metadata (labels and formats) for computed columns. The parameter expects a list of dictionaries, even if one column is specified.

For example, in the computedVars sub-parameter I'll add list with a single dictionary that specifies the name of a column Total_Price and apply the SAS dollar format. Next, I'll execute the fetch action to preview the table.

proc cas;
    source createColumns;
        Total_Price = Price * Quantity;
        Product_fix = upcase(Product);
        if Return = "" then Return_fix = "No"; 
           else Return_fix = "Yes";
    endsource;
 
    productsTbl = {name = 'products', 
                   caslib = 'casuser',
                   computedVars = {
                            {name = 'Total_Price', format = 'dollar16.2'}
                   },
                   computedVarsProgram = createColumns
    }; 
 
    table.fetch / table = productsTbl;
quit;

And the results:

Notice in the results above the format was applied to the Total_Price column. However, now the additional computed columns are not displayed in the output. To fix this issue, add all of the computed columns to the computedVars sub-parameter as a dictionary with the name key, even if you do not modify the column metadata.

In the code below, the Product_fix and Return_fix columns were added to the list in the computedVars sub-parameter.

proc cas;
    source createColumns;
        Total_Price = Price * Quantity;
        Product_fix = upcase(Product);
        if Return = "" then Return_fix = "No"; 
           else Return_fix = "Yes";
    endsource;
 
    productsTbl = {name = 'products', 
                   caslib = 'casuser',
                   computedVars = {
                            {name = 'Total_Price', format = 'dollar16.2'},
                            {name = 'Product_fix'},
                            {name = 'Return_fix'}
                   },
                   computedVarsProgram = createColumns
    }; 
 
    table.fetch / table = productsTbl;
quit;

And the results:

Notice in the results that adding each computed column in the computedVars sub-parameter fixed the issue and the computed columns are shown.

However, we still have one more issue. Look at the truncated values in the Return_fix column. How can we fix that?

Modify Computed Column Lengths

When using an IF/THEN/ELSE statement to create a new column in SAS, it sets the column width by using the length of the first assignment statement. In this example, the first assignment statement sets the value of the Return_fix column to No, so the column length is set to 2.

To modify the length of the computed column add a SAS LENGTH statement in the computedVarsProgram parameter prior to the first assignment statement. Here I'll set the Return_fix column to length of 3.

proc cas;
    source createColumns;
        Total_Price = Price * Quantity;
        Product_fix = upcase(Product);
 
        length Return_fix varchar(3);
        if Return = "" then Return_fix = "No"; 
           else Return_fix = "Yes";
    endsource;
 
    productsTbl = {name = 'products', 
                   caslib = 'casuser',
                   computedVars = {
                            {name = 'Total_Price', format = 'dollar16.2'},
                            {name = 'Product_fix'},
                            {name = 'Return_fix'}
                   },
                   computedVarsProgram = createColumns
    }; 
 
    table.fetch / table = productsTbl;
quit;

And the results:

In the results the truncation issues in the Return_fix column are corrected.

Summary

In summary, using the computedVars sub parameter allows you to add formats to the computed columns, and the computedVarsProgram parameter enables you to set the column lengths. A few key points:

  • When using the computedVars sub-parameter, all computed columns from the computedVarsProgram sub-parameter must be added. Otherwise, the computed columns are not returned.
  • The computedVars sub-parameter takes a list of dictionaries as it's value.

Additional resources

fetch Action
SAS® Cloud Analytic Services: Fundamentals
Code
Create Data Code

CAS-Action! Create Columns in CAS Tables - Part 2 was published on SAS Users.

4月 132022
 

Welcome back to my SAS Users blog series CAS Action! - a series on fundamentals. I've broken the series into logical, consumable parts. If you'd like to start by learning a little more about what CAS Actions are, please see CAS Actions and Action Sets - a brief intro. Or if you'd like to see other topics in the series, see the overview page. Otherwise, let's learn how to create calculated columns in CAS tables.

In this example, I will use the CAS procedure to create columns in CAS tables for use in CAS actions. I could execute similar code in Python, R and more with some slight changes to the syntax for the specific language. The other languages also include specific methods to accomplish the same objective.

I'll start by executing the following code to create a CAS table named PRODUCTS in the Casuser caslib with 80 million rows. Then I'll create a variable named productsTbl to reference the PRODUCTS CAS table, and use the variable as the value to the table parameter in the fetch CAS action.

proc cas;
    productsTbl = {name = 'products', caslib = 'casuser'};
 
    table.fetch / table = productsTbl;
quit;

And the response:

Create a Calculated Column

To create a calculated column you add the CAS table reference. The computedVarsProgram parameter specifies an expression for each computed variable and takes a string as it's value. The string can be:

  • an assignment statement that can include most SAS functions.
  • a SAS IF/THEN/ELSE statement
  • a SAS LENGTH or LABEL statement

I'll start by creating a simple calculated column named Total_Price that multiplies Price by Quantity. This calculation triggers when referencing the productsTbl CAS table in the table parameter of a CAS action. Here I'll use the fetch action to view 20 rows of the table.

proc cas;
    productsTbl = {name = 'products', 
                   caslib = 'casuser',
                   computedVarsProgram = 'Total_Price = Price * Quantity;'};
 
    table.fetch / table = productsTbl;
quit;

And the response:

Notice in the results above, the new column Total_Price was created during the execution of the fetch action. The original CAS table was not modified.

Create Multiple Calculated Columns

Next, I'll create a second column by adding a new assignment statement in the string. The second calculated column, Product_fix renders upper case values in the original Product column. In the string you must end each assignment statement with a semicolon. Above all, a new column is required since the computedVarsProgram sub parameter cannot overwrite existing columns in the table. To update values in a CAS table column, use the table.update action.

proc cas;
    productsTbl = {name = 'products', 
                   caslib = 'casuser',
                   computedVarsProgram = 'Total_Price = Price * Quantity;
                                          Product_fix = upcase(Product);'};
 
    table.fetch / table = productsTbl;
quit;

And the response:

Notice in the results above, the fetch action created the Total_Price and Product_fix columns.

Use Conditional Logic

In addition, you can also use a traditional SAS IF/THEN/ELSE statement in the computedVarsProgram sub parameter. For example, I'll create a new column named Return_fix that replaces missing values in the Return column with the value No.

proc cas;
    productsTbl = {name = 'products', 
                   caslib = 'casuser',
                   computedVarsProgram = 'Total_Price = Price * Quantity;
                                          Product_fix = upcase(Product);
                                          if Return = "" then Return_fix = "No"; 
                                              else Return_fix = "Yes";'
    }; 
 
    table.fetch / table = productsTbl;
quit;

And the response:

In the results above there are three new calculated columns: Total_Price, Product_fix and Return_fix.

Storing the Calculated Columns in a Variable

Lastly, instead of adding the calculated columns as a string, I like to store them in a variable using the CAS language SOURCE block statement. Storing the string in the the SOURCE block enables you to embed the calculated columns as a text string and avoid quotation issues that can occur. I'll name my SOURCE block variable createColumns and use it as the value in the computedVarsProgram sub parameter.

proc cas;
    source createColumns;
        Total_Price = Price * Quantity;
        Product_fix = upcase(Product);
        if Return = "" then Return_fix = "No"; 
           else Return_fix = "Yes";
    endsource;
 
    productsTbl = {name = 'products', 
                   caslib = 'casuser',
                   computedVarsProgram = createColumns
    }; 
 
    table.fetch / table = productsTbl;
quit;

And the response:

The results show the same result as the previous section.

While this is great, how can I format the new Total_Price column using the traditional SAS dollar format? Look closely at the Return_fix column. Do you see an issue with the Yes values? Well, stay tuned for part 2!

Summary

In summary, using the computedVarsProgram sub parameter allows you to easily create calculated columns in a CAS tables during the execution of CAS actions. A few key points:

  • The computedVarsProgram sub parameter takes a string as it's value and uses traditional SAS assignment statements, IF/THEN/ELSE statements, and you can also use the LABEL and LENGTH statements.
  • Using the computedVarsProgram sub parameter does not modify the original CAS table. It is simply applied during the execution of a CAS action.
  • If you create a dictionary variable with the CAS table reference, you can then use that variable in a variety of actions within the table parameter.
  • Using a SOURCE block enables you to embed your text in a variable without using a string.

Additional resources

fetch Action
SAS® Cloud Analytic Services: Fundamentals
Code

CAS-Action! Create Columns in CAS Tables - Part 1 was published on SAS Users.

1月 132022
 

3 ways of calculating length of overlapping of two integer intervalsIn this post, we tackle the problem of calculating the length of overlap of two date/time intervals. For example, we might have two events — each lasting several days — and need to calculate the number of shared days between both events. In other words, we need to calculate the length of the events’ overlap.

Such tasks are typical in clinical trials programming, health (and non-health) insurance claim processing, project management, planning and scheduling, etc.

The length of the overlap can be measured not only in days, but also in any other units of the date/time dimension: hours, minutes, seconds, weeks, months, years and so on. Moreover, the date/time application is just one special case of a broader task of calculating the overlap length of two integer intervals. An integer interval [x .. y] is a set of all consecutive integer numbers beginning with x and ending with y (boundaries included and x ≤ y).

Instead of suggesting a single “best” way of solving this problem, I offer three different strategies. This allows you to compare and weigh their pros and cons and decide for yourself which approach is most suitable for your circumstances. All three solutions presented in this blog post are applicable to the date/time use case, as well as its superset of integer intervals’ overlap length calculation.

Problem description

Suppose we have two events A [A1 .. A2] and B [B1 .. B2]. Event A lasts from date A1 until date A2 (A1 ≤ A2), and event B lasts from date B1 until date B2 (B1 ≤ B2).  Both starting and ending dates are included in the corresponding event intervals.

We need to calculate the overlap between these two events, defined as the number of days that belong to both events A and B.

Overlapping intervals: sample data

Before solving the problem, let’s create sample data to which we are going to apply our possible solutions.

The following data step creates data set EVENTS representing such a data sample:

data EVENTS;
   input A1 A2 B1 B2;
   format A1 A2 B1 B2 date9.;
   informat A1 A2 B1 B2 mmddyy10.;
   lines;
01/02/2022 01/05/2022 01/06/2022 01/10/2022
01/22/2022 01/30/2022 01/16/2022 01/18/2022
01/02/2022 01/05/2022 01/03/2022 01/10/2022
01/02/2022 01/05/2022 01/03/2022 01/04/2022
01/10/2022 01/15/2022 01/06/2022 01/14/2022
01/01/2022 01/05/2022 01/05/2022 01/09/2022
01/07/2022 01/13/2022 01/10/2022 01/13/2022
;

Now let’s go over the following three distinct possible solutions, each with its own merit and downside. At the end, you decide which solution you like the most.

Solution 1: Brute force

In general, brute force algorithms solve problems by exhaustive iteration through all possible choices until a solution is found. It rarely results in a clever, efficient solution. However, it is a straightforward, valid and important algorithm design strategy that is applicable to a wide variety of problems.

For integer intervals overlap calculation, a brute force approach would involve the following steps:

  1. Determine the earliest date in the definitions of our two intervals
  2. Determine the latest date in the definitions of our two intervals
  3. Create an iterative loop with index variable ranging from the earliest to the latest date
  4. Within this loop, increment a counter by one if index variable falls within both intervals
  5. At the end of this loop, the counter will equal the desired value for the overlap.

Here is the code implementation of the described above algorithm:

data RESULTS;
   set EVENTS;
   OVERLAP = 0;
   do i=min(A1,B1) to max(A2,B2);
      if (A1<=i<=A2) and (B1<=i<=B2) then OVERLAP + 1;
   end;
run;

As you can see, the code implementation of the brute force solution is quite simple. However, it is hardly efficient as it involves extensive looping.

Solution 2: Exhaustive logic

Exhaustive logic algorithms solve problems by splitting them into several smaller problems (logical units or classes). This way, instead of working on a problem in its entirety, you can work separately on smaller and simpler problems that are easier to comprehend, digest and solve.

Here is an illustration of various arrangements of integer intervals representing separate logical units (numbered 1...5) when calculating the intervals overlaps:

Different arrangements of date/time integer intervals

The following SAS code demonstrates an implementation of the exhaustive logic algorithm (each IF-THEN statement corresponds to the numbered logical unit in the above figure):

data RESULTS;
   set EVENTS;
   if A1<=B1 and A2>=B2    then OVERLAP = B2 - B1 + 1; else
   if A1>=B1 and A2<=B2    then OVERLAP = A2 - A1 + 1; else
   if A1<=B1<=A2 and B2>A2 then OVERLAP = A2 - B1 + 1; else
   if B1<=A1<=B2 and A2>B2 then OVERLAP = B2 - A1 + 1; else
   if B1>A2 or A1>B2       then OVERLAP = 0;
run;

Notice that we need to add 1 to each of the two days difference. That is because the number of days spanned is always 1 less than the number of days contained in an interval. Here is an illustration:

Days spanned vs. number of days

Solution 3: Holistic Math

Finally, a holistic math method looks at the problem in its entirety and condenses it to a comprehensive mathematical formula. While it is usually not the most obvious approach, and often requires some brain stretching and internalization, the effort is often rewarded by achieving the best possible solution.

The exhaustive logic solution may help in better understanding the problem and arriving at the generalized mathematical formula.

If you dwell on the exhaustive logic solution for a while, you may come to realization that all the variety of the logical units boils down to the following single simple formula:

Overlap = min(A2, B2) - max(A1, B1) + 1.

In other words, the overlap of two integer intervals is a difference between the minimum value of the two upper boundaries and the maximum value of the two lower boundaries, plus 1.

Positive OVERLAP value represents actual overlap days, while zero or negative value means that intervals do not overlap at all. The higher absolute value of the negative result – the farther apart the intervals are.  To correct this “negativity” effect we can just set all negative values to zero.

The following SAS code implements the described holistic math approach:

data RESULTS;
   set EVENTS;
   OVERLAP = min(A2,B2) - max(A1, B1) + 1;
   if OVERLAP<0 then OVERLAP = 0;
run;

As you can see, this in the most concise, elegant, and at the same time the most efficient solution with no iterative looping and complicated exhaustive (and exhausting) logic.

Output

Here is the output data table RESULTS showing our sample intervals (columns A1, A2, B1 and B2) and the resulting column OVERLAP (all three solutions described above should produce identical OVERLAP values):

Output data set with calculated days overlaps

Questions? Thoughts? Comments?

Which of the presented here three methods do you like most? Which one would you prefer under different circumstances? Can you come up with yet another way of calculating date intervals overlaps? Do you have questions, concerns, comments? Please share with us below.

Additional resources

Calculating the overlap of date/time intervals was published on SAS Users.

2月 012021
 

Do you want to spend less time on the tedious task of preparing your data? I want to tell you about a magical and revolutionary SAS macro called %TK_codebook. Not only does this macro create an amazing codebook showcasing your data, it also automatically performs quality control checks on each variable. You will easily uncover potential problems lurking in your data including variables that have:

  • Incomplete formats
  • Out of range values
  • No variation in response values
  • Variables missing an assigned user-defined format
  • Variables that are missing labels

All you need is a SAS data set with labels and formats assigned to each variable and the accompanying format catalogue. Not only will this macro change the way you clean and prepare your data, but it also gives you an effortless way to evaluate the quality of data you obtain from others before you start your analysis. Look how easy it is to create a codebook if you have a SAS data set with labels and formats:

title height=12pt 'Master Codebook for Study A Preliminary Data';
title2 height=10pt 'Simulated Data for Participants in a Health Study';
title3 height=10pt 'Data simulated to include anomalies illustrating the power of %TK_codebook';
 
libname library "/Data_Detective/Formats/Blog_1_Codebooks";
 
%TK_codebook(lib=work,
	file1=STUDYA_PRELIM,
	fmtlib=LIBRARY,
	cb_type=RTF,
	cb_file=/Data_Detective/Book/Blog/SAS_Programs/My_Codebook.rtf,
	var_order=INTERNAL,
	organization = One record per CASEID,
	include_warn=YES;
run;

Six steps create your codebook

After creating titles for your codebook, this simple program provides %TK_codebook with the following instructions:

  1. Create a codebook for SAS data set STUDYA_PRELIM located in the temporary Work library automatically defined by SAS
  2. Find the formats assigned to the STUDYA_PRELIM in a format catalogue located in the folder assigned to the libref LIBRARY
  3. Write your codebook in a file named /Data_Detective/Book/Blog/SAS_Programs/My_Codebook.rtf
  4. List variables in the codebook by their INTERNAL order (order stored in the data set)
  5. Add “One record per CASEID” indicating which variable(s) uniquely identify each observation to codebook header
  6. Include reports identifying potential problems in the data

Just these few lines of code will create the unbelievably useful codebook shown below.

The data set used has many problems that can interfere with analysis. %TK_codebook creates reports showing a concise summary of only those problem variables needing close examination. These reports save you an incredible amount of time.

Using assigned formats, %TK_codebook identifies unexpected values occurring in each variable and provides a summary in the first two reports.

Values occurring outside those defined by the assigned format indicate two possible problems:

  1. A value was omitted from the format definition (Report 1 – Incomplete formats)
  2. The variable has unexpected values needing mitigation before the data is analyzed (Report 2 – Out of Range Value)

The next report lists numeric variables that have no variation in their values.

These variables need examining to uncover problems with preparing the data set.

The next two reports warn you about variables missing an assigned user-defined format. These variables will be excluded from screening for out-of-range values and incomplete format definitions.

The last report informs you about variables that are missing a label or have a label that matches the variable name.

It is easy to use %TK_codebook to resolve problems in your data and create an awesome codebook. Instead of spending your time preparing your data, you will be using your data to change the world!

Create your codebook

Download %TK_codebook from my author page, then learn to use it from my new book, The Data Detective’s Toolkit: Cutting-Edge Techniques and SAS Macros to Clean, Prepare, and Manage Data.

THE DATA DETECTIVE'S TOOLKIT | BUY IT NOW

Creating codebooks with SAS macros was published on SAS Users.

3月 162020
 

As a long-time SAS 9 programmer, I typically accomplish my data preparation tasks through some combination of the DATA Step, Proc SQL, Proc Transpose and some housekeeping procs like Proc Contents and Proc Datasets. With the introduction of SAS Viya, SAS released a new scripting language called CASL – a language that interacts with SAS Cloud Analytics Services (CAS).

CASL statements include actions, logically organized into action sets based on common functionality. For example, the Table action set allows you to load a table in CAS, view table metadata, change table metadata such as drop or rename a column, fetch (print) sample rows, save or drop a table from CAS, among other things. Steven Sober provides a great overview of CASL in his 2019 SAS Global Forum paper.

Learning CASL is a good idea assuming you want to leverage the power of CAS, because CASL is the language of CAS. While you can continue to use Viya-enabled procs for many of your data processing needs, certain new functionality is only available through CASL. CAS actions also provide a more granular access to options which otherwise may not be available as procedure options. But old habits die hard, and for a while I found myself bouncing between SAS 9.4 and CASL. I'd pull the data down from CAS just to get it to process in the SAS Programming Runtime Environment (SPRE) because it took less effort than figuring out how to get it done properly in CAS.

Then I started a project with a seriously large data set and quickly hit the limit on how much data I could pull down to process in SPRE. And although I could adjust the DATALIMIT option to retrieve more data than the default limit, I was wasting time and server resources unnecessarily moving the data between CAS and SPRE. All this, just so I could process the data “old school.”

I decided to challenge myself to do ALL my data preparation in CASL. I love a good challenge! I started collecting various useful CASL code snippets. In this post, I am sharing the tidbits I’ve accumulated, along with some commentary. Note, you can execute CAS actions from multiple clients, including SAS, Python, R, Lua and Java. Since my objective was to transition from traditional SAS code to CASL, I’ll focus solely on CAS actions from the SAS client perspective. While I used SAS Viya 3.5 for this work, most of the code snippets should work on prior versions as well.

The sections below cover: how to submit CASL code; loading, saving, dropping and deleting data; exploring data; table metadata management; and data transformation. Feel free to jump ahead to any section of interest.

How do you submit CASL code?

You use PROC CAS to submit CASL code from a SAS client. For example:

proc cas;
   <cas action 1>;
   <cas action 2>;
   …;
quit;

Similarly to other interactive procs that use run-group processing, separate CAS actions by run; statements. For example:

proc cas;
   <cas action 1>;
   run;
   <cas action 2>;
   run;
quit;

In fact, you can have the entire data preparation and analysis pipeline wrapped inside a single PROC CAS, passing data and results in the form of CASL variables from one action to the next. It can really be quite elegant.

Moving Data Using PROC CAS

Loading SASHDAT data in CAS

Your data must be in the SASHDAT format for CAS to process it. To load a SASHDAT table into CAS, use the table.loadtable CAS action. The code below assumes your SASHDAT table is saved to a directory on disk associated with your current active caslib, and you are loading it into the same caslib. (This usually occurs when you already performed the conversion to SASHDAT format, but the data has been unloaded. If you are just starting out and are wondering how to get your data into the SASHDAT format in the first place, the next session covers it, so keep reading.)

proc cas; 
     table.loadtable / path="TABLE_NAME.sashdat" casOut="TABLE_NAME"; 
     table.promote /name="TABLE_NAME" drop=true; 
quit;

The table.promote action elevates your newly loaded CAS table to global scope, making it available to other CAS sessions, including any additional sessions you start, or to other users assuming they have the right privileges. I can’t tell you how many times I forgot to promote my data, only to find that my hard-earned output table disappeared because I took a longer coffee break than expected! Don’t forget to promote or save off your data (or both, to be safe).

If you are loading from a directory other than the one associated with your active caslib, modify the path= statement to include the relative path to the source directory – relative to your active caslib. If you are looking to load to a different caslib, modify the casOut= statement by placing the output table name and library in curly brackets. For example:

proc cas;
    table.loadtable / path="TABLE_NAME.sashdat" casOut={name="TABLE_NAME" caslib="CASLIB2"};
    table.promote /name="TABLE_NAME" drop=true;
quit;

You can also place a promote=true option inside the casOut= curly brackets instead of calling the table.promote action, like so:

proc cas;
    table.loadtable / path="TABLE_NAME.sashdat" 
                      casOut={name="TABLE_NAME" caslib="CASLIB2" promote=true};
quit;

Curly brackets are ubiquitous in CASL (and quite unusual for SAS 9.4). If you take away one thing from this post, make it “watch your curly brackets.”

Loading SAS7BDAT, delimited data, and other file formats in CAS

If you have a SAS7BDAT file already on disk, load it in CAS with this code:

proc cas;
    table.loadtable /path="TABLE_NAME.sas7bdat" casout="TABLE_NAME" 
                     importoptions={filetype="basesas"};
quit;

Other file formats load similarly – just use the corresponding filetype= option to indicate the type of data you are loading, such as CSV, Excel, Document (.docx, .pdf, etc.), Image, Video, etc. The impressive list of supported file types is available here.

proc cas;
    table.loadtable / path="TABLE_NAME.csv" casout="TABLE_NAME" 
                      importoptions={filetype="csv"};
    run;
quit;

You can include additional parameters inside the importOptions= curly brackets, which differ by the file type. If you don’t need any additional parameters, use the filetype=”auto” and let CAS determine the best way to load the file.

When loading a table in SAS7BDAT, delimited or some other format, the table.loadtable action automatically converts your data to SASHDAT format.

Loading data in CAS conditionally

Imagine you are building a script to load data conditionally – only if it’s not already loaded. This is handy if you have a reason to believe the data might already be in CAS. To check if the data exists in CAS and load conditionally, you can leverage the table.tableExists action in combination with if-then-else logic. For example:

proc cas;
    table.tableExists result =r / name="TABLE_NAME";
    if r=0  then do;
        table.loadtable / path="TABLE_NAME.sashdat" casOut={name="TABLE_NAME"};
        table.promote /name="YOUR_TABLE_NAME" drop=true;
    end;
    else print("Table already loaded");
quit;

Notice that the result=r syntax captures the result code from the tableExists action, which is evaluated before the loadtable and promote actions are executed. If the table is already loaded in CAS, “Table already loaded” is printed to the log. Otherwise, the loadtable and promote actions are executed.

The ability to output CAS action results to a CASL variable (such as result=r in this example) is an extremely powerful feature of CASL. I include another example of this further down, but you can learn more about this functionality from documentation or this handy blog post.

Saving your CAS data

Let’s pretend you’ve loaded your data, transformed it, and promoted it to global scope. You or your colleagues can access it from other CAS sessions. You finished your data preparation, right? Wrong. As the header of this section suggests, you also need to save your prepared CAS data. Why? Because up to this point, your processed and promoted data exists only in memory. You will lose your work if your SAS administrator reboots the server or restarts the CAS controller. If you need to quickly reload prepared data, you must back it up to a caslib’s data source. See the CAS data lifecycle for more details.

To save off CAS data, naturally, you use the table.save action. For example:

proc cas;
    table.save / table="TABLE_NAME" name="TABLE_NAME.sashdat" replace=true;
quit;

In this example, you save off the CAS table to disk as a SASHDAT file, defaulting to the location associated with your active caslib. You can modify the table.save parameters to save or export the data to an alternative data storage solution with full control over the file format (including but not limited to such popular options as HDFS, Oracle, SQL Server, Salesforce, Snowflake and Teradata), compression, partitioning and other options.

Dropping and deleting data

To drop a table from CAS, execute a table.droptable action. For example:

proc cas;
    table.droptable / name="TABLE_NAME" quiet=true;
quit;

The quiet=true option prevents CAS from generating an error if the table does not exist in CAS. Dropping a table deletes it from memory. It’s a good practice to drop tables you no longer need, particularly the one you have promoted. Local-scope tables disappear on their own when the session expires, whereas global tables will stay in memory until they are unloaded.

Dropping a table does not delete the underlying source data. To delete the source of a CAS table, use the table.deleteSource action. For example:

proc cas;
    table.deletesource / source="TABLE_NAME.sashdat" quiet=true;
quit;

Exploring Data Using PROC CAS

After taking a close look at moving the data using PROC CAS, let’s look at some useful ways to start exploring and manipulating CAS data.

Fetching sample data

When preparing data, I find it useful to look at sample data. The table.fetch action is conceptually similar to PROC PRINT and, by default, outputs the first 20 rows of a CAS table:

proc cas;
table.fetch / table="Table_Name";
quit;

You can modify the table.fetch options to control which observations and variables to display and how to display them. For example:

proc cas;
table.fetch / table={name="TABLE_NAME" where="VAR1 in ('value1','value2')"},              /*1*/
	      orderby={{name="VAR1"},                                                     /*2*/
                             {name="VAR2", order="descending"}
                             },	
	     fetchvars={{name="VAR1", label="Variable 1"},                                /*3*/
	                     {name="VAR2", label="Variable 2"}, 
 		             {name="VAR3", label="Variable 3", format=comma12.1}
                             },
	     to=50,                                                                       /*4*/
	     index=false;							          /*5*/
quit;

In the code snippet above:

  • #1 – where= statement limits the records to those meeting the where criteria.
  • #2 – orderby= option defines the sort order. Ascending is the default and is not required. If sorting by more than one variable, put them in a list inside curly brackets, as shown in this example. If a list item has a subparameter (such as order= here), encase each item in curly brackets.
  • #3 – fetchvars= option defines the variables to print as well as their display labels and formats. If you select more than one variable, put them in a list inside curly brackets, as shown here. And again, if a list item includes a subparmeter, then enclose each list item in curly brackets.
  • #4 – to= option defines the number of rows to print.
  • #5 – index= false option deactivates the index column in the output (the default is index=true). This is similar to the noobs option in PROC PRINT.

As mentioned earlier, make sure to watch your curly brackets!

Descriptive statistics and variable distributions

The next step in data exploration is looking at descriptive statistics and variable distributions. I would need a separate blog post to cover this in detail, so I only touch upon a few of the many useful CAS actions.

To look at statistics for numeric variables, use the simple.summary action, which computes standard descriptive statistics, such as minimum, maximum, mean, standard deviation, number missing, and so on. For example:

proc cas;
    simple.summary / table="TABLE_NAME";
quit;

Among its other features, the simple.summary action allows analysis by one or more group-by variables, as well as define the list of desired descriptive statistics. For example:

proc cas;
simple.summary / table={name="TABLE_NAME", groupBy="VAR1", vars={"NUMVAR1","NUMVAR2”}},
                 subSet={"MAX", "MIN", "MEAN", "NMISS"};
quit;

Another useful action is simple.topK, which selects the top K and bottom K values for variables in a data set, based on a user-specified ranking order. The example below returns the top 5 and bottom 5 values for two variables based on their frequency:

proc cas;
simple.topk / table="TABLE_NAME" 
              aggregator="N",                        
              inputs={"VAR1","VAR2"},
              topk=5,
              bottomk=5;
quit;

Simple is a rich action set with heaps of useful options covered in the documentation.

You may be wondering – what about crosstabs and frequency tables? The simple action set includes freq and crosstab actions. In addition, the action closely imitating the functionality of the beloved PROC FREQ is freqTab.freqTab. For example, the code snippet below creates frequency tables for VAR1, VAR2 and a crosstab of the two.

proc cas;
freqtab.freqtab / table="TABLE_NAME"
                  tabulate={"VAR1","VAR2",
                  {vars={"VAR1","VAR2"}}
                  };
quit;

Managing CAS Table Variables

Changing table metadata

One of the basic tasks after exploring your data is changing table metadata, such as dropping unnecessary variables, renaming tables and columns, and changing variable formats and labels. The table.altertable action helps you with these housekeeping tasks. For example, the code snippet below renames the table, drops two variables and renames and changes labels for two variables:

proc cas;
    table.altertable / table="TABLE_NAME" rename="ANALYTIC_TABLE"
                       drop={"VAR1",”VAR2”}
                       columns={{name="VAR3" rename="ROW_ID" label="Row ID"},
                                {name="VAR4" rename="TARGET" label="Outcome Variable"}
                                }
                       ;
quit;

Outputting variable list to a data set

Another useful trick I frequently use is extracting table columns as a SAS data set. Having a list of variables as values in a data set makes it easy to build data-driven scripts leveraging macro programming. The code snippet below provides an example. Here we encounter another example of capturing action result as a CASL variable and using it in further processing – I can’t stress enough how helpful this is!

proc cas;
    table.columninfo r=collinfo / table={name="TABLE_NAME"};       /*1*/
    collist=collinfo["ColumnInfo"];                                /*2*/
    saveresult collist casout="collist";                           /*3*/
quit;

In the snippet above:

  • #1 - the columninfo action collects column information. The action result is passed to a CASL variable collinfo. Notice, instead of writing out result=, I am using an alias r =.
  • #2 - the portion of the a CASL variable collinfo containing column data is extracted into another CASL variable collist.
  • #3 - the saveresult statement sends the data to a CAS table collist. If you want to send the results to a SAS7BDAT data set, replace casout= with dataout=, and provide the library.table_name information.

Transforming the Data

Lastly, let’s look at some ways to use CAS actions to transform your data. Proc SQL and DATA step are the two swiss-army knives in SAS 9 developers’ toolkit that take care of 90% of the data prep. The good news is you can execute both DATA Step and SQL directly from PROC CAS. In addition, call the transpose action to transpose your data.

Executing DATA Step code

The dataStep.runCode action enables you to run DATA step code directly inside PROC CAS. You must enclose your DATA step code in quotation marks after the code= statement. For example:

proc cas;
    dataStep.runCode /
    code="
        data table_name;
        set table_name;
        run;
        ";
quit;

Running DATA step code in CAS allows access to sophisticated group-by processing and the use of such popular programming techniques as first- and last-dot. Refer to the documentation for important nuances related to processing in a distributed, multi-threaded environment of CAS.

Executing FedSQL

To run SQL in CASL, use the fedSQL.execDirect action. Enclose the SQL query in quotation marks following the query= statement. Optionally, you can use the casout= statement to save the results to a CAS table. For example:

proc cas;
    fedsql.execDirect/
    query=
         "
          select  *
          from TABLE1 a  inner join TABLE2 b
          on a.VAR1 = b.VAR1
         "
    casout={name="TABLE3", replace=True};
quit;

Similarly to DATA step, be aware of the many nuances when executing SQL in CAS via FedSQL. Brian Kinnebrew provides an excellent overview of FedSQL in his SAS Communities article, and the documentation has up-to-date details on the supported functionality.

Transposing data

Transposing data in PROC CAS is a breeze. The example below uses transpose.transpose action to restructure rows into columns.

proc cas;
    transpose.transpose /
          table={name="TABLE_NAME", groupby={"VAR1"}}
          transpose={"VAR2"}
          id={"VAR3"}
          prefix="Prefix"
    casout={name="TRANSPOSED" replace=true};
run;

You can transpose multiple variables in the same transpose action. Simply place additional variables inside the curly brackets following transpose=, in quotes, separated by a comma.

Conclusion

PROC CAS is a wrapper procedure enabling you to leverage SAS’ new programming language - CASL. CASL enables you to submit CAS actions directly to SAS Cloud Analytic Services engine from a SAS client. This post provided examples of loading, managing, exploring and transforming your data through CAS actions. Certain new functionality in CAS is only available through CAS actions, so getting comfortable with CASL makes sense. Fear not, and let the curly brackets guide the way 😊.

Acknowledgement

I would like to thank Brian Kinnebrew for his thoughtful review and generous help on my journey learning CASL.

Challenge accepted: Learning data prep in CASL was published on SAS Users.