CAS

11月 142018
 

Prior to SAS Viya

With the creation of SAS Viya, the ability to run DATA Step code in a distributed manner became a reality. Prior to distributed DATA Step, DATA Step programmers never had to think about achieving repeatable results when SAS7BDAT datasets were the sources to their DATA Step code that contains a BY statement. This is because prior to SAS Cloud Analytics Services (CAS), DATA Step ran single-threaded and the source SAS7BDAT dataset was stored on disk. Every time one would run the code we obtained repeatable results because the sequence of rows within the BY group were preserved between runs. To illustrate this, review figures 1, 2, and 3.

Figure 1 is the source SAS7BDAT dataset WORK.TEST1. Notice the sequence of VAR2, especially on row 1 and 4 (i.e., _N_ =1 and 4).

_n_ VAR1 VAR2
1 1 N
2 1 Y
3 1 Y
4 2 Y
5 2 Y
6 2 N


Figure 1. WORK.TEST1 the original SAS7BDAT dataset

In figure 2, we see a BY statement with variable VAR1. This will ensure VAR1 is in ascending order. We are also using FIRST. processing to identify the first occurrence of the BY group. Because this data is stored on disk and because the DATA Step is executed using a single thread, the result table will be repeatable no matter how many times we run the DATA Step code.

Figure 2. Focus on the IF statement, especially VAR2

In figure 3, we see the output SAS7BDAT dataset WORK.TEST2.

_n_ VAR1 VAR2
1 1 N

Figure 3. WORK.TEST2 result dataset from running the code in Figure 2

In figure 4, we are running the same DATA Step but this time our source and target tables are CAS tables. The source table CASLIB.TEST1 was created by lifting the original SAS7BDAT dataset WORK.TEST1 (review figure 1) into CAS.

Figure 4. DATA Step executing in CAS

In figure 5, we see that the DATA Step logic is being respected in runs 1, 2 and 3; but we are not achieving repeatable results. This is due to CAS running on multiple threads. Note that the BY statement – which will group the data correctly for each BY group – is done on the fly. Also, the BY statement will not preserve the sequence of rows within the BY group between runs.

For some processes, this is not a concern but for others it could be. If you need to obtain repeatable results in DATA Step code that runs distributed in CAS as well as match your SAS 9 single-threaded DATA Step results, I suggest the following workaround be used.

Figure 5. DATA Step logic is respected but yields different results with each run

With SAS Viya

The workaround is very simplistic to understand and implement. For each SAS7BDAT dataset being lifted into a CAS table, see figure 6, we need to add a new variable ROW_ID.

_n_ VAR1 VAR2
1 1 N
2 1 Y
3 1 Y
4 2 Y
5 2 Y
6 2 N

Figure 6. Original SAS7BDAT dataset source WORK.TEST1

To accomplish this, we will leverage the automatic variable _N_ that is available to all DATA Step programmers. _N_ is initially set to 1. Each time the DATA step loops past the DATA statement, the variable _N_ increments by 1. The value of _N_ represents the number of times the DATA step has iterated. In our case, the value for each row is the row sequence in the original SAS7BDAT dataset. Figure 7 contains the SAS code we ran on the SAS 9.4M5 workspace server or the SAS Viya compute server to add the new variable ROW_ID.

 

Figure 7. Creating the new variable ROW_ID

By reviewing figure 8 we can see the new variable ROW_ID in the SAS7BDAT dataset WORK.TEST1. Now that we have the new variable, we are ready to lift this dataset into CAS.

_N_ VAR1 VAR2 ROW_ID
1 1 N 1
2 1 Y 2
3 1 Y 3
4 2 Y 4
5 2 Y 5
6 2 N 6

Figure 8. WORK.TEST1 with the new variable ROW_ID

There are many ways to lift a SAS7BDAT dataset into CAS. One way is to use a DATA Step like we did in figure 9.

Figure 9. DATA Step code to create distributed CAS table CASLIB.TEST1 

To obtain the repeatable results, we need to control the sequence of rows within each BY group. We accomplish this by adding the new variable ROW_ID as the last variable to the BY statement in our DATA Step code, see figure 10.

Figure 10. Add ROW_ID as last variable of the BY group

Figure 11 shows us the output CAS table created by the code in figure 10. By adding the new variable ROW_ID and using that variable as the last variable of the BY statement, we are controlling the sequencing of rows within the BY groups for all 3 runs.

VAR1 VAR2 ROW_ID
1 N 1

Figure 11. Distrusted CAS table CASLIB.TEST2

Conclusion

With distributed DATA Step comes great opportunities to improve runtimes. It also means we need to understand differences between single-threaded processing of SAS7BDAT datasets that are stored on disk and distributed processing of CAS tables store in-memory. To help you with that journey I suggest you read the SAS Global Forum paper, Parallel Programming with the DATA Step: Next Steps.

How to achieve repeatable results with distributed DATA Step BY Groups was published on SAS Users.

11月 132018
 

In my previous blog post I demonstrated how to create your own CAS actions and action sets.  In this post, we will explore how to create your own CAS functions using the CAS Language (CASL).  A function is a component of the CASL programming language that can accept arguments, perform a computation or other operation, and return a value.  The value that is returned can be used in an assignment statement or elsewhere in expressions.

About SAS functions

SAS provides two types of supplied functions: built-in functions and common functions.  Built-in functions contain functionality that is unique to CASL.  These allow you to perform operations on your result tables, arrays, and dictionaries, and provide run-time support for your CASL programs.  Built-in functions cannot be replaced with user-defined functions.

Conversely, common functions provide functionality that is common to other SAS functions.  When used in a CASL program, SAS functions take a CASL value and a CASL value is returned.  Unlike built-in functions, you can replace these functions with user-defined functions.

Since the capabilities of built-in functions are unique to CASL, let’s look at these in-depth and demonstrate with an example.  Save the following FedSQL code in an external file called hmeqsql.sas.  This code will be read into CAS and stored as a variable.

The execDirect action executes FedSQL code in CAS.  The READPATH built-in function reads the FedSQL code saved in hmeqsql.sas and stores it in the CASL variable sqlcode which is used as input to the query parameter.

The fetch action displays the first 20 rows from the output table hmeq.out.

If you don’t feel like looking through the documentation for a built-in or common function, a list of each can be generated programmatically.  Run the following code to see a list of built-in functions.

Partial list of CASL built-in functions

Run the following code to see a list of common functions.

Partial list of common functions

User-defined CASL functions

In addition to the customizable capabilities of built-in functions supplied by SAS, you can also create your own functions using the FUNCTION statement.  User-defined functions can be called in expressions using CASL and they provide a large amount of flexibility.  The following example creates four different functions for temperature conversion.

After creating these functions, they can be called immediately, or you can store them in an external file and call them via a %include statement.  In this example, the user-defined functions have been stored in an external file called FunctionStore.sas.  You can call one, all, or any number of your user-defined functions.

The output from each function call is displayed in the log.

Lastly, if you want to see all user-defined functions, run the FUNCTIONLIST statement.  A list will be printed to the log.

More about CASL programming and using functions in CASL

Check out these resources for further information on programming in the CASL language and using functions in CASL.

Customize your CASL code with built-in and user-defined functions was published on SAS Users.

11月 062018
 

This post was also written by SAS' Xiangxiang Meng.

You can communicate with various clients (SAS, Python, Lua, Java, and REST) in the same place using Pandas Data Analysis Library, CAS actions should come naturally. CAS enables you to subset tables using Python expressions. Using Python, you can create conditions that are based on the data pulled, instead of creating the conditions yourself. SAS® will use the information you want pulled to determine which rows to select.

For example, rather than using fixed values of rows and columns to select data, SAS can create conditions based on the data in the table to determine which rows to select. This is done using the same syntax as DataFrames. CASColumn objects support Python’s various comparison operators and builds a filter that subsets the rows in the table. You can then use the result of that comparison to index into a CASTable. It sounds much more complicated than it is, so let’s look at an example.

The examples below are from the Iris flower data set, which is available in the SASHELP library, in all distributions of SAS. The listed code and output are produced using the IPython interface but can be employed with Jupyter Notebook just as easily.

If we want to get a CASTable that only contains values where petal_length is greater than 7, we can use the following expression to create our filter.


Behind the scenes, this expression creates a computed column that is used in a WHERE expression on the CASTable. This expression can then be used as an index value for a CASTable. Indexing this way essentially creates a boolean mask. Wherever the expression values are true, the rows of the table are returned. Wherever the expression is false, the rows are filtered out.

These two steps are more commonly done in one line.


We can further filter rows out by indexing another comparison.

Comparisons can be joined using the bitwise comparison operators & (and) and | (or). You do have to be careful with these though due to the operator precedence. Bitwise comparison has a higher precedence than comparisons such as greater-than and less-than, so you need to wrap your comparisons in parentheses.


In all cases, we are not changing anything about the underlying data in CAS. We are simply constructing a query that is executed with the CASTable when it is used as the parameter in a CAS action. You can see what is happening behind the scenes by displaying the resulting CASTable objects.


You can also do mathematical operations on columns with constants or other columns within your comparisons.

The list of supported operations is shown in the table below.

The supported comparison and operators are shown in the following table.

As you can see in the tables above, it is possible to do comparisons on character columns as well. This includes using many of Python’s string methods on the column values. These are accessed using the str attribute of the column, just like in DataFrames.

This easy syntax allows the Python client to manipulate data much easier when working in SAS Viya.

Another great tip? The Python client allows you to manipulate data on the fly, without moving or copying the data to another location. Creating computed columns allows you to speed up the wrangling of data, while giving you options for how want to get there.

Want to learn more great tips about integrating Python with SAS Viya? Check out Kevin Smith and Xiangxiang Meng’s SAS Viya: The Python Perspective to learn how Python can be intergraded into SAS® Viya® —and help you manipulate data with ease.

Great tip for dynamic data selection using SAS Viya and Python was published on SAS Users.

11月 032018
 

When you begin to work within the SAS Viya ecosystem, you learn that the central piece is SAS Cloud Analytic Services (CAS). CAS allows all clients in the SAS Viya ecosystem to communicate and run analytic methods. The great part about SAS Viya is that the R client can drive CAS directly using familiar objects and constructs for R programmers.

The SAS Scripting Wrapper for Analytics Transfer (SWAT) package is an R interface to CAS. With this package, you can load data into memory and apply CAS actions to transform, summarize, model and score the data. You can still retain the ease-of-use of R on the client side to further post process CAS result tables.

But before you can do any analysis in CAS, you need some data to work with and a way to get to it. There are two data access components in CAS:

  1. Caslibs, definitions that give access to a resource that contains data.
  2. CASTables, for analyzing data from a caslib resource. You load the data into a CASTable, which contains information about the data in the columns.

Other references you may find of interest include this GitHub repository where you can find more information on installing and configuring CAS and SWAT. Also available is this article on using RStudio with SAS Viya.

The following excerpt from SAS® Viya® : the R Perspective, the book I co-authored with my SAS colleague Xiangxiang Meng, demonstrates the way the R client in SAS Viya allows you to select data with precision. The examples come from the iris flower data set, which is available in the SASHELP library, in all distributions of SAS. The CASTable object sorttbl is sorted by the Sepal.Width column.

Rather than using fixed values of rows and columns to select data, we can create conditions that are based on the data in the table to determine which rows to select. The specification of conditions is done using the same syntax as that used by data.frame objectsCASTable objects support R’s various comparison operators and build a filter that subsets the rows in the table. You can then use the result of that comparison to index into a CASTableIt sounds much more complicated than it is, so let’s look at an example.

This expression creates a computed column that is used in a where expression on the CASTable. This expression can then be used as an index value for a CASTable. Indexing this way essentially creates a Boolean mask. Wherever the expression values are true, the rows of the table are returned. Wherever the expression is false, the rows are filtered out.

> newtbl <- sorttbl[expr,] > head(newtbl) 
 
  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species 
1          7.7         2.6          6.9         2.3 virginica 
2          7.7         2.8          6.7         2.0 virginica 
3          7.6         3.0          6.6         2.1 virginica 
4          7.7         3.8          6.7         2.2 virginica

These two steps are commonly entered on one line.

> newtbl <- sorttbl[sorttbl$Petal.Length > 6.5,]
> head(newtbl) 
 
  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species 
1          7.7         2.6          6.9         2.3 virginica 
2          7.7         2.8          6.7         2.0 virginica 
3          7.6         3.0          6.6         2.1 virginica 
4          7.7         3.8          6.7         2.2 virginica

We can further filter rows out by indexing another comparison expression.

> newtbl2 <- newtbl[newtbl$Petal.Width < 2.2,] > head(newtbl2) 
 
  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species 
1          7.7         2.8          6.7         2.0 virginica 
2          7.6         3.0          6.6         2.1 virginica

Comparisons can be joined using the bitwise comparison operators & (and) and | (or). You must be careful with these operators though due to operator precedence. Bitwise comparison has a lower precedence than comparisons such as greater-than and less-than, but it is still safer to enclose your comparisons in parentheses.

> newtbl3 <- sorttbl[(sorttbl$Petal.Length > 6.5) & (sorttbl$Petal.Width < 2.2),] > head(newtbl3) 
 
  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species 
1          7.7         2.8          6.7         2.0 virginica 
2          7.6         3.0          6.6         2.1 virginica

In all cases, we are not changing anything about the underlying data in CAS. We are simply constructing a query that is executed with the CASTable when it is used as the parameter in a CAS action. You can see what is happening behind the scenes by displaying the attributes of the resulting CASTable objects.

> attributes(newtbl3) 
 
$conn 
CAS(hostname=server-name.mycompany.com, port=8777, username=username, session=11ed56e2-f9dd-9346-8d01-44a496e68880, protocol=http) 
 
$tname
[1] "iris" 
 
$caslib 
[1] "" 
 
$where 
[1] "((\"Petal.Length\"n > 6.5) AND (\"Petal.Width\"n < 2.2))" 
 
$orderby 
[1] "Sepal.Width" 
 
$groupby 
[1] "" 
 
$gbmode 
[1] "" 
 
$computedOnDemand 
[1] FALSE 
 
$computedVars 
[1] "" 
 
$computedVarsProgram 
[1] "" 
 
$XcomputedVarsProgram 
[1] "" 
 
$XcomputedVars 
[1] "" 
 
$names 
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  
[5] "Species"      
 
$class 
[1] "CASTable" 
attr(,"package") 
[1] "swat"

You can also do mathematical operations on columns with constants or on other columns within your comparisons.

> iris[(iris$Petal.Length + iris$Petal.Width) * 2 > 17.5,] 
 
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species 
118          7.7         3.8          6.7         2.2 virginica 
119          7.7         2.6          6.9         2.3 virginica

The list of supported operators is shown in the following table:

Operator Numeric Data Character Data
+ (add) ✔
- (subtract) ✔
* (multiply) ✔
/ (divide) ✔
%% (modulo) ✔
%/% (integer division) ✔
^ (power) [✔

The supported comparison operators are shown in the following table.

Operator Numeric Data Character Data
== (equality) ✔ ✔
!= (inequality) ✔ ✔
< (less than) ✔ ✔
> (greater than) ✔ ✔
<= (less than or equal to) ✔ ✔
>= (greater than or equal to) ✔ ✔

 

As you can see in the preceding tables, you can do comparisons on character columns as well. In the following example, all of the rows in which Species is equal to "virginica" are selected and saved to a new CASTable object virginica. Note that in this case, data is still not duplicated.

> tbl <- defCasTable(conn, 'iris') > virginica <- tbl[tbl$Species == 'virginica',] > dim(virginica) 
 
[1] 50  5 
 
> head(virginica) 
 
  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species 
1          7.7         3.0          6.1         2.3 virginica 
2          6.3         3.4          5.6         2.4 virginica 
3          6.4         3.1          5.5         1.8 virginica 
4          6.0         3.0          4.8         1.8 virginica 
5          6.9         3.1          5.4         2.1 virginica 
6          6.7         3.1          5.6         2.4 virginica

It’s easy to create powerful filters that are executed in CAS while still using the R syntax. However, the similarities to dataframe don’t end there. CASTable objects can also create computed columns and by groups using similar techniques.

Want to learn more? Get your copy of SAS Viya: The R Perspective

How to use SAS® Viya® and R for dynamic data selection was published on SAS Users.

10月 292018
 

CASL is a language specification that can be used by the SAS client to interact with and provide easy access to Cloud Analytic Services (CAS).  CASL is a statement-based scripting language with many uses and strengths including:

  • Specifying CAS actions to submit requests to the CAS server to perform work and return results.
  • Evaluating and manipulating the results returned by an action.
  • Creating user-defined actions and functions and creating the arguments to an action.
  • Developing analytic pipelines.

CASL uses PROC CAS which enables you to program and execute CAS actions from the SAS client and use the results to prepare the parameters for a subsequent action.  A single PROC CAS statement can contain several CASL programs.  With the CAS procedure you can run any CAS action supported by the server, load new action sets into the server, use multiple sessions to perform asynchronous execution and operate on parameters and results as variables using the function expression parser.

CASL, and the CAS actions, provide the most control, flexibility and options when interacting with CAS.  One can use DATA Step, CAS-enabled PROCS and CASL for optimal flexibility and control.  CASL works well with traditional SAS interfaces and the Base SAS language.

Each CAS action belongs to an action set.  Each action set is further categorized by product (i.e. VA, VS, VDMML, etc.).  In addition to the many CAS actions supplied by SAS, as of SAS® Viya™ 3.4, you can create your own actions using CASL.  Developing and utilizing your own CAS actions allows you to further customize your code and increase your ability to work with CAS in a manner that best suits you and your organization.

About user-defined action sets

Developing a CASL program that is stored on the CAS server for processing is defined as a user-defined action set.  Since the action set is stored on the CAS server, the CASL statements can be written once and executed by many users. This can reduce the need to exchange files between users that store common code.  Note that you cannot add, remove, or modify a single user-defined action. You must redefine the entire action set.

Before creating any user-defined actions, test your routines and functions first to ensure they execute successfully in CAS when submitted from the programming client.  To create user-defined actions, use the defineActionSet action in the builtins action set and add your code.  You also need to modify your code to use CASL functions such as SEND_RESPONSE, so the resulting objects on the server are returned to the client.

Developing new actions by combining SAS-provided CAS actions

One method for creating user-defined CAS actions is to combine one or more SAS provided CAS actions into a user-defined CAS action.  This allows you to execute just one PROC CAS statement and call all user-defined CAS actions.  This is beneficial if you repeatedly run many of the same actions against a CAS table.  An example of this is shown below. If you would like copy of the actual code, feel free to leave a reply below.

In this example, four user-defined CAS actions named listTableInfo, simplefreq, detailfreq, and corr have been created by using the corresponding SAS-provided CAS actions tableInfo, freq, freqTab, and correlation.  These four actions return information about a CAS table, simple frequency information, detailed frequency and tabulate information, and Pearson correlation coefficients respectively.  These four actions are now part of the newly created user-defined action set myActionSet.  When this code is executed, the log will display a note that the new action set has been added.

Once the new action set and actions have been created, you can call all four or any combination of them via a PROC CAS statement.  Specify the user-defined action set, user-defined action(s), and parameters for each.

Developing new actions by writing your own code

Another way to create user-defined CAS actions is to apply user-defined code, functions, and statements instead of SAS-provided CAS actions.

In this example, two user-defined CAS actions have been created, bdayPct and sos.  These actions belong to the new user-defined action set myFunctionSet.

To call one or both actions, specify the user-defined action set, user-defined action(s), and parameters for each.

The results for each action are shown in the log.

Save and load custom actions across CAS sessions

User-defined action sets only exist in the current CAS session.  If the current CAS session is terminated, the program to create the user-defined action set must be executed again unless an in-memory table is created from the action set and the in-memory table is subsequently persisted to a SASHDAT file.  Note: SASHDAT files can only be saved to path-based caslibs such as Path, DNFS, HDFS, etc.  To create an in-memory table and persist it to a SASHDAT file, use the actionSetToTable and save CAS actions.

To use the user-defined action set, it needs to be restored from the saved SASHDAT file.  This is done with the actionSetFromTable action.

More about CASL programming and CAS actions

Check out these resources for further information on programming in the CASL language and running actions with CASL.

How to use CASL to develop and work with user-defined CAS actions was published on SAS Users.

8月 132018
 

Data in the cloud makes it easily accessible, and can help businesses run more smoothly. SAS Viya runs its calculations on Cloud Analytics Service (CAS). David Shannon of Amadeus Software spoke at SAS Global Forum 2018 and presented his paper, Come On, Baby, Light my SAS Viya: Programming for CAS. (In addition to being an avid SAS user and partner, David must be an avid Doors fan.) This article summarizes David's overview of how to run SAS programs in SAS Viya and how to use CAS sessions and libraries.

If you're using SAS Viya, you're going to need to know the basics of CAS to be able to perform calculations and use SAS Viya to its potential. SAS 9 programs are compatible with SAS Viya, and will run as-is through the CAS engine.

Using CAS sessions and libraries

Use a CAS statement to kick off a session, then use CAS libraries (caslibs) to store data and resources. To start the session, simply code "cas;" Each CAS session is given its own unique identifier (UUID) that you can use to reconnect to the session.

Handpicked Related VIDEO: SAS programming in the cloud: CASL code

There are a few significant codes that can help you to master CAS operations. Consider these examples, based on a CAS session that David labeled "speedyanalytics":

  • What CAS sessions do I have running?
    cas _all_ list;
  • Get the version and license specifics from the CAS server hosting my session:
    cas speedyanalytics listabout;
  • I want to sign out of SAS Studio for now, so I will disconnect from my CAS session, but return to it later…
    cas speedyanalytics disconnect;
  • ...later in the same or different SAS Studio session, I want to reconnect to the CAS session I started earlier using the UUID I previous grabbed from the macro variable or SAS log:
    cas uuid="&speedyanalytics_uuid";
  • At the end of my program(s), shutdown all my CAS sessions to release resources on the server:
    cas _all_ terminate;

Using CAS libraries

CAS libraries (caslib) are the method to access data that is being stored in memory, as well as the related metadata.

From the library, you can load data into CAS tables in a couple of different ways:

  1. Takes a sample data set, calculate a new measure and stores the output in memory
  2. Proc COPY can bring existing SAS data into a caslib
  3. Proc CASUTIL loads tables into caslibs

The Proc CASUTIL allows you to save your tables (named "classsi" data in David's examples) for future use through the SAVE statement:

proc casutil;
 save casdata="classsi" casout="classsi";
run;

And reload like this in a future session, using the LOAD statement:

proc casutil;
 load casdata="classsi" casout="classsi";
run;

When accessing your CAS libraries, remember that there are multiple levels of scope that can apply. "Session" refers to data from just the current session, whereas "Global" allows you to reach data from all CAS sessions.

Programming in CAS

Showing how to put CAS into action, David shared this diagram of a typical load/save/share flow:

Existing SAS 9 programs and CAS code can both be run in SAS Viya. The calculations and data memory occurs through CAS, the Cloud Analytics Service. Before beginning, it's important to understand a general overview of CAS, to be able to access CAS libraries and your data. For more about CAS architecture, read this paper from CAS developer Jerry Pendergrass.

The performance case for SAS Viya

To close out his paper, David outlined a small experiment he ran to demonstrate performance advantages that can be seen by using SAS Viya v3.3 over a standard, stand-alone SAS v9.4 environment. The test was basic, but performed reads, writes, and analytics on a 5GB table. The tests revealed about a 50 percent increase in performance between CAS and SAS 9 (see the paper for a detailed table of comparison metrics). SAS Viya is engineered for distributive computing (which works especially well in cloud deployments), so more extensive tests could certainly reveal even further increases in performance in many use cases.

Additional resources

A quick introduction to CAS in SAS Viya was published on SAS Users.

7月 182018
 

Last year when I went through the SAS Global Forum 2017 paper list, the paper Breaking through the Barriers: Innovative Sampling Techniques for Unstructured Data Analysis impressed me a lot. In this paper, the author raised out the common problems caused by traditional sampling method and proposed four sampling methods for textual data. Recently my team is working on a project in which we are facing a huge volume of documents from a specific field, and we need efforts of linguists and domain experts to analyze the textual data and annotate ground truth, so our first question is which documents we should start working on to get a panoramic image of the data with minimum efforts. Frankly, I don’t have a state-of-the-art method to extract representative documents and measure its effect, so why not try this innovative technique?

The paper proposed four sampling methods, and I only tried the first method through using cluster memberships as a strata. Before we step into details of the SAS program, let me introduce the steps of this method.

  • Step 1: Parse textual data into tokens and calculate each term's TF-IDF value
  • Step 2: Generate term-by-document matrix
  • Step 3: Cluster documents through k-means algorithm
  • Step 4: Get top k terms of each cluster
  • Step 5: Do stratified sampling by cluster

I wrote a SAS macro for each step so that you are able to check the results step by step. If you are not satisfied with the final cluster result, you can tune the parameters of any step and re-run this step and its post steps. Now let's see how to do this using SAS Viya to extract samples from a movie review data.

The movie review data has 11,855 rows of observations, and there are 200,963 tokens. After removing stop words, there are 18,976 terms. In this example, I set dimension size of the term-by-document matrix as 3000. This means that I use the top 3000 terms with the highest TF-IDF values of the document collections as its dimensions. Then I use k-means clustering to group documents into K clusters, and I set the maximum K as 50 with the kClus action in CAS. The dataSegment action can cluster documents directly, but this action cannot choose the best K. You need to try the clustering action with different K values and choose the best K by yourself. Conversely the kClus action chooses the best K automatically among the K values defined by minimum K and maximum K, so I use kClus action in my implementation.

After running the program (full code at the end of this post), I got 39 clusters and top 10 terms of the first cluster as Table-1 shows.

Table-1 Top 10 terms of Cluster 1

Let's see what samples we get for the first cluster. I got 7 documents and each document either has term "predictable" or term "emotional."

Samples from cluster

I set sampPct as 5 which means 5% data will be randomly selected from each cluster. Finally I got 582 sample documents. Let's check the sample distribution of each cluster.

Donut chart of cluster samples

This clustering method helped us select a small part of documents from the piles of document collections intelligently, and most importantly it saved us much time and helped us to hit the mark.

I haven't had a chance to try the other three sampling methods from the paper; I encourage you have a try and share your experiences with us. Big thanks to my colleague Murali Pagolu for sharing this innovative technique during the SAS Global Forum 2017 conference and for kindly providing me with some good suggestions.

Appendix: Complete code for text sampling

 
/*-------------------------------------*/
/* Get tfidf                           */
/*-------------------------------------*/
%macro getTfidf(
   dsIn=, 
   docVar=, 
   textVar=, 
   language=, 
   stemming=true, 
   stopList=, 
   dsOut=
);
proc cas;
textparse.tpParse /
   docId="&docVar"
   documents={name="&dsIn"}
   text="&textVar"
   language="&language"
   cellWeight="NONE"
   stemming=false
   tagging=false
   noungroups=false
   entities="none"
   offset={name="tpparse_out",replace=TRUE}
;
run;
 
textparse.tpAccumulate /
   offset={name="tpparse_out"}
   stopList={name="&stopList"}
   termWeight="NONE"
   cellWeight="NONE"
   reduce=1
   parent={name="tpAccu_parent",replace=TRUE}
   terms={name="tpAccu_term",replace=TRUE}
   showdroppedterms=false
;
run;
quit;
 
proc cas;
loadactionset "fedsql";
execdirect casout={name="doc_term_stat", replace=true} 
query="
      select tpAccu_parent.&docVar, 
             tpAccu_term._term_,
             tpAccu_parent._count_ as _tf_,
             tpAccu_term._NumDocs_
      from tpAccu_parent
      left join tpAccu_term
      on tpAccu_parent._Termnum_=tpAccu_term._Termnum_;
"
;
run;
 
simple.groupBy / 
   table={name="tpAccu_parent"}
   inputs={"&docVar"}
   casout={name="doc_nodup", replace=true};
run;
 
numRows result=r / 
   table={name="doc_nodup"};
totalDocs = r.numrows;
run;
 
datastep.runcode /
code = "
   data &dsOut;
      set doc_term_stat;"
   ||"_tfidf_ = _tf_*log("||totalDocs||"/_NumDocs_);"
   ||"run;
";
run;
quit;
 
proc cas;
   table.dropTable name="tpparse_out" quiet=true; run;
   table.dropTable name="tpAccu_parent" quiet=true; run;
   table.dropTable name="tpAccu_term" quiet=true; run;
   table.dropTable name="doc_nodup" quiet=true; run;
   table.dropTable name="doc_term_stat" quiet=true; run;
quit;
%mend getTfidf;
 
 
/*-------------------------------------*/
/* Term-by-document matrix             */
/*-------------------------------------*/
%macro DocToVectors(
   dsIn=, 
   docVar=, 
   termVar=, 
   tfVar=, 
   dimSize=500, 
   dsOut=
);
proc cas;
simple.summary /
   table={name="&dsIn", groupBy={"&termVar"}}
   inputs={"&tfVar"}
   summarySubset={"sum"}
   casout={name="term_tf_sum", replace=true};
run;
 
simple.topk / 
   table={name="term_tf_sum"}  
   inputs={"&termVar"} 
   topk=&dimSize
   bottomk=0 
   raw=True 
   weight="_Sum_"
   casout={name='termnum_top', replace=true};
run;
 
loadactionset "fedsql";
execdirect casout={name="doc_top_terms", replace=true} 
query="
      select termnum.*, _rank_
      from &dsIn termnum, termnum_top
      where termnum.&termVar=termnum_top._Charvar_
        and &tfVar!=0;
"
;
run;
 
transpose.transpose /
   table={name="doc_top_terms", 
          groupby={"&docVar"}, 
          computedVars={{name="_name_"}},
          computedVarsProgram="_name_='_dim'||strip(_rank_)||'_';"}  
   transpose={"&tfVar"}
   casOut={name="&dsOut", replace=true};
run;
quit;
 
proc cas;
   table.dropTable name="term_tf_sum" quiet=true; run;
   table.dropTable name="termnum_top" quiet=true; run;
   table.dropTable name="termnum_top_misc" quiet=true; run;
   table.dropTable name="doc_top_terms" quiet=true; run;
quit;
%mend DocToVectors;
 
 
/*-------------------------------------*/
/* Cluster documents                   */
/*-------------------------------------*/
%macro clusterDocs(
   dsIn=, 
   nClusters=10,
   seed=12345,   
   dsOut=
);
proc cas;
/*get the vector variables list*/
columninfo result=collist /
   table={name="&dsIn"};
ndimen=dim(collist['columninfo']);
vector_columns={};
j=1;
do i=1 to ndimen;
   thisColumn = collist['columninfo'][i][1];
   if lowcase(substr(thisColumn, 1, 4))='_dim' then do;
      vector_columns[j]= thisColumn;
      j=j+1;
   end;
end;
run;
 
clustering.kClus / 
   table={name="&dsIn"},
   nClusters=&nClusters,
   init="RAND",
   seed=&seed,
   inputs=vector_columns,
   distance="EUCLIDEAN",
   printIter=false,
   impute="MEAN",
   standardize='STD',
   output={casOut={name="&dsOut", replace=true}, copyvars="ALL"}
;
run;
quit;
%mend clusterDocs;
 
 
/*-------------------------------------*/
/* Get top-k words of each cluster     */
/*-------------------------------------*/
%macro clusterProfile(
   termDS=, 
   clusterDS=, 
   docVar=, 
   termVar=, 
   tfVar=, 
   clusterVar=_CLUSTER_ID_, 
   topk=10, 
   dsOut=
);
proc cas;
loadactionset "fedsql";
execdirect casout={name="cluster_terms",replace=true} 
query="
      select &termDS..*, &clusterVar
      from &termDS, &clusterDS
      where &termDS..&docVar = &clusterDS..&docVar;
"
;
run;
 
simple.summary /
   table={name="cluster_terms", groupBy={"&clusterVar", "&termVar"}}
   inputs={"&tfVar"}
   summarySubset={"sum"}
   casout={name="cluster_terms_sum", replace=true};
run;
 
simple.topk / 
   table={name="cluster_terms_sum", groupBy={"&clusterVar"}}  
   inputs={"&termVar"} 
   topk=&topk
   bottomk=0 
   raw=True 
   weight="_Sum_"
   casout={name="&dsOut", replace=true};
run;
quit;
 
proc cas;
   table.dropTable name="cluster_terms" quiet=true; run;
   table.dropTable name="cluster_terms_sum" quiet=true; run;
quit;
%mend clusterProfile;
 
 
/*-------------------------------------*/
/* Stratified sampling by cluster      */
/*-------------------------------------*/
%macro strSampleByCluster(
   docDS=, 
   docClusterDS=, 
   docVar=, 
   clusterVar=_CLUSTER_ID_, 
   seed=12345,   
   sampPct=, 
   dsOut=
);
proc cas;
loadactionset "sampling";
stratified result=r /
   table={name="&docClusterDS", groupby={"&clusterVar"}}
   sampPct=&sampPct 
   partind="TRUE" 
   seed=&seed
   output={casout={name="sampling_out",replace="TRUE"},
                   copyvars={"&docVar", "&clusterVar"}};
run;
print r.STRAFreq; run;
 
loadactionset "fedsql";
execdirect casout={name="&dsOut", replace=true} 
query="
   select docDS.*, &clusterVar
   from &docDS docDS, sampling_out
   where docDS.&docVar=sampling_out.&docVar
     and _PartInd_=1;
"
;
run;
 
proc cas;
   table.dropTable name="sampling_out" quiet=true; run;
quit; 
%mend strSampleByCluster;
 
 
/*-------------------------------------*/
/* Start CAS Server.                   */
/*-------------------------------------*/
cas casauto host="host.example.com" port=5570;
libname sascas1 cas;
 
 
/*-------------------------------------*/
/* Prepare and load data.              */
/*-------------------------------------*/
%let myData=movie_reviews;
 
proc cas;
loadtable result=r / 
   importOptions={fileType="csv", delimiter='TAB',getnames="true"}
   path="data/movie_reviews.txt"
   casLib="CASUSER"
   casout={name="&myData", replace="true"} ;
run;
quit;
 
/* Browse the data */
proc cas;
   columninfo / table={name="&myData"};
   fetch / table = {name="&myData"};
run;
quit;
 
/* generate one unique index using data step */
proc cas;
datastep.runcode /
code = "
   data &myData;
      set &myData;
      rename id = _document_;
      keep id text score;  
   run;
";
run;
quit;
 
/* create stop list*/
data sascas1.stopList;
   set sashelp.engstop;
run;
 
/* Get tfidf by term by document */
%getTfidf(
   dsIn=&myData, 
   docVar=_document_, 
   textVar=text, 
   language=english, 
   stemming=true, 
   stopList=stopList, 
   dsOut=doc_term_tfidf
);
 
/* document-term matrix */
%DocToVectors(
   dsIn=doc_term_tfidf, 
   docVar=_document_, 
   termVar=_term_, 
   tfVar=_tfidf_, 
   dimSize=2500, 
   dsOut=doc_vectors
);
 
/* Cluster documents */
%clusterDocs(
   dsIn=doc_vectors, 
   nClusters=10, 
   seed=12345,   
   dsOut=doc_clusters
);
 
/* Get top-k words of each cluster */
%clusterProfile(
   termDS=doc_term_tfidf, 
   clusterDS=doc_clusters, 
   docVar=_document_, 
   termVar=_term_, 
   tfVar=_tfidf_, 
   clusterVar=_cluster_id_, 
   topk=10, 
   dsOut=cluster_topk_terms
);
/*-------------------------------------------*/
/* Sampling textual data based on clustering */
/*-------------------------------------------*/
 
 
/*-------------------------------------*/
/* Get tfidf                           */
/*-------------------------------------*/
%macro getTfidf(
   dsIn=, 
   docVar=, 
   textVar=, 
   language=, 
   stemming=true, 
   stopList=, 
   dsOut=
);
proc cas;
textparse.tpParse /
   docId="&docVar"
   documents={name="&dsIn"}
   text="&textVar"
   language="&language"
   cellWeight="NONE"
   stemming=false
   tagging=false
   noungroups=false
   entities="none"
   offset={name="tpparse_out",replace=TRUE}
;
run;
 
textparse.tpAccumulate /
   offset={name="tpparse_out"}
   stopList={name="&stopList"}
   termWeight="NONE"
   cellWeight="NONE"
   reduce=1
   parent={name="tpAccu_parent",replace=TRUE}
   terms={name="tpAccu_term",replace=TRUE}
   showdroppedterms=false
;
run;
quit;
 
proc cas;
loadactionset "fedsql";
execdirect casout={name="doc_term_stat", replace=true} 
query="
      select tpAccu_parent.&docVar, 
             tpAccu_term._term_,
             tpAccu_parent._count_ as _tf_,
             tpAccu_term._NumDocs_
      from tpAccu_parent
      left join tpAccu_term
      on tpAccu_parent._Termnum_=tpAccu_term._Termnum_;
"
;
run;
 
simple.groupBy / 
   table={name="tpAccu_parent"}
   inputs={"&docVar"}
   casout={name="doc_nodup", replace=true};
run;
 
numRows result=r / 
   table={name="doc_nodup"};
totalDocs = r.numrows;
run;
 
datastep.runcode /
code = "
   data &dsOut;
      set doc_term_stat;"
   ||"_tfidf_ = _tf_*log("||totalDocs||"/_NumDocs_);"
   ||"run;
";
run;
quit;
 
proc cas;
   table.dropTable name="tpparse_out" quiet=true; run;
   table.dropTable name="tpAccu_parent" quiet=true; run;
   table.dropTable name="tpAccu_term" quiet=true; run;
   table.dropTable name="doc_nodup" quiet=true; run;
   table.dropTable name="doc_term_stat" quiet=true; run;
quit;
%mend getTfidf;
 
 
/*-------------------------------------*/
/* Term-by-document matrix             */
/*-------------------------------------*/
%macro DocToVectors(
   dsIn=, 
   docVar=, 
   termVar=, 
   tfVar=, 
   dimSize=500, 
   dsOut=
);
proc cas;
simple.summary /
   table={name="&dsIn", groupBy={"&termVar"}}
   inputs={"&tfVar"}
   summarySubset={"sum"}
   casout={name="term_tf_sum", replace=true};
run;
 
simple.topk / 
   table={name="term_tf_sum"}  
   inputs={"&termVar"} 
   topk=&dimSize
   bottomk=0 
   raw=True 
   weight="_Sum_"
   casout={name='termnum_top', replace=true};
run;
 
loadactionset "fedsql";
execdirect casout={name="doc_top_terms", replace=true} 
query="
      select termnum.*, _rank_
      from &dsIn termnum, termnum_top
      where termnum.&termVar=termnum_top._Charvar_
        and &tfVar!=0;
"
;
run;
 
transpose.transpose /
   table={name="doc_top_terms", 
          groupby={"&docVar"}, 
          computedVars={{name="_name_"}},
          computedVarsProgram="_name_='_dim'||strip(_rank_)||'_';"}  
   transpose={"&tfVar"}
   casOut={name="&dsOut", replace=true};
run;
quit;
 
proc cas;
   table.dropTable name="term_tf_sum" quiet=true; run;
   table.dropTable name="termnum_top" quiet=true; run;
   table.dropTable name="termnum_top_misc" quiet=true; run;
   table.dropTable name="doc_top_terms" quiet=true; run;
quit;
%mend DocToVectors;
 
 
/*-------------------------------------*/
/* Cluster documents                   */
/*-------------------------------------*/
%macro clusterDocs(
   dsIn=, 
   nClusters=10,
   seed=12345,   
   dsOut=
);
proc cas;
/*get the vector variables list*/
columninfo result=collist /
   table={name="&dsIn"};
ndimen=dim(collist['columninfo']);
vector_columns={};
j=1;
do i=1 to ndimen;
   thisColumn = collist['columninfo'][i][1];
   if lowcase(substr(thisColumn, 1, 4))='_dim' then do;
      vector_columns[j]= thisColumn;
      j=j+1;
   end;
end;
run;
 
clustering.kClus / 
   table={name="&dsIn"},
   nClusters=&nClusters,
   init="RAND",
   seed=&seed,
   inputs=vector_columns,
   distance="EUCLIDEAN",
   printIter=false,
   impute="MEAN",
   standardize='STD',
   output={casOut={name="&dsOut", replace=true}, copyvars="ALL"}
;
run;
quit;
%mend clusterDocs;
 
 
/*-------------------------------------*/
/* Get top-k words of each cluster     */
/*-------------------------------------*/
%macro clusterProfile(
   termDS=, 
   clusterDS=, 
   docVar=, 
   termVar=, 
   tfVar=, 
   clusterVar=_CLUSTER_ID_, 
   topk=10, 
   dsOut=
);
proc cas;
loadactionset "fedsql";
execdirect casout={name="cluster_terms",replace=true} 
query="
      select &termDS..*, &clusterVar
      from &termDS, &clusterDS
      where &termDS..&docVar = &clusterDS..&docVar;
"
;
run;
 
simple.summary /
   table={name="cluster_terms", groupBy={"&clusterVar", "&termVar"}}
   inputs={"&tfVar"}
   summarySubset={"sum"}
   casout={name="cluster_terms_sum", replace=true};
run;
 
simple.topk / 
   table={name="cluster_terms_sum", groupBy={"&clusterVar"}}  
   inputs={"&termVar"} 
   topk=&topk
   bottomk=0 
   raw=True 
   weight="_Sum_"
   casout={name="&dsOut", replace=true};
run;
quit;
 
proc cas;
   table.dropTable name="cluster_terms" quiet=true; run;
   table.dropTable name="cluster_terms_sum" quiet=true; run;
quit;
%mend clusterProfile;
 
 
/*-------------------------------------*/
/* Stratified sampling by cluster      */
/*-------------------------------------*/
%macro strSampleByCluster(
   docDS=, 
   docClusterDS=, 
   docVar=, 
   clusterVar=_CLUSTER_ID_, 
   seed=12345,   
   sampPct=, 
   dsOut=
);
proc cas;
loadactionset "sampling";
stratified result=r /
   table={name="&docClusterDS", groupby={"&clusterVar"}}
   sampPct=&sampPct 
   partind="TRUE" 
   seed=&seed
   output={casout={name="sampling_out",replace="TRUE"},
                   copyvars={"&docVar", "&clusterVar"}};
run;
print r.STRAFreq; run;
 
loadactionset "fedsql";
execdirect casout={name="&dsOut", replace=true} 
query="
   select docDS.*, &clusterVar
   from &docDS docDS, sampling_out
   where docDS.&docVar=sampling_out.&docVar
     and _PartInd_=1;
"
;
run;
 
proc cas;
   table.dropTable name="sampling_out" quiet=true; run;
quit; 
%mend strSampleByCluster;
 
/*-------------------------------------*/
/* Start CAS Server.                   */
/*-------------------------------------*/
cas casauto host="host.example.com" port=5570;
libname sascas1 cas;
caslib _all_ assign;
 
/*-------------------------------------*/
/* Prepare and load data.              */
/*-------------------------------------*/
%let myData=movie_reviews;
 
proc cas;
loadtable result=r / 
   importOptions={fileType="csv", delimiter='TAB',getnames="true"}
   path="data/movie_reviews.txt"
   casLib="CASUSER"
   casout={name="&myData", replace="true"} ;
run;
quit;
 
/* Browse the data */
proc cas;
   columninfo / table={name="&myData"};
   fetch / table = {name="&myData"};
run;
quit;
 
/* generate one unique index using data step */
proc cas;
datastep.runcode /
code = "
   data &myData;
      set &myData;
      rename id = _document_;
      keep id text score;  
   run;
";
run;
quit;
 
/* create stop list*/
data sascas1.stopList;
   set sashelp.engstop;
run;
 
/* Get tfidf by term by document */
%getTfidf(
   dsIn=&myData, 
   docVar=_document_, 
   textVar=text, 
   language=english, 
   stemming=true, 
   stopList=stopList, 
   dsOut=doc_term_tfidf
);
 
/* document-term matrix */
%DocToVectors(
   dsIn=doc_term_tfidf, 
   docVar=_document_, 
   termVar=_term_, 
   tfVar=_tfidf_, 
   dimSize=3000, 
   dsOut=doc_vectors
);
 
/* Cluster documents */
%clusterDocs(
   dsIn=doc_vectors, 
   nClusters=50, 
   seed=12345,   
   dsOut=doc_clusters
);
 
/* Get top-k words of each cluster */
%clusterProfile(
   termDS=doc_term_tfidf, 
   clusterDS=doc_clusters, 
   docVar=_document_, 
   termVar=_term_, 
   tfVar=_tfidf_, 
   clusterVar=_cluster_id_, 
   topk=10, 
   dsOut=cluster_topk_terms
);
 
/* Browse topk terms of the first cluster */
proc cas;
fetch / 
   table={name="cluster_topk_terms",
          where="_cluster_id_=1"};
run;
quit;
 
/* Stratified sampling by cluster      */
%strSampleByCluster(
   docDS=&myData, 
   docClusterDS=doc_clusters, 
   docVar=_document_, 
   clusterVar=_cluster_id_, 
   seed=12345,   
   sampPct=5,
   dsOut=doc_sample_by_cls
);
 
/* Browse sample documents of the first cluster */
proc cas;
fetch / 
   table={name="doc_sample_by_cls",
          where="_cluster_id_=1"};
run;
quit;

How to sample textual data with SAS was published on SAS Users.

6月 222018
 

reference CAS tables using a one-level nameReferencing tables

With the SAS language the way one references a table in code is by using a two-level name or a one-level name. With two-level names, one supplies the libref as well as the table name i.e. PROC MEANS DATA = WORK.TABLE;. By default, all one-level names also refer to SASWORK i.e. PROC MEANS DATA = TABLE;

CAS

To reference CAS tables using a one-level name we will issue two statements that alter which libref houses the tables referenced as one-level names. First, we will create a CAS libref (line 77) followed by the “options user” statement on line 80. It is line 80 that changes the default location from SASWORK to the CAS libref i.e. CASWORK.Figure 1. Statements to alter the default location for one-level names

How to reference one-level CAS tables

From this point on all one-level names referenced in code will be tables managed by CAS. In figure 2 we are creating a one-level CAS table called baseball by reading the two-level named table SASHELP.BASEBALL. This step executes in the SAS Programing Runtime Engine (a SAS Viya based workspace server) and creates the table CASWORK.BASEBALL. Because of the “options user” statement we can now also reference that table using a one-level name i.e. BASEBALL.

Figure 2. Loading a two-level named table into a one-level named table that is managed by CAS

In Figure 3 we will use a DATA Step to read a CAS table and write a CAS table using one-level names. We can also see by reviewing the notes in the SAS log that this DATA Step ran in CAS using multiple threads.

Figure 3. DATA Step referencing one-level named tables

In Figure 4 we observe this PROC MEANS is processing the one-level named table BASEBALL. By reviewing the notes in the SAS log we can see this PROC MEANS ran distributed in CAS.

Figure 4. PROC MEANS referencing one-level named CAS table

Because the default location for one-level names is SASWORK, all tables in SASWORK are automatically deleted when the SAS session ends. When one changes the default location for one-level names, like we just did, it is a best practice to use a PROC DELETE as the last statement in your code to delete all one-level tables managed by CAS, figure 5.

Figure 5. PROC DELETE deleting all one-level CAS tables

Conclusion

It is a very common SAS coding technique to read a source table from a non-SAS data store and write it to SASWORK. By using the technique describe in this blog one now has options on where to store the one-level tables names. As for me, I prefer storing them in CAS so I benefit from the distributed process (faster runtimes) that CAS offers.

How to reference CAS tables using a one-level name was published on SAS Users.

6月 192018
 

CAS DATA StepCloud Analytic Services (CAS) is really exciting. It’s open. It’s multi-threaded. It’s distributed. And, best of all for SAS programmers, it’s SAS. It looks like SAS. It feels like SAS. In fact, you can even run DATA Step in CAS. But, how does DATA Step work in a multi-threaded, distributed context? What’s new? What’s different? If I’m a SAS programming wizard, am I automatically a CAS programming wizard?

While there are certain _n_ automatic variable as shown below:

DATA tableWithUniqueID;
SET tableWithOutUniqueID; 
 
        uniqueID = _n_;
 
run;

CAS DATA Step

Creating a unique ID in CAS DATA Step is a bit more complicated. Each thread maintains its own _n_. So, if we just use _n_, we’ll get duplicate IDs. Each thread will produce an uniqueID field value of 1, 2..and so on. …. When the thread output is combined, we’ll have a bunch of records with an uniqueID of 1 and a bunch with an uniqueID of 2…. This is not useful.

To produce a truly unique ID, you need to augment _n_ with something else. _threadID_ automatic variable can help us get our unique ID as shown below:

DATA tableWithUniqueID;
SET tableWithOutUniqueID;
 
        uniqueID = put(_threadid_,8.) || || '_' || Put(_n_,8.);
 
run;

While there are surely other ways of doing it, concatenating _threadID_ with _n_ ensures uniqueness because the _threadID_ uniquely identifies a single thread and _n_ uniquely identifies a single row output by that thread.

Aggregation with DATA Step

Now, let’s look at “whole table” aggregation (no BY Groups).

SAS DATA Step

Aggregating an entire table in SAS DATA Step usually looks something like below. We create an aggregator field (totSalesAmt) and then add the detail records’ amount field (SaleAmt) to it as we process each record. Finally, when there are no more records (eof), we output the single aggregate row.

DATA aggregatedTable ;
SET detailedTable end=eof;
 
      retain totSalesAmt 0;
      totSalesAmt = totSalesAmt + SaleAmt;
      keep totSalesAmt;
      if eof then output;
 
run;

CAS DATA Step

While the above code returns one row in single-engine SAS, the same code returns multiple rows in CAS — one per thread. When I ran this code against a table in my environment, I got 28 rows (because CAS used 28 threads in this example).

As with the unique ID logic, producing a total aggregate is just a little more complicated in CAS. To make it work in CAS, we need a post-process step to bring the results together. So, our code would look like this:

DATA aggregatedTable ;
SET detailedTable end=eof;
 
      retain threadSalesAmt 0;
      threadSalesAmt = threadSalesAmt + SaleAmt;
      keep threadSalesAmt;
      if eof then output;
 
run;
 
DATA aggregatedTable / single=yes;
SET aggregatedTable end=eof;
 
      retain totSalesAmt 0;
      totSalesAmt = totSalesAmt + threadSalesAmt;
      if eof then output;
 
run;

In the first data step in the above example, we ran basically the same code as in the SAS DATA Step example. In that step, we let CAS do its distributed, multi-threaded processing because our table is large. Spreading the work over multiple threads makes the aggregation much quicker. After this, we execute a second DATA Step but here we force CAS to use only one thread with the single=yes option. This ensures we only get one output row because CAS only uses one thread. Using a single thread in this case is optimal because we’ll only have a few input records (one per thread from the previous step).

BY-GROUP Aggregation

Individual threads are then assigned to individual BY-Groups. Since each BY-Group is processed by one and only one thread, when we aggregate, we won’t see multiple output rows for a BY-Group. So, there shouldn’t be a need to consolidate the thread results like there was with “whole table” aggregation above.

Consequently, BY-Group aggregation DATA Step code should look exactly the same in CAS and SAS (at least for the basic stuff).

Concluding Thoughts

Coding DATA Step in CAS is very similar to coding DATA Step in SAS. If you’re a wizard in one, you’re likely a wizard in the other. The major difference is accounting for CAS’ massively parallel processing capabilities (which manifest as threads). For more insight into data processing with CAS, check out the SAS Global Forum paper.

Threads and CAS DATA Step was published on SAS Users.

4月 192018
 

A very common coding technique SAS programmers use is identifying the largest value for a given column using DATA Step BY statement with the DESCENDING option. In this example I wanted to find the largest value of the number of runs (nRuns) by each team in the SASHELP.BASEBALL dataset. Using a SAS workspace server, one would write:

DESCENDING BY Variables in SAS Viya

Figure 1. Single Threaded DATA Step in SAS Workspace Server

Figure 2 shows the results of the code we ran in Figure 1:

Figure 2. Result from SAS Code Displayed in Figure 1

To run this DATA Step distributed we will leveraging the SAS® Cloud Analytic Services in SAS® Viya™. Notice in Figure 3, there is no need for the PROC SORT step which is required when running DATA Step single threaded in a SAS workspace server. This is because SAS® Cloud Analytic Services in SAS® Viya™ . Instead we will use

Figure 3. Distributed DATA Step in SAS® Cloud Analytic Services in SAS® Viya™

Figure 4 shows the results when running distributed DATA Step in SAS® Cloud Analytic Services in SAS® Viya™.

Figure 4. Results of Distributed DATA Step in SAS® Cloud Analytic Services in SAS® Viya™

Conclusion

Until the BY statement running in the SAS® Cloud Analytic Services in SAS® Viya™ supports DESCENDING use this technique to ensure your DATA Step runs distributed.

Read more SAS Viya posts.

Read our SAS 9 to SAS Viya whitepaper.

How to Simulate DESCENDING BY Variables in DATA Step Code that Runs Distributed in SAS® Viya™ was published on SAS Users.