CAS

12月 072020
 

In my previous blog post, I talked about using PROC CAS to accomplish various data preparation tasks. Since then, my colleague Todd Braswell and I worked through some interesting challenges implementing an Extract, Transform, Load (ETL) process that continuously updates data in CAS. (Todd is really the brains behind getting this to work – I am just along for the ride.)

In a nutshell, the process goes like this:

  1. PDF documents drop into a "receiving" directory on the server. The documents have a unique SubmissionID. Some documents are very large – thousands of pages long.
  2. A Python job runs and converts PDFs into plain text. Python calls an API that performs Optical Character Recognition (OCR) and saves off the output as a CSV file, one row per page, in the PDF document.
  3. A SAS program, running in batch, loads the CSV file with new records into a CAS table. SubmissionID is passed to the batch program as a macro variable, which is used as part of the CAS table name.
  4. Records loaded from the CSV file are appended to the Main table. If records with the current SubmissionID already exist in the Main table, they are deleted and replaced with new records.
    The Main table is queried by downstream processes, which extract subsets of data, apply model score code, and generate results for the customer.

Continuously update data process flow

Due to the volume of data and the amount of time it takes to OCR large PDFs, the ETL process runs in multiple sessions simultaneously. And here is a key requirement: the Main table is always available, in a promoted state, with all up-to-date records, in order for the model score code to pick up the needed records.

What does "promoted state" mean?

The concept of table scope, which was introduced with the first release of SAS Viya, presents a challenge. CAS tables are in-memory tables that can have one of two "scopes":

  • Session scope – the table exists within the scope of your current CAS session and drops from memory as soon as the session ends. Functionally, this is somewhat similar to the data you write to the WORK library in SAS 9 – once you disconnect, the data drops from the WORK library.
  • Global scope – the table is available to all sessions. If your session ends, you will still have access to it when you start a new session. Your colleagues also maintain access, assuming they have the necessary permissions. For the table to assume global scope, it needs to be promoted.

Common promotion techniques for a table are the DATA STEP, PROC CAS, or PROC CASUTIL. For example:

/*promote a table using DATA STEP*/
*this snippet copies a SAS 9 table into CAS and promotes in one step;
data mycaslib.my_table (promote=yes);
     set mylib.my_table;
     run;
 
/*promote using PROC CASUTIL*/
*this snippet promotes a session-scope CAS table to global scope;
proc casutil incaslib='mycaslib' outcaslib='mycaslib';
     promote casdata='my_table';
     quit;
 
/*Promote using PROC CAS*/
*same as above, this snippet promotes a session-scope table to global scope;
proc cas;
table.promote / 
     caslib='mycaslib'
     targetcaslib='mycaslib' 
     name='my_table' 
     target='my_table';
    quit;

Fun Facts About Table Promotion

You cannot promote a table that has already been promoted. If you need to promote a new version of the same table, you need to first drop the existing table and promote the new version.

To discover the current table state, use the

proc cas;
     table.tableinfo / 
     caslib='mycaslib' 
     name='main';
     quit;

If you append rows to a promoted table using the DATA STEP append option, the new rows are automatically promoted. For example, in this snippet the mycaslib.main table, which is promoted, remains promoted when the rows from mycaslib.new_rows are appended to it:

data mycaslib.main(append=yes);
     set mycaslib.new_rows;
     run;

When you manipulate a promoted table using the DATA STEP apart from appending rows, it creates a new, session-scope version of the same table. You will have two versions of the table: the global-scope table, which remains unchanged, and the session-scope version which has the changes you implemented. Even if you don't change anything in the table and simply run:

data mycaslib.my_table;
     set mycaslib.my_table;
     run;

in which mycaslib.my_table is promoted, you end up with a promoted and an unpromoted version of this table in the mycaslib library – a somewhat unexpected and hardly desired result. Appendix 1 walks through a quick exercise you can try to verify this.

As you probably guessed, this is where we ran into trouble with our ETL process: the key requirement was for the Main table to remain promoted, yet we needed to continuously update it. The task was simple if we just needed to append the rows; however, we also needed to replace the rows if they already existed. If we tried to delete the existing rows using the DATA STEP, we would have to deal with the changes applied to a session-scope copy of the global-scope table.

Initially, we designed the flow to save off the session-scope table with changes, then drop the (original) global-scope version, and finally reload the up-to-date version. This was an acceptable workaround, but errors started to pop up when a session looked for the Main table to score data, while a different concurrent session reloaded the most up-to-date data. We were uncertain how this would scale as our data grew.

PROC CAS to the rescue!

After much research, we learned the deleteRows, which allows you to delete rows directly from a global-scope table. The data is never dropped to session-scope – exactly what we needed. Here's an example:

proc cas;
     table.deleteRows /
     table={caslib="mycaslib", name="Main", where="SubmissionID = 12345"};
     quit;

In case you are wondering, the Tables action set also has an

/*1. Load new rows. SubmissionID macro variable is a parameter passed to the batch program*/
/*New rows are written to the casuser library, but it does not really matter which caslib you choose – 
   we are not persisting them across sessions*/
proc casutil;
       load file="/path_to_new_rows/New_rows_&SubmissionID..csv" outcaslib="casuser" casout="new_rows";
       quit;
/*2. Delete rows with the current SubmissionID */
proc cas;
       table.deleteRows /
       table={caslib="prod", name="Main", where="SubmissionID = &SubmissionID."};
       quit;
/*3. Append new rows*/
data mycaslib.main(append=yes);
	set mycaslib.new_rows;
	run;
/*4. Save the main table to ensure we have a disk backup of in-memory data*/
proc casutil incaslib="prod" outcaslib="prod";
	save casdata="main" replace;
	quit;

Conclusion

We learned how to continuously update data in CAS while ensuring the data remains available to all sessions accessing it asynchronously. We learned the append option in DATA STEP automatically promotes new rows but manipulating the data in a global-scope table through DATA STEP in other ways leads to data being copied to session-scope. Finally, we learned that to ensure the table remains promoted while it is updated, we can fall back on PROC CAS.

Together, these techniques enabled implementation of a robust data flow that overcomes concurrency problems due to multiple processes updating and querying the data.

Acknowledgement

We thank Brian Kinnebrew for his generous help in investigating this topic and the technical review.

Appendix 1

Try the following exercise to verify that manipulating a promoted table in DATA STEP leads to two copies of the table – session- AND global-scope.

/*copy and promote a sample SAS 9 table*/
data casuser.cars(promote=yes);
     set sashelp.cars;
     run;
/*check the number of rows and confirm that the table is promoted*/
proc cas;
     table.tableinfo / caslib='casuser' name='cars';
     quit; /*The table is promoted and has 428 rows*/
 
/*delete some rows in the promoted table*/
data casuser.cars;
     set casuser.cars;
     if make='Acura' then delete;
     run;
/*check again – how may rows does the table have? Is it promoted?*/
proc cas;
     table.tableinfo / caslib='casuser' name='cars';
     quit;  /*The table is has 421 rows but it is no longer promoted*/
 
/*reset your CAS session*/
/*kill your current CAS session */
cas _all_ terminate;
/*start a new CAS session and assign caslibs*/
cas; 
caslib _all_ assign;
 
/*check again – how may rows does the table have? Is it promoted?*/
proc cas;
     table.tableinfo / caslib='casuser' name='cars';
     quit;  /*The table is promoted and has 428 rows*/

What we see here is, manipulating a global-scope table in DATA STEP leads to duplication of data. CAS copies the table to session-scope and applies the changes there. The changes go away if you terminate the session. One way to get around this issue is instead of trying to overwrite the promoted table, create a new table, then drop the old table and promote the new table under the old table's name. Otherwise, use table.Update actions to update/delete rows in place, as described in this post.

Append and Replace Records in a CAS Table was published on SAS Users.

10月 032020
 

If you're a SAS programmer who now uses SAS Viya and CAS, it's worth your time to optimize your existing programs to take advantage of the new environment. This post is a continuation of my SAS Global Forum 2020 paper Best Practices for Converting SAS® Code to Leverage SAS® Cloud Analytic Services and my SGF 2020 Super Demo.

The best approach for refactoring SAS code for SAS Viya has a few steps:

  • First, "lift and shift" your existing code to run successfully in the compute server for SAS Viya.
  • Next, create CASLIB statements to all of your data sources: i.e. sas7bdat, CSV files, parquet files, RDBMs, cloud data sources, etc.
  • Finally, identify the longest running steps so you know where you have the biggest opportunities. For example, look at steps where the "real time" is 30 minutes or longer, as well as steps that are CPU bound. CPU-bound steps are steps where the CPU time is equal to or greater than the real time for that step.

To help us identify those steps we can leverage a new utility to analyze SAS logs and create reports to help us understand the Real Time and CPU Time for each step. Read on the learn more about this final step in the code refactoring process.

Utility: SAS Log Parser

To generate these reports, I created a SAS program that will read all SAS log files in a directory and create one report per SAS log as well as a descending Real Time (Clock Time) and CPU Time reports. Figure 1 is an example of the report that is generated for each SAS log. In this report we see each step’s procedure or DATA Step’s Real Time and CPU Time. It's derived by picking up on SAS log entries like this:

NOTE: PROCEDURE SGPLOT used (Total process time):
      real time           2.79 seconds
      cpu time            0.08 seconds

NOTE: The SAS System used:
      real time           1:08.86
      cpu time            1:18.18

Note, the Total Time and Total CPU Time are fields that are populated when the SAS log note “NOTE: The SAS System used:” is encountered. SAS programs that are ran in batch or using an RSUBMIT process via

Figure 1. sampleETS.log report

Descending Real Time Report

Figure 2 contains an example of the descending real time report. In this report we observe in the Step column that the longest running step is a PROC LOGISTIC that takes over 14 hours (Real Time column) from the SAS log called Sample3.log (File Name column). The best way to use this report is to focus on steps that take longer than 30 minutes. In our case we have 9 steps from 3 SAS logs. Now that we know that we can review the details of each step and then benchmark if that step would run faster by leveraging SAS® Cloud Analytic Services (CAS). Note, for CAS to process data all source and target tables, all data must be in CAS tables and the step must be coded using CAS-enabled steps.

Figure 2. Descending Real Time Report

Descending CPU Time Report

Figure 3 contains an example of the descending CPU times report where we observe that the most CPU intensive step takes over 13 hours (CPU Time column) and is from the Samples3.log (File Name column). Note, if you review the Real Time and CPU Time columns you should notice that only observation 11 (Obs column) has a CPU Time that exceeds the Real Time making it CPU bound.  However, we would not focus on this step since the Real Time is less than 30 minutes.

Figure 3. Descending CPU Time Report

Source code for the SAS Log Parser

I've bundled my SAS code for these steps in my GitHub repository for SAS Global Forum 2020. You'll find these programs along with the other code that supports my previous topics of adapting SAS 9 code for SAS Viya.

sasLogParserMacros.sas contains three macros that drive the process. The macro program %LIST_FILES lists all files with a given extension, %CHECK checks for the existence of a file and deletes it if found, and %SASLOG parses a SAS log and provides the values found in the reports. When you save this file ensure you name it sasLogParserMacros.sas and in the same directory that you save sasLogParser.sas.

sasLogParser.sas is the program we submit to produce the reports. This code includes the code sasLogParserMacros.sas and then generate the repots. The only two statements we need to modify are the first two %LET statements in this program. The first %LET statement points to the location of the two SAS programs sasLogParserMacros.sas and sasLogParser.sas. The second %LET statement points to the directory containing all the SAS logs we want to parse. Note, outside of the two %let statements in sasLogParser.sas do not change any other statement in either program.

Conclusion

In order to understand which steps are good candidates for leveraging the in-memory engine CAS, we must first understand the real time and CPU time of each step. Then we can benchmark which engine in SAS Viya is appropriate for that step -- the compute server or CAS.  The code that I've shared for benchmarking can run within SAS 9 or SAS Viya, on both Windows and Linux.

7月 072020
 

Are you looking for a specific CAS action to use in your project? Maybe you need to create a linear or logistic regression and can't seem to find the CAS action? In this post in the Getting Started with Python Integration to SAS® Viya® series, we are going to look at exploring and loading CAS action sets.

In this example, I have already made my connection to CAS using SAS Viya for Learners, and my connection object is named conn. Visit Part 1 of the series to learn more about making a connection to CAS.

Exploring CAS Action Sets

To explore CAS action sets you can use the actionSetInfo CAS action. This should look familiar from Part 2 - Working with CAS Actions and CASResults Objects.

conn.actionsetinfo()['setinfo']

Your results might differ depending on your SAS Viya installation.

The results show the default CAS action sets loaded with SAS Viya. However, many more are available. To see all available CAS action sets use the parameter all=True. The results from running the command with the all=True parameter includes both loaded and available action sets.

conn.actionsetinfo(all=True)['setinfo']

Your results might differ depending on your SAS Viya installation.

As a result of the all=True parameter, we can see over a hundred available CAS action sets. The next question is, how do I load an action set?

Loading a CAS Action Set

I think of loading an action set like importing a package in Python.  To load an action set you need to complete two tasks:

  1. Find the action set
  2. Load the action set

Find the action set

First, let's find the action set. Let's assume we need to complete a logistic regression but are unsure of where the necessary CAS action is located. Instead of of manually scrolling through the CASResults object from the actionSetInfo CAS action, we can instead search the SASDataFrame (summarized data from CAS) for keywords.

First,  I'll set the SASDataFrame from the CASResults object equal to the variable df by calling the setinfo key.

df = conn.actionsetinfo(all=True)['setinfo']

Next, I'll use the loc method on the SASDataFrame to search for any action set that contains the string regression. There are a variety of ways to do this; I'll use the string contains method. I'll make the search case insensitive by using case=False.

df.loc[df['actionset'].str.contains('regression', case=False), :]

Great! In the results I see a regression CAS action set. That looks exactly like what I need. Next, it's time to load the action set.

Load the action set

To load the action use the loadActionSet action with the actionSet='regression' parameter.

conn.loadActionSet(actionSet='regression')


That's it! You have now loaded a CAS action set! Finally, let's explore the actions inside the regression CAS action set.

Exploring CAS Actions in a CAS Action Set

Once a CAS action set is loaded, you can explore the available CAS actions by using the help action with the parameter actionSet='regression'.

conn.help(actionSet='regression')

The results display a list of all the CAS actions in the regression action set. I see the logistic CAS action that I need!

Summary

In conclusion, exploring and loading CAS action sets is important when working on your projects in SAS Viya. A couple of key points to remember:

  • The actionSetInfo CAS action returns all loaded CAS action sets.
  • The actionSetInfo parameter all=True returns all available CAS action sets.
  • The loadActionSet CAS action loads a CAS action set.

In the next post we will talk about exploring data stored in the CAS environment in Exploring Caslibs.

If you enjoyed this blog post, share it with a friend!

Additional Information and Resources

SAS® Viya® 3.5 Actions and Action Sets by Name and Product

Getting Started with Python Integration to SAS® Viya® Index

Getting Started with Python Integration to SAS® Viya® - Part 3 - Loading a CAS Action Set was published on SAS Users.

6月 192020
 

In the second post of the Getting Started with Python Integration to SAS® Viya® series we will learn about Working with CAS Actions and CASResults Objects. CAS actions are commands sent to the CAS server to run a task, and CASResults objects contain information returned from the CAS server. This post will cover a few basic CAS actions, and how to easily work with the information returned.

CAS Actions Overview

First, you need to understand CAS actions. From performing data preparation, modeling, imputing missing values, or even retrieving information about your CAS session, CAS actions perform a single task on the CAS server. CAS actions are organized with other CAS actions in a CAS action set. CAS action sets contain actions that are based on common functionality.

In the end, I like to think of CAS action sets as a package, and all the CAS actions inside an action set as a method.

Getting Started with CAS Actions

We will start with a basic CAS action to view all available loaded action sets in the current CAS session. Before you use any CAS action you must connect to the CAS server. For more information on connecting to the CAS server, visit Part 1 of the series.

I have already made my connection to CAS using SAS Viya for Learners, and my connection object is named conn.

display(conn)
CAS('svflhost.demo.sas.com', 5570, 'peter.test@sas.com', protocol='cas', name='py-session-1', session='efff4323-a862-bd6e-beea737b4249')

Next, to view all available action sets loaded in your CAS session use the actionSetInfo CAS action from the Builtins action set on the CAS connection object. CAS actions and CAS action sets are case insensitive, and you do not need to qualify the CAS action with the action set name (although it is best practice to include both). In the example below I qualify the CAS action.

The actionSetInfo CAS action returns the following output:

Your output information might differ depending on your SAS Viya installation.

The output shows a list of the loaded action sets on the CAS server with additional information for each. With a quick scan, it almost looks like a Pandas DataFrame. However, let's go back and run the actionSetInfo CAS action, but this time use the Python type function to see the data type of the output.

type(conn.builtins.actionsetinfo())
swat.cas.results.CASResults

The actionSetInfo CAS action returns a CASResults object. The million dollar question is, what exactly is a CASResults object?

CASResults Object

CAS actions return a CASResults object. A CASResults object is an ordered Python dictionary with additional methods and attributes. There are no specific rules about how many keys are contained in the CASResults object, or what values the keys return. That information depends on the CAS action used.

One thing to note is that CASResults objects are local on your client machine. That is, the data is not in CAS anymore, it has been processed by CAS and returned locally. This is something to keep in mind when you are working with big data in CAS and try to return more data than your local computer can handle.

Let's continue by examining the output from our previous example. The CASResults object returns one key named setinfo, and one value. You can quickly tell by looking at the output.

However, another method to view all the keys in the CAResults object is to use the Python dictionary keys method. In this example, it will return one key named setinfo.

conn.actionsetinfo().keys()
odict_keys(['setinfo'])

Be aware of how I specified the CAS action in the above example. This time I did not qualify it with the Builtins CAS action set. While you can specify with or without the CAS action set, I recommend to be consistent. I typically do not type the CAS action set. I'll keep that method moving forward.

To view a specific value of a CASResults object you can call the key like a typical Python dictionary. In this example you can call the setinfo key to return the key's value.

conn.actionsetinfo()['setinfo']

As mentioned earlier, the output seems to resemble a Pandas DataFrame. To confirm our suspicion, we can use the type function to see the data type of the value returned by the setinfo key,

type(conn.actionsetinfo()['setinfo'])
swat.dataframe.SASDataFrame

Interesting! The CASResults object contains a SASDataFrame for the setinfo key. What exactly is a SASDataFrame?

Understanding a SASDataFrame

A SASDataFrame is a subclass of a Pandas DataFrame. As a result, you can work with them as you normally would a Pandas DataFrame! Another thing to note is a SASDataFrame is local data. Again, as you use CAS you must be aware CAS can handle more data than your local computer can handle. When bringing data locally make sure it's usable on your local machine. We will discuss this in future posts.

Let's work with the SASDataFrame. First, I will create a new variable named df that holds the SASDataFrame from the actionSetInfo action.

df = conn.actionsetinfo()['setinfo']

Next, I'll use some Pandas methods on the df object. Let's start with the head method to view the first five rows.

df.head()

Or filter the data using the loc method. In this scenario I want to find all action sets with the name simple, and return the actionset and label columns.

df.loc[df['actionset'] == 'simple', ['actionset', 'label']]

Or check unique values of the product_name column by using the value_counts method.

df['product_name'].value_counts()
tkcas        8
crsstat      3
crssearch    1

Or...

No that's it. I think you get my point! You can work with a SASDataFrame object like you would a Pandas DataFrame!

CASResults Object With Multiple Keys

Lastly, let's use another CAS action, but this time the CASResults object contains multiple keys, with multiple value types.

In this example I use the serverStatus CAS action from the Builtins CAS action set. The CAS action returns the status of the server.

conn.serverstatus()

From the looks of the output, I see there are three keys; About, server, and nodestatus. However, let's check our assumption by using the keys method on the CASResults object.

conn.serverstatus().keys()
odict_keys(['About', 'server', 'nodestatus'])

In the output there are three keys as expected. Let's look at the data type of each key by writing a quick Python loop to print the name of the key and the value type returned by that key.

for key,value in conn.serverstatus().items():
      print('Key : {}, Value Type : {}'.format(key,type(value)))
Key : About, Value Type : <class 'dict'>
Key : server, Value Type : <class 'swat.dataframe.SASDataFrame'>
Key : nodestatus, Value Type : <class 'swat.dataframe.SASDataFrame'>

The serverStatus CAS action contains three keys, one key contains a dictionary object, and the other two keys contain a SASDataFrame.

Summary

In conclusion, understanding CAS actions and the CASResults objects are essential when working with CAS. A couple of key points to remember:

  • CAS actions reside in CAS action sets and perform a single task on the CAS server.
  • CAS actions return a CASResults object, which is simply a Python dictionary.
  • A CASResults object contains a single or multiple keys, corresponding to a value or values of any data type.
  • A SASDataFrame resides in a CASResults object, and is a subclass of a Pandas DataFrame.

That was a quick overview of a few basic CAS actions, and how to work with the CASResults objects they return.  You can check out the SAS documentation below for all available CAS Actions.

Additional Resources

SAS® Viya® 3.5 Actions and Action Sets by Name and Product

Getting Started with Python Integration to SAS® Viya® Series Index

Getting Started with Python Integration to SAS® Viya® - Part 2 - Working with CAS Actions and CASResults Objects was published on SAS Users.

6月 192020
 

Index of articles on Getting Started with Python Integration to SAS® Viya®.
Part 1 - Making a Connection
Part 2 - Working with CAS Actions and CASResults Objects
Part 3 - Loading a CAS Action Set
Part 4 - Exploring Caslibs
Part 5 - Uploading Data into CAS
Part 6 - Session vs Global Scope Tables
Part 7 - Exploring Tables? CAS Actions vs SWAT API?

Getting Started with Python Integration to SAS® Viya® - Index was published on SAS Users.

3月 052020
 

Have you heard that SAS offers a collection of new, high-performance CAS procedures that are compatible with a multi-threaded approach? The free e-book Exploring SAS® Viya®: Data Mining and Machine Learning is a great resource to learn more about these procedures and the features of SAS® Visual Data Mining and Machine Learning. Download it today and keep reading for an excerpt from this free e-book!

In SAS Studio, you can access tasks that help automate your programming so that you do not have to manually write your code. However, there are three options for manually writing your programs in SAS® Viya®:

  1. SAS Studio provides a SAS programming environment for developing and submitting programs to the server.
  2. Batch submission is also still an option.
  3. Open-source languages such as Python, Lua, and Java can submit code to the CAS server.

In this blog post, you will learn the syntax for two of the new, advanced data mining and machine learning procedures: PROC TEXTMINE and PROCTMSCORE.

Overview

The TEXTMINE and TMSCORE procedures integrate the functionalities from both natural language processing and statistical analysis to provide essential functionalities for text mining. The procedures support essential natural language processing (NLP) features such as tokenizing, stemming, part-of-speech tagging, entity recognition, customized stop list, and so on. They also support dimensionality reduction and topic discovery through Singular Value Decomposition.

In this example, you will learn about some of the essential functionalities of PROC TEXTMINE and PROC TMSCORE by using a text data set containing 1,830 Amazon reviews of electronic gaming systems. The data set is named Amazon. You can find similar data sets of Amazon reviews at http://jmcauley.ucsd.edu/data/amazon/.

PROC TEXTMINE

The Amazon data set has already been loaded into CAS. The review content is stored in the variable ReviewBody, and we generate a unique review ID for each review. In the proc call shown in Program 1 we ask PROC TEXTMINE to do three tasks:

  1. parse the documents in table reviews and generate the term by document matrix
  2. perform dimensionality reduction via Singular Value Decomposition
  3. perform topic discovery based on Singular Value Decomposition results

Program 1: PROC TEXTMINE

data mycaslib.amazon;
    set mylib.amazon;
run;

data mycaslib.engstop;
    set mylib.engstop;
run;

proc textmine data=mycaslib.amazon;
    doc_id id;
    var reviewbody;

 /*(1)*/  parse reducef=2 entities=std stoplist=mycaslib.engstop 
          outterms=mycaslib.terms outparent=mycaslib.parent
          outconfig=mycaslib.config;

 /*(2)*/  svd k=10 svdu=mycaslib.svdu outdocpro=mycaslib.docpro
          outtopics=mycaslib.topics;

run;

(1) The first task (parsing) is specified in the PARSE statement. Parameter “reducef” specifies the minimum number of times a term needs to appear in the text to be included in the analysis. Parameter “stop” specifies a list of terms to be excluded from the analysis, such as “the”, “this”, and “that”. Outparent is the output table that stores the term by document matrix, and Outterms is the output table that stores the information of terms that are included in the term by document matrix. Outconfig is the output table that stores configuration information for future scoring.

(2) Tasks 2 and 3 (dimensionality reduction and topic discovery) are specified in the SVD statement. Parameter K specifies the desired number of dimensions and number of topics. Parameter SVDU is the output table that stores the U matrix from SVD calculations, which is needed in future scoring. Parameter OutDocPro is the output table that stores the new matrix with reduced dimensions. Parameter OutTopics specifies the output table that stores the topics discovered.

Click the Run shortcut button or press F3 to run Program 1. The terms table shown in Output 1 stores the tagging, stemming, and entity recognition results. It also stores the number of times each term appears in the text data.

Output 1: Results from Program 1

PROC TMSCORE

PROC TEXTMINE is used with large training data sets. When you have new documents coming in, you do not need to re-run all the parsing and SVD computations with PROC TEXTMINE. Instead, you can use PROC TMSCORE to score new text data. The scoring procedure parses the new document(s) and projects the text data into the same dimensions using the SVD weights derived from the original training data.

In order to use PROC TMSCORE to generate results consistent with PROC TEXTMINE, you need to provide the following tables generated by PROC TEXTMINE:

  • SVDU table – provides the required information for projection into the same dimensions.
  • Config table – provides parameter values for parsing.
  • Terms table – provides the terms that should be included in the analysis.

Program 2 shows an example of TMSCORE. It uses the same input data layout used for PROC TEXTMINE code, so it will generate the same docpro and parent output tables, as shown in Output 2.

Program 2: PROC TMSCORE

Proc tmscore data=mycaslib.amazon svdu=mycaslib.svdu
        config=mycaslib.config terms=mycaslib.terms
        svddocpro=mycaslib.score_docpro outparent=mycaslib.score_parent;
    var reviewbody;
    doc_id id;
run;

 

Output 2: Results from Program 2

To learn more about advanced data mining and machine learning procedures available in SAS Viya, including PROC FACTMAC, PROC TEXTMINE, and PROC NETWORK, you can download the free e-book, Exploring SAS® Viya®: Data Mining and Machine Learning. Exploring SAS® Viya® is a series of e-books that are based on content from SAS® Viya® Enablement, a free course available from SAS Education. You can follow along with examples in real time by watching the videos.

 

Learn about new data mining and machine learning procedures in SAS Viya was published on SAS Users.

2月 192020
 

Are you a statistical programmer whose company has adopted SAS Viya? If so, you probably know that the DATA step can run in parallel in SAS Cloud Analytic Services (CAS). As Sekosky (2017) says, "running in a single thread in SAS is different from running in many threads in CAS." He goes on to state, "you cannot take any DATA step, change the librefs used, and have it run correctly in parallel. You ... have to know what your program is doing to make sure you know what it does when it runs in parallel."

This article discusses one aspect of "know what your program is doing." Specifically, to run in parallel, the DATA step must use only functions and statements that are "CAS-enabled." Most DATA step functions run in CAS. However, there is a set of "SAS-only" functions that do not run in CAS. This article discusses these functions and provides a link to a list of the SAS-only functions. It also shows how you can get SAS to tell you whether your DATA step contains a SAS-only function.

DATA steps that run in CAS

By default, the DATA step will attempt to run in parallel when the step satisfies three conditions (Bultman and Secosky, 2018, p. 2):

  1. All librefs in the step are CAS engine librefs to the same CAS session.
  2. All statements in the step are supported by the CAS DATA step.
  3. All functions, CALL routines, formats, and informats in the step are available in CAS.

The present article is concerned with the third condition. How can you know in advance whether all functions and CALL routines are available in CAS?

A list of DATA step functions that do not run in CAS

For every DATA step function, the SAS Viya documentation indicates whether the function is supported on CAS or not. For example, the following screenshots of the Viya 3.5 documentation show the documentation for the RANUNI function and the RAND function:

Notice that the documentation for the RANUNI function says, "not supported in a DATA step that runs in CAS," whereas the RAND function is in the "CAS category," which means that it is supported in CAS. This means that if you use the RANUNI function in a DATA step, the DATA step will not run in CAS. (Similarly, for the other old-style random number functions, such as RANNOR.) Instead, it will try to run in SAS. This could result in copying input data from CAS, running the program in a single thread, and copying the final data set into a CAS table. Copying all that data is not efficient.

Fortunately, you do not need to look up every function to determine if it is a CAS-enabled or SAS-only function. The documentation now includes a list, by category, of the Base SAS functions (and CALL routines) that do not run in CAS. The following screenshot shows the top of the documentation page.

Can you always rewrite a DATA step to run in CAS?

For the example in the previous section (the RANUNI function), there is a newer function (the RAND function) that has the same functionality and is supported in CAS. Thus, if you have an old SAS program that uses the RANUNI function, you can replace that call with RAND("UNIFORM") and the modified DATA step will run in CAS. Unfortunately, not all functions have a CAS-enabled replacement. There are about 200 Base SAS functions that do not run in CAS, and most of them cannot be replaced by an equivalent function that runs in CAS.

The curious reader might wonder why certain classes of functions cannot run in CAS. Here are a few sets of functions that do not run in CAS, along with a few reasons:

  • Functions specific to the English language, such as the SOUNDEX and SPEDIS functions. Also, functions that are specific to single-byte character sets (especially I18N Level 0). Most of these functions are not applicable to an international audience that uses UTF-8 encoding.
  • Functions and statements for reading and writing text files. For example, INFILE, INPUT, and FOPEN/FCLOSE. There are other ways to import text files into CAS.
  • Macro-related functions such as SYMPUT and SYMGET. Remember: There are no macro variables in CAS! The macro pre-processor is a SAS-specific feature, and one of the principles of SAS Viya is that programmers who use other languages (such as Python or Lua) should have equal access to the functionality in Viya.
  • Old-style functions for generating random numbers from probability distributions. Use the RAND function instead.
  • Functions that rely on a single-threaded execution on a data set that has ordered rows. Examples include DIF, LAG, and some permutation/combination functions such as ALLCOMB. Remember: A CAS data table does not have an inherent order.
  • The US-centric state and ZIP code functions.
  • Functions for working with Git repositories.

How to force the DATA step to stop if it is not able to run in CAS

When a DATA step runs in CAS, you will see a note in the log that says:
NOTE: Running DATA step in Cloud Analytic Services.

If the DATA step runs in SAS, no note is displayed. Suppose that you intend for a DATA step to run in CAS but you make a mistake and include a SAS-only function. What happens? The default behavior is to (silently) run the DATA step in SAS and then copy the (possibly huge) data into a CAS table. As discussed previously, this is not efficient.

You might want to tell the DATA step that it should run in CAS or else report an error. You can use the SESSREF= option to specify that the DATA step must run in a CAS session. For example, if MySess is the name of your CAS session, you can submit the following DATA step:

/* use the SESSREF= option to force the DATA step to report an ERROR if it cannot run in CAS */
data MyCASLib.Want / sessref=MySess;
   x = ranuni(1234);    /* this function is not supported in the CAS DATA step */
run;
NOTE: Running DATA step in Cloud Analytic Services.
ERROR: The function RANUNI is unknown, or cannot be accessed.
ERROR: The action stopped due to errors.

The log is shown. The NOTE says that the step was submitted to a CAS session. The first ERROR message reports that your program contains a function that is not available. The second ERROR message reports that the DATA step stopped running. A DATA step that runs on CAS calls the dataStep.runCode action, which is why the message says "the action stopped."

This is a useful trick. The SESSREF= option forces the DATA step to run in CAS. If it cannot, it tells you which function is not CAS-enabled.

Other ways to monitor where the DATA steps runs

The DATA step documentation in SAS Viya contains more information about how to control and monitor where the DATA step runs. In particular, it discusses how to use the MSGLEVEL= system option to get detailed information about where a DATA step ran and in how many threads. The documentation also includes additional examples and best practices for running the DATA step in CAS. I recommend the SAS Cloud Analytic Services: DATA Step Programming documentation as the first step towards learning the advantages and potential pitfalls of running a DATA step in CAS.

Summary

The main purpose of this article is to provide a list of Base SAS functions (and CALL routines) that do not run in CAS. If you include one of these functions in a DATA step, the DATA step cannot run in CAS. This can be inefficient. You can use the SESSREF= option on the DATA statement to force the DATA step to run in CAS. If it cannot run in CAS, it stops with an error and informs you which function is not supported in CAS.

The post A list of SAS DATA step functions that do not run in CAS appeared first on The DO Loop.

2月 172020
 

Administrators like to be able to keep track of resource usage and who is using what in a system. When an administrator has this capability, they can look out for issues of high resource usage that may have an impact on overall system performance. In Viya, data is accessed in caslibs. In this post, I will show you how an administrator can track and control resource usage of personal caslibs.

A Caslib is an in-memory space to hold tables, access control lists, and data source information. In GEL enablement classes as we have discussed CAS and caslibs, one of the big asks we have had was related to personal caslibs and how can an administrator keep track of the resources that they use. Until recently we didn’t have a great answer, the good news is now we do.

Personal caslibs are, by default, the active caslib when a user starts a CAS session. This gives each user a default location to access and load data (sort of like a home directory). As the name suggests, they are personal and only the user who starts the session can access the data. In early releases, administrators saw this as a big problem because personal caslibs were basically invisible to them. There was no way to monitor what was going on with a personal caslib, leaving the system open to issues where one user could consume a lot of resources and have an adverse impact.

Introduced in Viya 3.4, the accessControl.accessPersonalCaslibs action brings all existing personal Caslibs into a session where an administrator has assumed the data or superuser role.

Running the accessControl.accessPersonalCaslibs action has the following effects. The administrator can:

  • See all personal caslibs that existed at the time the action was run
  • View promoted tables characteristics within the personal caslibs
  • Drop promoted tables within other users’ personal caslibs.

This elevation of access remains in effect for the duration of the session. The action does not give an administrator full access to personal caslibs, an administrator can never fetch data from other users’ personal caslibs, drop any personal caslib, set access controls on any personal caslib, or set access controls on any table in any personal caslib. What it does is give the administrator a view into personal caslibs to be able to monitor and troubleshooting their resource usage.

Let’s look at how it works. Logged into Viya as Viya administrator (by default also a CAS administrator), I can use the table.caslibinfo action to see all the caslibs that the administrator has permission to view. In the output below, I see my own personal caslib, and all the other caslibs that the administrator has permissions (set by the CAS authorization system) to view.

cas mysess;
proc cas;
table.caslibinfo;
quit;
cas mysess terminate;

In the code below, the super user role is assumed for this session (SAS Administrators by default can also assume the super user role in CAS). With the role assumed, the administrator can execute the accessControl.accessPersonalCaslibs action and the subsequent table.caslibinfo action returns all caslibs including any personal caslibs that existed when the session started.

cas adminsession;
proc cas;
/* need to be a super user or data administrator */
accessControl.assumeRole / adminRole="superuser";
accessControl.accessPersonalCaslibs;
table.caslibinfo ;
quit;
cas adminsession terminate;

That helps, but what about the details? How can the administrator see how many resources the tables in a personal CASLIB are using? To get the details, we can access an individual CASLIB and its tables, and for each table, execute the table.tabledetails action. The program below will loop all the personal caslibs and, for each of the caslibs, it will loop the in-memory tables and execute the table.tabledetails action. The output of tabledetails gives an idea of how many resources (memory, disk etc.) a table is using.

cas adminsession;
proc cas;
/* need to be a super user */
accessControl.assumeRole / adminRole=”superuser”;
accessControl.accessPersonalCaslibs;table.caslibinfo result=fileresult;
casliblist=findtable(fileresult);
 
/* loop caslibs */
do cvalue over casliblist;
 
if cvalue.name==: ‘CASUSER’ then
do; /* only look at caslibs that contain CASUSER */
 
table.tableinfo result=tabresult / caslib=cvalue.name;
 
tablelist=findtable(tabresult);
x=dim(tablelist);
 
if x>1 then
do; /* there are tables available */
 
do tvalue over tablelist; /* loop all tables in the caslib */
 
table.tabledetails / caslib=cvalue.name name=tvalue.name;
table.tableinfo / caslib=cvalue.name name=tvalue.name;
 
end; /* loop all tables in the caslib */
end; /* there are tables available */
end; /* only look at caslibs that contain CASUSER */
end; /* loop caslibs */
 
accessControl.dropRole / adminRole=”superuser”;
quit;
cas adminsession terminate;

Two fields that can give an administrator an idea of how big a table is are:

  • Data size: the size of the SAS Dataset in memory
  • Memory Mapped: the part of the data that has been “memory mapped” and is backed up in the CAS Disk Cache memory-mapped files.

The table below show the output for one users personal caslib.

If one table in particular is causing problems, it is possible for the administrator to drop the table from memory.

cas adminsession;
proc cas;
 
accessControl.assumeRole / adminRole=”superuser”;
accessControl.accessPersonalCaslibs;
sessionProp.setSessOpt / caslib=”CASUSER(gatedemo499)”;
table.droptable / name=”TRAIN”;
 
quit;
cas adminsession terminate;

Visibility into personal caslibs will be a big help to Viya administrators monitoring CAS resource usage. Check out the following for more details:

Viya administrators can now get personal with users' Caslibs was published on SAS Users.

11月 222019
 

sxlion

大家有一个普通的印象:SAS的更新很慢,很老很落后。可能跟它的版本命名有关,SAS9.0是2004年出来的,到现在都快20年了,版本号 还停留在9字头,并且还没有继续更新的迹象。当然这个与SAS公司的明面

原创文章: ”难道是SAS10?云分析服务时代的到来“,转载请注明: 转自SAS资源资讯列表

本文链接地址: http://saslist.net/archives/454

11月 222019
 

sxlion

大家有一个普通的印象:SAS的更新很慢,很老很落后。可能跟它的版本命名有关,SAS9.0是2004年出来的,到现在都快20年了,版本号 还停留在9字头,并且还没有继续更新的迹象。当然这个与SAS公司的明面

原创文章: ”难道是SAS10?云分析服务时代的到来“,转载请注明: 转自SAS资源资讯列表

本文链接地址: http://saslist.net/archives/454