CAS

October 3, 2020
 

If you're a SAS programmer who now uses SAS Viya and CAS, it's worth your time to optimize your existing programs to take advantage of the new environment. This post is a continuation of my SAS Global Forum 2020 paper Best Practices for Converting SAS® Code to Leverage SAS® Cloud Analytic Services and my SGF 2020 Super Demo.

The best approach for refactoring SAS code for SAS Viya has a few steps:

  • First, "lift and shift" your existing code to run successfully in the compute server for SAS Viya.
  • Next, create CASLIB statements for all of your data sources: sas7bdat files, CSV files, Parquet files, RDBMSs, cloud data sources, and so on (a sample CASLIB statement follows this list).
  • Finally, identify the longest running steps so you know where you have the biggest opportunities. For example, look at steps where the "real time" is 30 minutes or longer, as well as steps that are CPU bound. CPU-bound steps are steps where the CPU time is equal to or greater than the real time for that step.
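For example, a minimal CASLIB statement for a path-based data source might look like the sketch below; the session name, caslib name, and path are illustrative:

cas mysess;
caslib mydata path="/data/projectA" datasource=(srctype="path");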

To help identify those steps, we can leverage a new utility that analyzes SAS logs and creates reports showing the Real Time and CPU Time for each step. Read on to learn more about this final step in the code refactoring process.

Utility: SAS Log Parser

To generate these reports, I created a SAS program that reads all SAS log files in a directory and creates one report per SAS log, plus descending Real Time (clock time) and CPU Time reports. Figure 1 is an example of the report generated for each SAS log. In this report we see the Real Time and CPU Time for each procedure or DATA step. It's derived by picking up on SAS log entries like this:

NOTE: PROCEDURE SGPLOT used (Total process time):
      real time           2.79 seconds
      cpu time            0.08 seconds

NOTE: The SAS System used:
      real time           1:08.86
      cpu time            1:18.18
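As a sketch of the parsing idea (not the actual utility code, and with a hypothetical log path), a DATA step can scan a log for these note lines:

data steptimes;
   infile "/logs/sample.log" truncover;   /* hypothetical log location */
   length logline $256;
   input logline $char256.;
   /* keep step-boundary and timing lines for later parsing */
   if index(logline, '(Total process time):') or
      index(logline, 'real time') or
      index(logline, 'cpu time');
run;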

Note: the Total Time and Total CPU Time fields are populated when the SAS log note "NOTE: The SAS System used:" is encountered. SAS programs that are run in batch or through an RSUBMIT process (via SAS/CONNECT) produce this note at the end of the log.

Figure 1. sampleETS.log report

Descending Real Time Report

Figure 2 contains an example of the descending real time report. In this report we observe in the Step column that the longest running step is a PROC LOGISTIC that takes over 14 hours (Real Time column), from the SAS log called Sample3.log (File Name column). The best way to use this report is to focus on steps that take longer than 30 minutes. In our case we have 9 such steps from 3 SAS logs. Knowing that, we can review the details of each step and then benchmark whether that step would run faster by leveraging SAS® Cloud Analytic Services (CAS). Note: for CAS to process a step, all source and target tables must be CAS tables, and the step must be coded using CAS-enabled syntax (a benchmark sketch follows Figure 2).

Figure 2. Descending Real Time Report
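As a sketch of such a benchmark (the libref, caslib, and table names are illustrative), you might load the source table into CAS and re-run the candidate step against the CAS table, comparing the log times:

cas mysess;
libname mycas cas;   /* CAS engine libref */
proc casutil;
   load data=sashelp.cars outcaslib="casuser" casout="cars";
quit;
/* re-run the candidate step against mycas.cars and compare real and CPU time */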

Descending CPU Time Report

Figure 3 contains an example of the descending CPU Time report, where we observe that the most CPU-intensive step takes over 13 hours (CPU Time column) and comes from Sample3.log (File Name column). Note: if you review the Real Time and CPU Time columns, you should notice that only observation 11 (Obs column) has a CPU Time that exceeds its Real Time, making it CPU bound. However, we would not focus on this step, since its Real Time is less than 30 minutes.

Figure 3. Descending CPU Time Report

Source code for the SAS Log Parser

I've bundled my SAS code for these steps in my GitHub repository for SAS Global Forum 2020. You'll find these programs along with the other code that supports my previous topics of adapting SAS 9 code for SAS Viya.

sasLogParserMacros.sas contains three macros that drive the process. The macro %LIST_FILES lists all files with a given extension, %CHECK checks for the existence of a file and deletes it if found, and %SASLOG parses a SAS log and provides the values found in the reports. When you save this file, ensure you name it sasLogParserMacros.sas and save it in the same directory as sasLogParser.sas.

sasLogParser.sas is the program we submit to produce the reports. It includes sasLogParserMacros.sas and then generates the reports. The only two statements we need to modify are the first two %LET statements in this program. The first %LET statement points to the directory containing the two SAS programs sasLogParserMacros.sas and sasLogParser.sas. The second %LET statement points to the directory containing the SAS logs we want to parse. Note: outside of those two %LET statements, do not change any other statement in either program.
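For example, the top of sasLogParser.sas might be set up like this; the paths are illustrative, and the macro variable names are placeholders rather than the utility's actual names:

%let codedir=C:\myprograms;   /* directory containing sasLogParserMacros.sas and sasLogParser.sas */
%let logdir=C:\mylogs;        /* directory containing the SAS logs to parse */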

Conclusion

In order to understand which steps are good candidates for leveraging the in-memory engine CAS, we must first understand the real time and CPU time of each step. Then we can benchmark which engine in SAS Viya is appropriate for that step -- the compute server or CAS.  The code that I've shared for benchmarking can run within SAS 9 or SAS Viya, on both Windows and Linux.

July 7, 2020
 

Are you looking for a specific CAS action to use in your project? Maybe you need to create a linear or logistic regression and can't seem to find the CAS action? In this post in the Getting Started with Python Integration to SAS® Viya® series, we are going to look at exploring and loading CAS action sets.

In this example, I have already made my connection to CAS using SAS Viya for Learners, and my connection object is named conn. Visit Part 1 of the series to learn more about making a connection to CAS.

Exploring CAS Action Sets

To explore CAS action sets you can use the actionSetInfo CAS action. This should look familiar from Part 2 - Working with CAS Actions and CASResults Objects.

conn.actionsetinfo()['setinfo']

Your results might differ depending on your SAS Viya installation.

The results show the default CAS action sets loaded with SAS Viya. However, many more are available. To see all available CAS action sets, use the parameter all=True. The results from running the command with the all=True parameter include both loaded and available action sets.

conn.actionsetinfo(all=True)['setinfo']

Your results might differ depending on your SAS Viya installation.

As a result of the all=True parameter, we can see over a hundred available CAS action sets. The next question is, how do I load an action set?

Loading a CAS Action Set

I think of loading an action set like importing a package in Python. To load an action set, you need to complete two tasks:

  1. Find the action set
  2. Load the action set

Find the action set

First, let's find the action set. Let's assume we need to complete a logistic regression but are unsure where the necessary CAS action is located. Instead of manually scrolling through the CASResults object from the actionSetInfo CAS action, we can search the SASDataFrame (summarized data from CAS) for keywords.

First, I'll set the SASDataFrame from the CASResults object equal to the variable df by calling the setinfo key.

df = conn.actionsetinfo(all=True)['setinfo']

Next, I'll use the loc method on the SASDataFrame to search for any action set that contains the string regression. There are a variety of ways to do this; I'll use the string contains method. I'll make the search case insensitive by using case=False.

df.loc[df['actionset'].str.contains('regression', case=False), :]

Great! In the results I see a regression CAS action set. That looks exactly like what I need. Next, it's time to load the action set.

Load the action set

To load the action set, use the loadActionSet action with the actionSet='regression' parameter.

conn.loadActionSet(actionSet='regression')


That's it! You have now loaded a CAS action set! Finally, let's explore the actions inside the regression CAS action set.

Exploring CAS Actions in a CAS Action Set

Once a CAS action set is loaded, you can explore the available CAS actions by using the help action with the parameter actionSet='regression'.

conn.help(actionSet='regression')

The results display a list of all the CAS actions in the regression action set. I see the logistic CAS action that I need!
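As a preview of where this leads, a call to the logistic action might look something like the sketch below. The table and variable names ('HEART', 'status', 'height', 'weight') are assumptions for illustration; see the regression action set documentation for the exact parameters.

# Hypothetical sketch: table and variable names are assumed, not from this example's data
conn.regression.logistic(
    table='HEART',
    classVars=['status'],
    model={'depVars': [{'name': 'status'}],
           'effects': [{'vars': ['height', 'weight']}]}
)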

Summary

In conclusion, exploring and loading CAS action sets is important when working on your projects in SAS Viya. A couple of key points to remember:

  • The actionSetInfo CAS action returns all loaded CAS action sets.
  • The actionSetInfo parameter all=True returns all available CAS action sets.
  • The loadActionSet CAS action loads a CAS action set.

In the next post we will talk about exploring data stored in the CAS environment in Exploring Caslibs.

If you enjoyed this blog post, share it with a friend!

Additional Information and Resources

SAS® Viya® 3.5 Actions and Action Sets by Name and Product

Getting Started with Python Integration to SAS® Viya® Index

Getting Started with Python Integration to SAS® Viya® - Part 3 - Loading a CAS Action Set was published on SAS Users.

June 19, 2020
 

In the second post of the Getting Started with Python Integration to SAS® Viya® series we will learn about Working with CAS Actions and CASResults Objects. CAS actions are commands sent to the CAS server to run a task, and CASResults objects contain information returned from the CAS server. This post will cover a few basic CAS actions, and how to easily work with the information returned.

CAS Actions Overview

First, you need to understand CAS actions. Whether performing data preparation, modeling, imputing missing values, or even retrieving information about your CAS session, a CAS action performs a single task on the CAS server. CAS actions are organized into CAS action sets, and each action set contains actions based on common functionality.

In the end, I like to think of a CAS action set as a package, and the CAS actions inside it as methods.
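For example, the analogy shows up directly in the SWAT calling syntax:

# action set ~ package, action ~ method (illustrative)
conn.builtins.actionsetinfo()   # "builtins" is the action set, "actionsetinfo" is the action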

Getting Started with CAS Actions

We will start with a basic CAS action to view all available loaded action sets in the current CAS session. Before you use any CAS action you must connect to the CAS server. For more information on connecting to the CAS server, visit Part 1 of the series.

I have already made my connection to CAS using SAS Viya for Learners, and my connection object is named conn.

display(conn)
CAS('svflhost.demo.sas.com', 5570, 'peter.test@sas.com', protocol='cas', name='py-session-1', session='efff4323-a862-bd6e-beea737b4249')

Next, to view all available action sets loaded in your CAS session use the actionSetInfo CAS action from the Builtins action set on the CAS connection object. CAS actions and CAS action sets are case insensitive, and you do not need to qualify the CAS action with the action set name (although it is best practice to include both). In the example below I qualify the CAS action.
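conn.builtins.actionsetinfo()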

The actionSetInfo CAS action returns the following output:

Your output information might differ depending on your SAS Viya installation.

The output shows a list of the loaded action sets on the CAS server with additional information for each. With a quick scan, it almost looks like a Pandas DataFrame. However, let's go back and run the actionSetInfo CAS action, but this time use the Python type function to see the data type of the output.

type(conn.builtins.actionsetinfo())
swat.cas.results.CASResults

The actionSetInfo CAS action returns a CASResults object. The million dollar question is, what exactly is a CASResults object?

CASResults Object

CAS actions return a CASResults object. A CASResults object is an ordered Python dictionary with additional methods and attributes. There are no specific rules about how many keys are contained in the CASResults object, or what values the keys return. That information depends on the CAS action used.

One thing to note is that CASResults objects are local to your client machine. That is, the data is no longer in CAS; it has been processed by CAS and returned locally. Keep this in mind when you work with big data in CAS and try to return more data than your local computer can handle.

Let's continue by examining the output from our previous example. The CASResults object returns one key named setinfo, and one value. You can quickly tell by looking at the output.

However, another method to view all the keys in the CASResults object is to use the Python dictionary keys method. In this example, it returns one key named setinfo.

conn.actionsetinfo().keys()
odict_keys(['setinfo'])

Be aware of how I specified the CAS action in the above example. This time I did not qualify it with the Builtins CAS action set. While you can specify the action with or without the action set name, I recommend being consistent. I typically do not type the action set name, and I'll keep that convention moving forward.

To view a specific value of a CASResults object you can call the key like a typical Python dictionary. In this example you can call the setinfo key to return the key's value.

conn.actionsetinfo()['setinfo']

As mentioned earlier, the output seems to resemble a Pandas DataFrame. To confirm our suspicion, we can use the type function to see the data type of the value returned by the setinfo key.

type(conn.actionsetinfo()['setinfo'])
swat.dataframe.SASDataFrame

Interesting! The CASResults object contains a SASDataFrame for the setinfo key. What exactly is a SASDataFrame?

Understanding a SASDataFrame

A SASDataFrame is a subclass of a Pandas DataFrame. As a result, you can work with it as you normally would a Pandas DataFrame! Another thing to note is that a SASDataFrame is local data. Again, CAS can process far more data than your local computer can handle, so when bringing data locally, make sure the volume is usable on your local machine. We will discuss this in future posts.

Let's work with the SASDataFrame. First, I will create a new variable named df that holds the SASDataFrame from the actionSetInfo action.

df = conn.actionsetinfo()['setinfo']

Next, I'll use some Pandas methods on the df object. Let's start with the head method to view the first five rows.

df.head()

Or filter the data using the loc method. In this scenario I want to find all action sets with the name simple, and return the actionset and label columns.

df.loc[df['actionset'] == 'simple', ['actionset', 'label']]

Or check unique values of the product_name column by using the value_counts method.

df['product_name'].value_counts()
tkcas        8
crsstat      3
crssearch    1

Or...

No, that's it. I think you get my point! You can work with a SASDataFrame object just as you would a Pandas DataFrame!

CASResults Object With Multiple Keys

Lastly, let's use another CAS action, but this time the CASResults object contains multiple keys, with multiple value types.

In this example I use the serverStatus CAS action from the Builtins CAS action set. The CAS action returns the status of the server.

conn.serverstatus()

From the looks of the output, there are three keys: About, server, and nodestatus. However, let's check our assumption by using the keys method on the CASResults object.

conn.serverstatus().keys()
odict_keys(['About', 'server', 'nodestatus'])

In the output there are three keys as expected. Let's look at the data type of each key by writing a quick Python loop to print the name of the key and the value type returned by that key.

for key,value in conn.serverstatus().items():
      print('Key : {}, Value Type : {}'.format(key,type(value)))
Key : About, Value Type : <class 'dict'>
Key : server, Value Type : <class 'swat.dataframe.SASDataFrame'>
Key : nodestatus, Value Type : <class 'swat.dataframe.SASDataFrame'>

The serverStatus CAS action contains three keys, one key contains a dictionary object, and the other two keys contain a SASDataFrame.

Summary

In conclusion, understanding CAS actions and CASResults objects is essential when working with CAS. A couple of key points to remember:

  • CAS actions reside in CAS action sets and perform a single task on the CAS server.
  • CAS actions return a CASResults object, which is simply a Python dictionary.
  • A CASResults object contains a single or multiple keys, corresponding to a value or values of any data type.
  • A SASDataFrame resides in a CASResults object, and is a subclass of a Pandas DataFrame.

That was a quick overview of a few basic CAS actions and how to work with the CASResults objects they return. Check out the SAS documentation below for all available CAS actions.

Additional Resources

SAS® Viya® 3.5 Actions and Action Sets by Name and Product

Getting Started with Python Integration to SAS® Viya® Series Index

Getting Started with Python Integration to SAS® Viya® - Part 2 - Working with CAS Actions and CASResults Objects was published on SAS Users.

June 19, 2020
 

Index of articles on Getting Started with Python Integration to SAS® Viya®.
Part 1 - Making a Connection
Part 2 - Working with CAS Actions and CASResults Objects
Part 3 - Loading a CAS Action Set
Part 4 - Exploring Caslibs
Part 5 - Uploading Data into CAS
Part 6 - Session vs Global Scope Tables
Part 7 - Exploring Tables? CAS Actions vs SWAT API?

Getting Started with Python Integration to SAS® Viya® - Index was published on SAS Users.

March 5, 2020
 

Have you heard that SAS offers a collection of new, high-performance CAS procedures that are compatible with a multi-threaded approach? The free e-book Exploring SAS® Viya®: Data Mining and Machine Learning is a great resource to learn more about these procedures and the features of SAS® Visual Data Mining and Machine Learning. Download it today and keep reading for an excerpt from this free e-book!

In SAS Studio, you can access tasks that help automate your programming so that you do not have to manually write your code. However, there are three options for manually writing your programs in SAS® Viya®:

  1. SAS Studio provides a SAS programming environment for developing and submitting programs to the server.
  2. Batch submission is also still an option.
  3. Open-source languages such as Python, Lua, and Java can submit code to the CAS server.

In this blog post, you will learn the syntax for two of the new, advanced data mining and machine learning procedures: PROC TEXTMINE and PROC TMSCORE.

Overview

The TEXTMINE and TMSCORE procedures integrate the functionalities from both natural language processing and statistical analysis to provide essential functionalities for text mining. The procedures support essential natural language processing (NLP) features such as tokenizing, stemming, part-of-speech tagging, entity recognition, customized stop list, and so on. They also support dimensionality reduction and topic discovery through Singular Value Decomposition.

In this example, you will learn about some of the essential functionalities of PROC TEXTMINE and PROC TMSCORE by using a text data set containing 1,830 Amazon reviews of electronic gaming systems. The data set is named Amazon. You can find similar data sets of Amazon reviews at http://jmcauley.ucsd.edu/data/amazon/.

PROC TEXTMINE

The Amazon data set has already been loaded into CAS. The review content is stored in the variable ReviewBody, and we generate a unique review ID for each review. In the proc call shown in Program 1 we ask PROC TEXTMINE to do three tasks:

  1. parse the documents in table reviews and generate the term by document matrix
  2. perform dimensionality reduction via Singular Value Decomposition
  3. perform topic discovery based on Singular Value Decomposition results

Program 1: PROC TEXTMINE

data mycaslib.amazon;
    set mylib.amazon;
run;

data mycaslib.engstop;
    set mylib.engstop;
run;

proc textmine data=mycaslib.amazon;
    doc_id id;
    var reviewbody;

 /*(1)*/  parse reducef=2 entities=std stoplist=mycaslib.engstop 
          outterms=mycaslib.terms outparent=mycaslib.parent
          outconfig=mycaslib.config;

 /*(2)*/  svd k=10 svdu=mycaslib.svdu outdocpro=mycaslib.docpro
          outtopics=mycaslib.topics;

run;

(1) The first task (parsing) is specified in the PARSE statement. Parameter "reducef" specifies the minimum number of times a term needs to appear in the text to be included in the analysis. Parameter "stoplist" specifies a list of terms to be excluded from the analysis, such as "the", "this", and "that". Outparent is the output table that stores the term by document matrix, and Outterms is the output table that stores information about the terms included in the term by document matrix. Outconfig is the output table that stores configuration information for future scoring.

(2) Tasks 2 and 3 (dimensionality reduction and topic discovery) are specified in the SVD statement. Parameter K specifies the desired number of dimensions and number of topics. Parameter SVDU is the output table that stores the U matrix from SVD calculations, which is needed in future scoring. Parameter OutDocPro is the output table that stores the new matrix with reduced dimensions. Parameter OutTopics specifies the output table that stores the topics discovered.

Click the Run shortcut button or press F3 to run Program 1. The terms table shown in Output 1 stores the tagging, stemming, and entity recognition results. It also stores the number of times each term appears in the text data.

Output 1: Results from Program 1

PROC TMSCORE

PROC TEXTMINE is used with large training data sets. When you have new documents coming in, you do not need to re-run all the parsing and SVD computations with PROC TEXTMINE. Instead, you can use PROC TMSCORE to score new text data. The scoring procedure parses the new document(s) and projects the text data into the same dimensions using the SVD weights derived from the original training data.

In order to use PROC TMSCORE to generate results consistent with PROC TEXTMINE, you need to provide the following tables generated by PROC TEXTMINE:

  • SVDU table – provides the required information for projection into the same dimensions.
  • Config table – provides parameter values for parsing.
  • Terms table – provides the terms that should be included in the analysis.

Program 2 shows an example of TMSCORE. It uses the same input data layout used for PROC TEXTMINE code, so it will generate the same docpro and parent output tables, as shown in Output 2.

Program 2: PROC TMSCORE

proc tmscore data=mycaslib.amazon svdu=mycaslib.svdu
        config=mycaslib.config terms=mycaslib.terms
        svddocpro=mycaslib.score_docpro outparent=mycaslib.score_parent;
    var reviewbody;
    doc_id id;
run;

 

Output 2: Results from Program 2

To learn more about advanced data mining and machine learning procedures available in SAS Viya, including PROC FACTMAC, PROC TEXTMINE, and PROC NETWORK, you can download the free e-book, Exploring SAS® Viya®: Data Mining and Machine Learning. Exploring SAS® Viya® is a series of e-books that are based on content from SAS® Viya® Enablement, a free course available from SAS Education. You can follow along with examples in real time by watching the videos.

 

Learn about new data mining and machine learning procedures in SAS Viya was published on SAS Users.

February 19, 2020
 

Are you a statistical programmer whose company has adopted SAS Viya? If so, you probably know that the DATA step can run in parallel in SAS Cloud Analytic Services (CAS). As Secosky (2017) says, "running in a single thread in SAS is different from running in many threads in CAS." He goes on to state, "you cannot take any DATA step, change the librefs used, and have it run correctly in parallel. You ... have to know what your program is doing to make sure you know what it does when it runs in parallel."

This article discusses one aspect of "know what your program is doing." Specifically, to run in parallel, the DATA step must use only functions and statements that are "CAS-enabled." Most DATA step functions run in CAS. However, there is a set of "SAS-only" functions that do not run in CAS. This article discusses these functions and provides a link to a list of the SAS-only functions. It also shows how you can get SAS to tell you whether your DATA step contains a SAS-only function.

DATA steps that run in CAS

By default, the DATA step will attempt to run in parallel when the step satisfies three conditions (Bultman and Secosky, 2018, p. 2):

  1. All librefs in the step are CAS engine librefs to the same CAS session.
  2. All statements in the step are supported by the CAS DATA step.
  3. All functions, CALL routines, formats, and informats in the step are available in CAS.

The present article is concerned with the third condition. How can you know in advance whether all functions and CALL routines are available in CAS?

A list of DATA step functions that do not run in CAS

For every DATA step function, the SAS Viya documentation indicates whether the function is supported on CAS or not. For example, the following screenshots of the Viya 3.5 documentation show the documentation for the RANUNI function and the RAND function:

Notice that the documentation for the RANUNI function says, "not supported in a DATA step that runs in CAS," whereas the RAND function is in the "CAS category," which means that it is supported in CAS. This means that if you use the RANUNI function in a DATA step, the DATA step will not run in CAS. (Similarly, for the other old-style random number functions, such as RANNOR.) Instead, it will try to run in SAS. This could result in copying input data from CAS, running the program in a single thread, and copying the final data set into a CAS table. Copying all that data is not efficient.

Fortunately, you do not need to look up every function to determine if it is a CAS-enabled or SAS-only function. The documentation now includes a list, by category, of the Base SAS functions (and CALL routines) that do not run in CAS. The following screenshot shows the top of the documentation page.

Can you always rewrite a DATA step to run in CAS?

For the example in the previous section (the RANUNI function), there is a newer function (the RAND function) that has the same functionality and is supported in CAS. Thus, if you have an old SAS program that uses the RANUNI function, you can replace that call with RAND("UNIFORM") and the modified DATA step will run in CAS. Unfortunately, not all functions have a CAS-enabled replacement. There are about 200 Base SAS functions that do not run in CAS, and most of them cannot be replaced by an equivalent function that runs in CAS.
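For example, here is a minimal before-and-after sketch, where mycas is assumed to be a CAS engine libref:

/* SAS 9 style: RANUNI prevents the step from running in CAS */
data mycas.a;
   x = ranuni(1234);
run;

/* CAS-enabled rewrite: seed the stream, then call RAND */
data mycas.a;
   call streaminit(1234);
   x = rand("uniform");
run;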

The curious reader might wonder why certain classes of functions cannot run in CAS. Here are a few sets of functions that do not run in CAS, along with a few reasons:

  • Functions specific to the English language, such as the SOUNDEX and SPEDIS functions. Also, functions that are specific to single-byte character sets (especially I18N Level 0). Most of these functions are not applicable to an international audience that uses UTF-8 encoding.
  • Functions and statements for reading and writing text files. For example, INFILE, INPUT, and FOPEN/FCLOSE. There are other ways to import text files into CAS.
  • Macro-related functions such as SYMPUT and SYMGET. Remember: There are no macro variables in CAS! The macro pre-processor is a SAS-specific feature, and one of the principles of SAS Viya is that programmers who use other languages (such as Python or Lua) should have equal access to the functionality in Viya.
  • Old-style functions for generating random numbers from probability distributions. Use the RAND function instead.
  • Functions that rely on a single-threaded execution on a data set that has ordered rows. Examples include DIF, LAG, and some permutation/combination functions such as ALLCOMB. Remember: A CAS data table does not have an inherent order.
  • The US-centric state and ZIP code functions.
  • Functions for working with Git repositories.

How to force the DATA step to stop if it is not able to run in CAS

When a DATA step runs in CAS, you will see a note in the log that says:
NOTE: Running DATA step in Cloud Analytic Services.

If the DATA step runs in SAS, no note is displayed. Suppose that you intend for a DATA step to run in CAS but you make a mistake and include a SAS-only function. What happens? The default behavior is to (silently) run the DATA step in SAS and then copy the (possibly huge) data into a CAS table. As discussed previously, this is not efficient.

You might want to tell the DATA step that it should run in CAS or else report an error. You can use the SESSREF= option to specify that the DATA step must run in a CAS session. For example, if MySess is the name of your CAS session, you can submit the following DATA step:

/* use the SESSREF= option to force the DATA step to report an ERROR if it cannot run in CAS */
data MyCASLib.Want / sessref=MySess;
   x = ranuni(1234);    /* this function is not supported in the CAS DATA step */
run;
NOTE: Running DATA step in Cloud Analytic Services.
ERROR: The function RANUNI is unknown, or cannot be accessed.
ERROR: The action stopped due to errors.

The log is shown. The NOTE says that the step was submitted to a CAS session. The first ERROR message reports that your program contains a function that is not available. The second ERROR message reports that the DATA step stopped running. A DATA step that runs on CAS calls the dataStep.runCode action, which is why the message says "the action stopped."

This is a useful trick. The SESSREF= option forces the DATA step to run in CAS. If it cannot, it tells you which function is not CAS-enabled.

Other ways to monitor where the DATA step runs

The DATA step documentation in SAS Viya contains more information about how to control and monitor where the DATA step runs. In particular, it discusses how to use the MSGLEVEL= system option to get detailed information about where a DATA step ran and in how many threads. The documentation also includes additional examples and best practices for running the DATA step in CAS. I recommend the SAS Cloud Analytic Services: DATA Step Programming documentation as the first step towards learning the advantages and potential pitfalls of running a DATA step in CAS.

Summary

The main purpose of this article is to provide a list of Base SAS functions (and CALL routines) that do not run in CAS. If you include one of these functions in a DATA step, the DATA step cannot run in CAS. This can be inefficient. You can use the SESSREF= option on the DATA statement to force the DATA step to run in CAS. If it cannot run in CAS, it stops with an error and informs you which function is not supported in CAS.

The post A list of SAS DATA step functions that do not run in CAS appeared first on The DO Loop.

February 17, 2020
 

Administrators like to be able to keep track of resource usage and who is using what in a system. When an administrator has this capability, they can look out for issues of high resource usage that may have an impact on overall system performance. In Viya, data is accessed in caslibs. In this post, I will show you how an administrator can track and control resource usage of personal caslibs.

A caslib is an in-memory space that holds tables, access control lists, and data source information. In GEL enablement classes, as we discuss CAS and caslibs, one of the big asks has concerned personal caslibs and how an administrator can keep track of the resources they use. Until recently we didn't have a great answer; the good news is that now we do.

A personal caslib is, by default, the active caslib when a user starts a CAS session. This gives each user a default location to access and load data (sort of like a home directory). As the name suggests, it is personal, and only the user who starts the session can access the data. In early releases, administrators saw this as a big problem: personal caslibs were basically invisible to them, with no way to monitor what was going on inside them, leaving the system open to issues where one user could consume a lot of resources and have an adverse impact.

Introduced in Viya 3.4, the accessControl.accessPersonalCaslibs action brings all existing personal caslibs into a session where an administrator has assumed the data administrator or superuser role.

Running the accessControl.accessPersonalCaslibs action has the following effects. The administrator can:

  • See all personal caslibs that existed at the time the action was run
  • View the characteristics of promoted tables within the personal caslibs
  • Drop promoted tables within other users’ personal caslibs.

This elevation of access remains in effect for the duration of the session. The action does not give an administrator full access to personal caslibs: an administrator can never fetch data from other users' personal caslibs, drop any personal caslib, set access controls on any personal caslib, or set access controls on any table in any personal caslib. What it does is give the administrator a view into personal caslibs so they can monitor and troubleshoot resource usage.

Let’s look at how it works. Logged into Viya as Viya administrator (by default also a CAS administrator), I can use the table.caslibinfo action to see all the caslibs that the administrator has permission to view. In the output below, I see my own personal caslib, and all the other caslibs that the administrator has permissions (set by the CAS authorization system) to view.

cas mysess;
proc cas;
table.caslibinfo;
quit;
cas mysess terminate;

In the code below, the super user role is assumed for this session (SAS Administrators by default can also assume the super user role in CAS). With the role assumed, the administrator can execute the accessControl.accessPersonalCaslibs action, and the subsequent table.caslibinfo action returns all caslibs, including any personal caslibs that existed when the session started.

cas adminsession;
proc cas;
/* need to be a super user or data administrator */
accessControl.assumeRole / adminRole="superuser";
accessControl.accessPersonalCaslibs;
table.caslibinfo ;
quit;
cas adminsession terminate;

That helps, but what about the details? How can the administrator see how many resources the tables in a personal caslib are using? To get the details, we can access an individual caslib and its tables and, for each table, execute the table.tabledetails action. The program below loops over all the personal caslibs and, for each one, loops over its in-memory tables and executes the table.tabledetails action. The output of tabledetails gives an idea of how many resources (memory, disk, etc.) a table is using.

cas adminsession;
proc cas;
   /* need to be a super user or data administrator */
   accessControl.assumeRole / adminRole="superuser";
   accessControl.accessPersonalCaslibs;
   table.caslibinfo result=fileresult;
   casliblist=findtable(fileresult);

   /* loop caslibs */
   do cvalue over casliblist;
      if cvalue.name==: 'CASUSER' then
      do; /* only look at caslibs that contain CASUSER */
         table.tableinfo result=tabresult / caslib=cvalue.name;
         tablelist=findtable(tabresult);
         x=dim(tablelist);
         if x>1 then
         do; /* there are tables available */
            do tvalue over tablelist; /* loop all tables in the caslib */
               table.tabledetails / caslib=cvalue.name name=tvalue.name;
               table.tableinfo / caslib=cvalue.name name=tvalue.name;
            end; /* loop all tables in the caslib */
         end; /* there are tables available */
      end; /* only look at caslibs that contain CASUSER */
   end; /* loop caslibs */

   accessControl.dropRole / adminRole="superuser";
quit;
cas adminsession terminate;

Two fields that can give an administrator an idea of how big a table is are:

  • Data size: the size of the SAS Dataset in memory
  • Memory Mapped: the part of the data that has been “memory mapped” and is backed up in the CAS Disk Cache memory-mapped files.

The table below shows the output for one user's personal caslib.

If one table in particular is causing problems, it is possible for the administrator to drop the table from memory.

cas adminsession;
proc cas;
   accessControl.assumeRole / adminRole="superuser";
   accessControl.accessPersonalCaslibs;
   sessionProp.setSessOpt / caslib="CASUSER(gatedemo499)";
   table.droptable / name="TRAIN";
quit;
cas adminsession terminate;

Visibility into personal caslibs will be a big help to Viya administrators monitoring CAS resource usage. Check out the SAS Viya administration documentation for more details.

Viya administrators can now get personal with users' Caslibs was published on SAS Users.

November 22, 2019
 

sxlion

There is a common impression that SAS updates slowly and is old and behind the times. This may have to do with its version naming: SAS 9.0 came out in 2004, almost 20 years ago, yet the version number still starts with a 9 and shows no sign of moving on. Of course, this also has to do with SAS Institute's public...

Original article: "Could it be SAS 10? The Era of Cloud Analytic Services Arrives" (难道是SAS10?云分析服务时代的到来); when reposting, please credit: the SAS Resource News List (SAS资源资讯列表)

Permalink: http://saslist.net/archives/454


September 26, 2019
 

In part 1 of this post, we looked at setting up Spark jobs from Cloud Analytic Services (CAS) to load and save data to and from Hadoop. Now we are moving on to the next step in the analytic cycle: scoring data in Hadoop and executing SAS code as a Spark job. The Spark scoring jobs execute using SAS In-Database Technologies for Hadoop.

The integration of the SAS Embedded Process and Hadoop allows scoring code to run directly on Hadoop. As a result, publishing and scoring of both DS2 and DATA step models occur inside Hadoop. Furthermore, Spark data can be accessed through the SAS Workspace Server or the SAS Compute Server using SAS/ACCESS to Hadoop, or from the CAS server using SAS Data Connectors.

Scoring Data from CAS using Spark

SAS PROC SCOREACCEL provides an interface to the CAS server for DS2 and DATA step model publishing and scoring. Model code is published from CAS to Spark and then executed via the SAS Embedded Process.

PROC SCOREACCEL supports a file interface for passing the model components (model program, format XML, and analytic stores). The procedure reads the specified files and passes their contents on to the model-publishing CAS action. In this case, the files must be visible from the SAS client.

CAS publishModel and runModel actions publish and execute score data in Spark:

 
%let CLUSTER="/opt/sas/viya/config/data/hadoop/lib:
  /opt/sas/viya/config/data/hadoop/lib/spark:/opt/sas/viya/config/data/hadoop/conf";

proc scoreaccel sessref=mysess1;
   publishmodel
      target=hadoop
      modelname="simple01"
      modeltype=DS2
      /* filelocation=local */
      programfile="/demo/code/simple.ds2"
      username="cas"
      modeldir="/user/cas"
      classpath=&CLUSTER.
   ;
   runmodel
      target=hadoop
      modelname="simple01"
      username="cas"
      modeldir="/user/cas"
      server="hadoop.server.com"
      intable="simple01_scoredata"
      outtable="simple01_outdata"
      forceoverwrite=yes
      classpath=&CLUSTER.
      platform=SPARK
   ;
quit;

In the PROC SCOREACCEL example above, a DS2 model is published to Hadoop and executed with the Spark processing engine. The CLASSPATH= parameter specifies a link to the Hadoop cluster. The input and output tables, simple01_scoredata and simple01_outdata, already exist on the Hadoop cluster.

Score data in Spark from CAS using SAS Scoring Accelerator

SAS Scoring Accelerator execution status in YARN as a Spark job from CAS

As you can see in the image above, the model is scored in Spark using the SAS Scoring Accelerator. The Spark job name reflects the input and output tables.

Scoring Data from MVA SAS using Spark

Steps to run a scoring model in Hadoop:

  1. Create a traditional scoring model using SAS Enterprise Miner or an analytic store scoring model, generated using SAS Factory Miner HPFOREST or HPSVM components.
  2. Specify the Hadoop connection attributes: %let indconn= user=myuserid;
  3. Use the INDCONN macro variable to provide credentials to connect to the Hadoop HDFS and Spark. Assign the INDCONN macro before running the %INDHD_PUBLISH_MODEL and the %INDHD_RUN_MODEL macros.
  4. Run the %INDHD_PUBLISH_MODEL macro.
  5. With traditional model scoring, %INDHD_PUBLISH_MODEL performs multiple tasks using some of the files created by the SAS Enterprise Miner Score Code Export node. Using the scoring model program (score.sas file), the properties file (score.xml file), and (if the training data includes SAS user-defined formats) a format catalog, the macro performs the following tasks:

    • translates the scoring model into the sasscore_modelname.ds2 file, which runs scoring inside the SAS Embedded Process
    • takes the format catalog, if available, and produces the sasscore_modelname.xml file. This file contains user-defined formats for the published scoring model.
    • uses SAS/ACCESS Interface to Hadoop to copy the sasscore_modelname.ds2 and sasscore_modelname.xml scoring files to HDFS
  6. Run the %INDHD_RUN_MODEL macro.

The %INDHD_RUN_MODEL macro initiates a Spark job that uses the files generated by %INDHD_PUBLISH_MODEL to execute the DS2 program. The Spark job stores the DS2 program output in the HDFS location specified by either the OUTPUTDATADIR= argument or by an element in the HDMD file.

Here is an example:

 
option set=SAS_HADOOP_CONFIG_PATH="/opt/sas9.4/Config/Lev1/HadoopServer/conf";
option set=SAS_HADOOP_JAR_PATH="/opt/sas9.4/Config/Lev1/HadoopServer/lib:/opt/sas9.4/Config/Lev1/HadoopServer/lib/spark";

%let scorename=m6sccode;
%let scoredir=/opt/code/score;
option sastrace=',,,d' sastraceloc=saslog;
option set=HADOOPPLATFORM=SPARK;

%let indconn = %str(USER=hive HIVE_SERVER='hadoop.server.com');
%put &indconn;
%INDHD_PUBLISH_MODEL( dir=&scoredir.,
        datastep=&scorename..sas,
        xml=&scorename..xml,
        modeldir=/sasmodels,
        modelname=m6score,
        action=replace);

%INDHD_RUN_MODEL( inputtable=sampledata,
        outputtable=sampledata9score,
        scorepgm=/sasmodels/m6score/m6score.ds2,
        trace=yes,
        platform=spark);
Score data in Spark from MVA SAS using Spark

SAS Scoring Accelerator execution status in YARN as a Spark job from MVA SAS

To execute the job in Spark, either set the HADOOPPLATFORM= option to SPARK or set PLATFORM= to SPARK inside the %INDHD_RUN_MODEL macro. The SAS Scoring Accelerator uses the SAS Embedded Process to execute the Spark job, with the job name containing the input and output tables.

Executing user-written DS2 code using Spark

User-written DS2 programs can be complex. When running inside a database, a code accelerator execution plan might require multiple phases. Because the generated Scala program integrates with the SAS Embedded Process program interface to Spark, the many phases of a Code Accelerator job reduce to a single Spark job.

In-Database Code Accelerator

The SAS In-Database Code Accelerator on Spark is a combination of generated Scala programs, Spark SQL statements, HDFS file access, and DS2 programs. The SAS In-Database Code Accelerator for Hadoop enables publishing user-written DS2 thread or data programs to Spark, executing them in parallel, and exploiting Spark's massively parallel processing. Examples of DS2 thread programs include large transpositions, computationally complex programs, scoring models, and BY-group processing.

Below are the DS2 options required to execute as a Spark job:

  • DS2ACCEL: set to YES
  • HADOOPPLATFORM: set to SPARK

There are six different ways to run the Code Accelerator inside Spark, called cases. The generation of the Scala program by the SAS Embedded Process Client Interface depends on how the DS2 program is written. The following example shows Case 2: a thread program and a data program, neither with a BY statement:

 
proc ds2 ds2accel=yes;
   thread work.workthread / overwrite=yes;
      method run();
         set hdplib.cars;
         output;
      end;
   endthread;
run;

   data hdplib.carsout (overwrite=yes);
      dcl thread work.workthread m;
      dcl double count;
      keep count make model;
      method run();
         set from m;
         count+1;
         output;
      end;
   enddata;
run;
quit;

The entire DS2 program runs in two phases. The DS2 thread program runs during Phase One, and its tasks execute in parallel. The DS2 data program runs during Phase Two using a single task.

Finally

With SAS Scoring Accelerator and Spark integration, users have the power and flexibility to process and score modeled data in Spark. SAS Code Accelerator and Spark integration takes that flexibility further, allowing any Spark data to be processed with DS2 code. Furthermore, businesses can now respond to use cases immediately and with greater reliability in the big data space.

Data and Analytics Innovation using SAS & Spark - part 2 was published on SAS Users.