6月 232020
 

When you execute code on the SAS® Cloud Analytic Services (CAS) server, one way you can improve performance is by partitioning a table. A partitioned table enables the data to load faster when that data needs to be grouped by the common values of a variable.

This post explains what it means to partition a table and describes the advantages of a partitioned table. I'll illustrate these concepts with example code that shows improved processing time when you use a partitioned table.

SAS BY-group requires SORT

BY-group processing enables you to group your data by unique variable values. This processing is used, for example, in tasks like merging data sets by a common variable and producing reports that contain data that is grouped by the value of a classification variable. You also use BY-group processing when you execute code in CAS. A key difference in creating BY-groups in SAS versus in CAS is that SAS requires a SORT procedure to sort the data by the specified BY variable in order to create the BY groups.

BY-group sorting implicit in CAS

This step is not required in CAS. When you perform BY-group processing on a CAS table, an implicit sorting action takes place, and each BY-group is distributed to the available threads. This implicit sort process takes place each time that the table is accessed and the BY-groups are requested.

Partitioning a CAS table permanently stores the table such that the values of the BY variable are grouped. Using a partitioned CAS table enables you to skip the implicit sort process each time the table is used, which can greatly improve performance.

Partition CAS action example

You create a partitioned CAS table by using the partition CAS action. The following example shows how to partition the CARS table (created from the SASHELP.CARS data set) by the MAKE variable.

caslib _all_ assign;    ①
data casuser.cars;      
set sashelp.cars;  ②
run;
proc cas;
table.partition /     ③                                              
casout={caslib="casuser", name="cars2"} ④                            
table={caslib="casuser", name="cars", groupby={name="make"}}; ⑤
quit;

In this code:

①  The CASLIB statement creates SAS librefs that point to all existing caslibs. CASUSER, which is used in the subsequent DATA step, is one of the librefs that are created by this statement.

②  The DATA step creates the CARS CAS table in the CASUSER caslib.

③  The partition action in the CAS procedure is part of the TABLE action set.

④  The casout= parameter contains the caslib= parameter, which points to the caslib where the partitioned CAS table named CARS2 will be stored.

⑤ The table= parameter contains the name= parameter, which lists the name of the table that is being partitioned. It also contains the CASLIB= option, which points to the caslib in which the table is stored. The groupby= parameter contains the name= option, which names the variable by which to partition the table.

You can confirm that the table has been partitioned by submitting the following CAS procedure with the tabledetails action.

proc cas;
table.tabledetails result=r / level='partition' ①
name='cars2'
caslib='casuser';
run;
describe r; ②
quit;

In this code:
① The LEVEL= parameter specifies the aggregation level of the TABLEDETAILS output.
② The DESCRIBE statement writes the output of the TABLEDETAILS action to the log.

The following output is displayed in the resulting log. The KEY column shows the variable the table has been partitioned by.

As mentioned earlier, the purpose of partitioning a table is to improve performance. The following example uses two CAS tables that consist of an ID variable with 10 possible values and 10 character variables. Each table contains 5,000,000 rows. This example illustrates how much performance improvement you can gain by partitioning the tables. In this case, the ID variable merges two tables.

First, you create the tables by submitting the following DATA steps:

data casuser.one;
array vars(10) $8 x1-x10;
do j=1 to 5000000;
id=put(rand('integer',1,10),8.);
do i=1 to 10;
vars(i)=byte(rand('integer',65,90));
end;
output;
end;
drop i j;
run;
 
data casuser.two;
array vars(10) $8 x1-x10;
do j=1 to 5000000;
id=put(rand('integer',1,10),8.);
do i=1 to 10;
vars(i)=byte(rand('integer',65,90));
end;
output;
end;
drop i j;
run;

The DATA steps above show how to merge non-partitioned tables. In the log output shown below, you can see that the total, real time (highlighted) took almost 45 seconds to run.

Partitioned tables example

The next example runs the same DATA step code, but it uses partitioned tables. The first step is to partition the tables, as shown below:

proc cas;
table.partition /                                                   
casout={caslib="casuser", name="onepart"}                             
table={caslib="casuser", name="one", groupby={name="id"}};   
run; 
 
table.partition /                                                   
casout={caslib="casuser", name="twopart"}                             
table={caslib="casuser", name="two", groupby={name="id"}};   
quit;

To merge the two tables and to product the log, submit the following code. (The real time is highlighted in the log.)

data casuser.nopart;
merge casuser.onepart casuser.twopart;
by id;
run;

The output for this code is show below:

This time, the DATA Step took only 25.43 seconds to execute. That is a 43% improvement in execution time!

Partition for improved performance

If your analysis requires you to use the same table multiple times to perform BY-group processing, then I strongly recommend that you partition the table. As the last example shows, partitioning your table can greatly improve performance!

Partition your CAS tables to greatly improve performance was published on SAS Users.

6月 222020
 

Sometimes in matrix computations, it is important to display the nonzero elements of a matrix. This can be useful for visualizing the structure of a sparse matrix (one that has many zeros) and it is also useful when describing a matrix algorithm (such as Gaussian elimination) that introduces zeros at each step of the algorithm.

Let A be the matrix you want to visualize. You can visualize the nonzero elements of the matrix A in a graph or in a table:

  • Use a custom SAS format that prints an asterisk (*) for nonzero values and a blank for zero values.
  • Create a discrete heat map of the binary matrix A^=0, which replaces the nonzero cells of A with the value 1.

A text display method

For small matrices, you can use a custom SAS format to print the matrix. The format is defined by using PROC FORMAT. This format displays a star (*) for nonzero values and displays a blank for zero values:

proc format;
value SparseFmt   
      0 = " "      /* display blank instead of 0 */
      other = "*"; /* display '*' instead of number */
run;
 
proc iml;
A = {1.0  0.2  0.0  0.0  0.0  0.0 1e-6,
     0.2  1.0  0.2  0.0  0.0  0.0 0.0,
     0.0  0.2  1.0  0.2  0.0  0.0 0.0,
     0.0  0.0  0.2  1.0  0.2  0.0 0.0,
     1e-6 0.0  0.0  0.2  1.0  0.2 0.0,
     2e-6 1e-6 0.0  0.0  0.2  1.0 0.2,
     3e-6 2e-6 1e-6 0.0  0.0  0.2 1.0 };
 
/* use a custom format to print * or blank in each cell */
print A[format=SparseFmt.];
Visualize a sparse matrix by using a custom format that print a blank or an asterisk

You can see the banded structure of the matrix and also see that elements near the corners are nonzero. This textual visualization is concise and is used in textbooks about numerical linear algebra.

A graphical method

For larger matrices, a graphical method is probably better. You can form the binary logical matrix B=(A^=0) and then plot B by using a two-color heat map. In the SAS/IML language, you can use the HEATMAPDISC call to create a discrete heat map of a matrix. I like to use white for the zeroes. The following SAS/IML statements define a 7 x 7 matrix and use a discrete heat map to display the nonzero values:

/* discrete heat map of binary matrix A^=0 */
ods graphics / width=400px height=400px;
call heatmapdisc(A^=0) colorramp={white steelblue}
         title="Discrete heat map of binary matrix";
Visualize a sparse matrix by using a heat map of a binary matrix where each cell is zero (0) or nonzero (1)

The information in the heat map is identical to the printed display. However, for large matrices, the heat map is easier to view. As shown, you can also use the ODS GRAPHICS statement to control the dimensions of the graph. So, for example, if you are visualizing a square matrix, you can make the graph square.

The post Visualize the structure of a sparse matrix appeared first on The DO Loop.

6月 192020
 

Adverse outcomes, and the rapid spread of COVID-19, have accelerated research on all aspects of the disease. You may have found it overwhelming, and very time-consuming, to find relevant and specialized insights in all the scientific literature out there. To aid researchers in quickly identifying relevant literature about key topics [...]

Speed up your COVID-19 research with text analysis: step-by-step was published on SAS Voices by Melanie Carey

6月 192020
 

In the second post of the Getting Started with Python Integration to SAS® Viya® series we will learn about Working with CAS Actions and CASResults Objects. CAS actions are commands sent to the CAS server to run a task, and CASResults objects contain information returned from the CAS server. This post will cover a few basic CAS actions, and how to easily work with the information returned.

CAS Actions Overview

First, you need to understand CAS actions. From performing data preparation, modeling, imputing missing values, or even retrieving information about your CAS session, CAS actions perform a single task on the CAS server. CAS actions are organized with other CAS actions in a CAS action set. CAS action sets contain actions that are based on common functionality.

In the end, I like to think of CAS action sets as a package, and all the CAS actions inside an action set as a method.

Getting Started with CAS Actions

We will start with a basic CAS action to view all available loaded action sets in the current CAS session. Before you use any CAS action you must connect to the CAS server. For more information on connecting to the CAS server, visit Part 1 of the series.

I have already made my connection to CAS using SAS Viya for Learners, and my connection object is named conn.

display(conn)
CAS('svflhost.demo.sas.com', 5570, 'peter.test@sas.com', protocol='cas', name='py-session-1', session='efff4323-a862-bd6e-beea737b4249')

Next, to view all available action sets loaded in your CAS session use the actionSetInfo CAS action from the Builtins action set on the CAS connection object. CAS actions and CAS action sets are case insensitive, and you do not need to qualify the CAS action with the action set name (although it is best practice to include both). In the example below I qualify the CAS action.

The actionSetInfo CAS action returns the following output:

Your output information might differ depending on your SAS Viya installation.

The output shows a list of the loaded action sets on the CAS server with additional information for each. With a quick scan, it almost looks like a Pandas DataFrame. However, let's go back and run the actionSetInfo CAS action, but this time use the Python type function to see the data type of the output.

type(conn.builtins.actionsetinfo())
swat.cas.results.CASResults

The actionSetInfo CAS action returns a CASResults object. The million dollar question is, what exactly is a CASResults object?

CASResults Object

CAS actions return a CASResults object. A CASResults object is an ordered Python dictionary with additional methods and attributes. There are no specific rules about how many keys are contained in the CASResults object, or what values the keys return. That information depends on the CAS action used.

One thing to note is that CASResults objects are local on your client machine. That is, the data is not in CAS anymore, it has been processed by CAS and returned locally. This is something to keep in mind when you are working with big data in CAS and try to return more data than your local computer can handle.

Let's continue by examining the output from our previous example. The CASResults object returns one key named setinfo, and one value. You can quickly tell by looking at the output.

However, another method to view all the keys in the CAResults object is to use the Python dictionary keys method. In this example, it will return one key named setinfo.

conn.actionsetinfo().keys()
odict_keys(['setinfo'])

Be aware of how I specified the CAS action in the above example. This time I did not qualify it with the Builtins CAS action set. While you can specify with or without the CAS action set, I recommend to be consistent. I typically do not type the CAS action set. I'll keep that method moving forward.

To view a specific value of a CASResults object you can call the key like a typical Python dictionary. In this example you can call the setinfo key to return the key's value.

conn.actionsetinfo()['setinfo']

As mentioned earlier, the output seems to resemble a Pandas DataFrame. To confirm our suspicion, we can use the type function to see the data type of the value returned by the setinfo key,

type(conn.actionsetinfo()['setinfo'])
swat.dataframe.SASDataFrame

Interesting! The CASResults object contains a SASDataFrame for the setinfo key. What exactly is a SASDataFrame?

Understanding a SASDataFrame

A SASDataFrame is a subclass of a Pandas DataFrame. As a result, you can work with them as you normally would a Pandas DataFrame! Another thing to note is a SASDataFrame is local data. Again, as you use CAS you must be aware CAS can handle more data than your local computer can handle. When bringing data locally make sure it's usable on your local machine. We will discuss this in future posts.

Let's work with the SASDataFrame. First, I will create a new variable named df that holds the SASDataFrame from the actionSetInfo action.

df = conn.actionsetinfo()['setinfo']

Next, I'll use some Pandas methods on the df object. Let's start with the head method to view the first five rows.

df.head()

Or filter the data using the loc method. In this scenario I want to find all action sets with the name simple, and return the actionset and label columns.

df.loc[df['actionset'] == 'simple', ['actionset', 'label']]

Or check unique values of the product_name column by using the value_counts method.

df['product_name'].value_counts()
tkcas        8
crsstat      3
crssearch    1

Or...

No that's it. I think you get my point! You can work with a SASDataFrame object like you would a Pandas DataFrame!

CASResults Object With Multiple Keys

Lastly, let's use another CAS action, but this time the CASResults object contains multiple keys, with multiple value types.

In this example I use the serverStatus CAS action from the Builtins CAS action set. The CAS action returns the status of the server.

conn.serverstatus()

From the looks of the output, I see there are three keys; About, server, and nodestatus. However, let's check our assumption by using the keys method on the CASResults object.

conn.serverstatus().keys()
odict_keys(['About', 'server', 'nodestatus'])

In the output there are three keys as expected. Let's look at the data type of each key by writing a quick Python loop to print the name of the key and the value type returned by that key.

for key,value in conn.serverstatus().items():
      print('Key : {}, Value Type : {}'.format(key,type(value)))
Key : About, Value Type : <class 'dict'>
Key : server, Value Type : <class 'swat.dataframe.SASDataFrame'>
Key : nodestatus, Value Type : <class 'swat.dataframe.SASDataFrame'>

The serverStatus CAS action contains three keys, one key contains a dictionary object, and the other two keys contain a SASDataFrame.

Summary

In conclusion, understanding CAS actions and the CASResults objects are essential when working with CAS. A couple of key points to remember:

  • CAS actions reside in CAS action sets and perform a single task on the CAS server.
  • CAS actions return a CASResults object, which is simply a Python dictionary.
  • A CASResults object contains a single or multiple keys, corresponding to a value or values of any data type.
  • A SASDataFrame resides in a CASResults object, and is a subclass of a Pandas DataFrame.

That was a quick overview of a few basic CAS actions, and how to work with the CASResults objects they return.  You can check out the SAS documentation below for all available CAS Actions.

Additional Resources

SAS® Viya® 3.5 Actions and Action Sets by Name and Product

Getting Started with Python Integration to SAS® Viya® Series Index

Getting Started with Python Integration to SAS® Viya® - Part 2 - Working with CAS Actions and CASResults Objects was published on SAS Users.

6月 192020
 

Index of articles on Getting Started with Python Integration to SAS® Viya®.
Part 1 - Making a Connection
Part 2 - Working with CAS Actions and CASResults Objects
Part 3 - Loading a CAS Action Set
Part 4 - Exploring Caslibs
Part 5 - Uploading Data into CAS
Part 6 - Session vs Global Scope Tables
Part 7 - Exploring Tables? CAS Actions vs SWAT API?

Getting Started with Python Integration to SAS® Viya® - Index was published on SAS Users.

6月 172020
 

If you're my kids, the term, “continuous integration,” might have you thinking about how much time you’ve spent lately with the family, and “continuous delivery” is what's been happening on the front porch the past few weeks. But to a software developer, these terms mean something entirely different. Combined and [...]

Coming soon: Your SAS software, continuously updated was published on SAS Voices by Keith Renison