
December 2, 2022
 

Welcome back to my SAS Users blog series CAS Action! - a series on fundamentals. If you'd like to start by learning more about the distributed CAS server and CAS actions, please see CAS Actions and Action Sets - a brief intro. Otherwise, let's learn how to rename columns in CAS tables.

In this example, I will use the CAS language (CASL) to execute the alterTable CAS action. Note that instead of using CASL, I could execute the same action from Python, R, and other languages with slight changes to the syntax for the specific language. Refer to the documentation for syntax in other languages.

Load the demonstration data into memory

I'll start by executing the loadTable action to load the WARRANTY_CLAIMS_0117.sashdat file from the Samples caslib into memory in the Casuser caslib. By default the Samples caslib should be available in your SAS Viya environment. Then I'll preview the CAS table using the columnInfo and fetch CAS actions.

* Connect to the CAS server and name the connection CONN *;
cas conn;
 
proc cas;
   * Specify the output CAS table *;
   casTbl = {name = "WARRANTY_CLAIMS", caslib = "casuser"};
 
   * Load the CAS table *;
   table.loadtable / 
      path = "WARRANTY_CLAIMS_0117.sashdat", caslib = "samples",
      casOut = casTbl;
 
    * Preview the CAS table *;
    table.columnInfo / table = casTbl;
    table.fetch / table = casTbl, to = 5;
quit;

The columnInfo action returns information about each column. Notice that the WARRANTY_CLAIMS CAS table has column names and column labels.

The fetch CAS action returns five rows.

Notice that by default the fetch action uses column labels in the header.

Rename columns in a CAS table

To rename columns in a CAS table, use the alterTable CAS action. In the alterTable action, specify the CAS table using the name and caslib parameters. Additionally, use the columns parameter to specify the columns to modify. The columns parameter requires a list of dictionaries; each dictionary specifies a column to modify.

Here, I'll rename the claim_attribute_1, seller_attribute_5 and product_attribute_1 columns. Then I'll execute the columnInfo action to view the updated column information.

proc cas;
   * Reference the CAS table *;
   casTbl = {name = "WARRANTY_CLAIMS", caslib = "casuser"};
 
   * Rename columns *;
   table.alterTable / 
      name = casTbl['name'], caslib = casTbl['caslib'],
      columns = {
	{name = 'claim_attribute_1', rename = 'Campaign_Type'},
	{name = 'seller_attribute_5', rename = 'Selling_Dealer'},
	{name = 'product_attribute_1', rename = 'Vehicle_Class'}
      };
 
   * View column metadata *;
   table.columnInfo / table = casTbl;
quit;

The results show that the alterTable CAS action renamed the columns to Campaign_Type, Selling_Dealer and Vehicle_Class. While this worked, what if you wanted to rename all columns in the CAS table using the column labels?

Rename all columns using the column labels

I'll dynamically rename the CAS table columns using the column labels. Since the column labels contain spaces, I'll also replace all spaces with underscores. I could manually specify each column and column label in the alterTable action, but why do all that work? Instead, you can dynamically create a list of dictionaries for use in the alterTable action.

proc cas;
   * Reference the CAS table *;
   casTbl = {name = "WARRANTY_CLAIMS", caslib = "casuser"};
 
  * Rename columns with the labels. Spaces replaced with underscores *;
 
   *1. Store the results of the columnInfo action in a dictionary *;
   table.columnInfo result=cr / table = casTbl;
 
   *Loop over the columnInfo result table and create a list of dictionaries *;
   *2*;
   listElementCounter = 0;
   *3*;
   do columnMetadata over cr.ColumnInfo;
	*4.1*; listElementCounter = listElementCounter + 1;
	*4.2*; convertColLabel = tranwrd(columnMetadata['Label'],' ','_');
	*4.3*; renameColumns[listElementCounter] = {name = columnMetadata['Column'], rename = convertColLabel};
   end;
 
   *5. Rename columns *;
   table.alterTable / 
	name = casTbl['name'], 
	caslib = casTbl['caslib'], 
	columns=renameColumns;
 
   *6. Preview CAS table *;
   table.columnInfo / table = casTbl;
quit;
  1. The columnInfo action will store the results in a dictionary named cr.
  2. The variable listElementCounter will act as a counter that can be used to append each dictionary to the list.
  3. Loop over the result table stored in the cr dictionary. When you loop over a result table, each row is treated as a dictionary. The key is the column name and it returns the value of that column.
  4. In the loop:
    1. increment the counter
    2. access the column label and replace all spaces with underscores using the tranwrd function
    3. create a list named renameColumns that contains each dictionary with the column to rename and its new name.
  5. The alterTable action will use the list of dictionaries to rename each column.
  6. The columnInfo action will display the new column information.

The results show that each column was dynamically renamed using the column labels, with spaces replaced by underscores.

Summary

In summary, using the alterTable CAS action enables you to rename columns in a CAS table. With some knowledge of lists, dictionaries and loops in the CAS language, you can dynamically use the column labels to rename the columns. When using the alterTable action remember that:

  • The name and caslib parameters specify the CAS table.
  • The columns parameter requires a list of dictionaries.
  • Each dictionary specifies the column to modify.

Want to learn how to do this using Python? Check out my post Getting started with Python integration to SAS® Viya® - Part 11 - Rename Columns.

Additional resources

simple.freq CAS action
SAS® Cloud Analytic Services: CASL Programmer’s Guide 
CAS Action! - a series on fundamentals
Getting Started with Python Integration to SAS® Viya® - Index
SAS® Cloud Analytic Services: Fundamentals

CAS-Action! Rename Columns in a CAS Table was published on SAS Users.

November 4, 2022
 

This blog was developed from my keynote presentation delivered at the U.S. Department of Energy Solar Energy Technology Office (SETO) one-day workshop focused on building community engagement for lasting impact.

The community that develops around projects is one of the richest aspects of using, developing, and sharing open source software (OSS). This community consists of individuals who are interested enough in your software project to contribute; you, likewise, engage with their work, fostering a reciprocal relationship from which the project benefits. These communities must be created and nurtured, given the large number of competing projects and the domain expertise that is often required to use and contribute to a code base.

I'm offering the following considerations to help you when building and working in an OSS community. The list is thorough but not exhaustive, and the considerations do not have to be addressed sequentially. The first two considerations are action-oriented and the last is prescriptive for maintaining an OSS code repository.

Consideration 1: Human

  • Make it easy to use and contribute to the code.
  • Be responsive to individuals when they inquire about the project through issues, pull requests (PRs), and external repository communication.
  • Build personal relationships to cultivate frequent contributors and long-term support maintainers.
  • Be inclusive by offering suggestions and being open to various types of contributions including code, documentation, and general suggestions.
  • Build careers by finding new user groups to market your project, and include newcomers or those who want to move into technology careers (e.g., Girls Who Code, Operation Code).
  • Never tolerate bad actors and make sure to resolve conflicts quickly. A code of conduct file in the repository is very important.

Consideration 2: Program

  • Establish an outreach approach to engage with individuals who are most likely to be interested in your project and code base (i.e., who has the domain knowledge to help?).
  • Meet people where they are located by considering online and offline community events (e.g., Meetups).
  • Update the repository consistently because individuals will ignore stale repos.
  • Use media for consistent communication including issues, PRs, chat, groups, listservs, and email.
  • Use social media with blogs, tweets, and featured topics and community members.
  • Attend and hold events such as OSS conferences, challenges, hackathons, and office hours.
  • Give out free stuff like stickers, t-shirts, and fun things. Everyone likes a reward!

Consideration 3: Repository

  • Documentation is often overlooked but is probably most important. Individuals will participate more readily when a project has good technical documentation, clear "how to contribute" guidance, and a defined code review process.
  • Code of Conduct file should include the project owner's pledge, standards, responsibilities, scope, and enforcement in case of bad actors.
  • Contributing file informs individuals on how to contribute to the project and code base.
  • License file needs to be included for the project to be considered a "real" OSS project. If needed, work with a legal team to consider proper licensing for contribution and use.
  • Readme file should include a project overview, easy to understand instructions for using the software, and API keys if needed.
  • Issue and PR templates provide a workflow for addressing issues and completing PR merges. One way to complete a review is to assign at least one team member or another contributor to it.
  • Labels including "help wanted" and "good first issue" provide a starting point to notify users of easiest first contributions.

The Open Source program at SAS

Did you know that SAS sponsors a robust open source software program? We host hundreds of code repositories at github.com/sassoftware. Before sharing, each project undergoes a diligent review for appropriate content, license terms, contribution guidelines, and legal considerations. A few of our most popular projects include SASPy (a Python library to work with SAS), the SAS Extension for VS Code, and relic (a tool for software developers, not specific to SAS).

Participation in the open source community is energizing for SAS developers too. Many developers attended the recent All Things Open conference in Raleigh (where SAS was a sponsor). The open source ecosystem benefits SAS developers, the tools that they use and the software they build. By giving back with their own contributions, the developers in SAS R&D feel the connection to a larger community of software professionals.

Building an Open Source Software Community was published on SAS Users.

November 1, 2022
 

SAS SQL handles missing values differently than the ANSI standard for SQL. PROC SQL follows the SAS convention for handling missing values: numerical missing values are always interpreted as smaller than all nonmissing values. My first blog showed that missing values can be troublemakers in non-grouped descriptive statistics. This blog focuses on how to handle tables that contain missing values.

In a single table

Missing values can lead to unexpected results while working with only one table, for example, in WHERE conditions, when sorting data using ORDER BY, and when grouping data using GROUP BY. (See the following examples.)

WHERE

If a column contains missing values, a WHERE condition can lead to undesirable results under certain circumstances. If you want to query all S0666 values smaller than 1.5 in the example code below, a simple WHERE condition would also include missing values.

To keep S0666 values less than 1.5 but exclude missing values, you can extend the WHERE condition by adding IS NOT MISSING.

Example: WHERE

proc sql ;
  select * 
  from SASHELP.SYR1001
  where S0666 < 1.5 
            and S0666 is not missing ;
quit ;

ORDER BY

When sorting data in ascending order, PROC SQL places missing values before all nonmissing values, whether numeric or character. If the ORDER BY clause contains no other columns to sort by, the values in the other columns remain unsorted unless you expressly list them. If your data contain missing values, you may therefore have to add further sort variables to ORDER BY after the column containing missing values, for example, T and S0502 in the example below.

Example: ORDER BY

proc sql ;
  select * 
  from SASHELP.SYR1001
  order by S0666, T, S0502 ;
quit ;

GROUP BY

Missing values have subtle effects that may lead to misleading statistics. The following example demonstrates the effect that missing values have on a seemingly simple calculation of means. To demonstrate, I replaced the missing values in column S0666 with 0s and saved the new column as S0666_0. The means of both S0666 and S0666_0 are calculated as the sum of all available values divided by the number of nonmissing values.

Example: GROUP BY

proc sql ;
       select T, S0666, 
	       avg(S0666) as S0666_MEAN,  
                case
		    when S0666 is missing then 0
                         else S0666 
                  end as S0666_0, 
          avg(calculated S0666_0) as S0666_0_MEAN 
	from SASHELP.SYR1001 ;
quit ;

Output

                   S0666_              S0666_0_
 T      S0666       mean     S0666_0       mean       

 52    1.52000    1.75444    1.52000    0.90229
 53    1.84000    1.75444    1.84000    0.90229
 54    1.96000    1.75444    1.96000    0.90229
 55     .         1.75444    0.00000    0.90229
 56     .         1.75444    0.00000    0.90229
 57     .         1.75444    0.00000    0.90229
 58     .          ...

Because the denominator for S0666_0_mean is larger, the resulting mean is (not surprisingly) lower in the output. However, S0666_mean is also returned for rows whose missing values contributed nothing to its calculation.

When joining tables

If you are about to join tables, you need to check whether the keys of the tables to be joined contain missing values and, if they do, whether those missing values are all of the same type. SAS defines missing values in several ways: besides the standard numeric missing value (.), there are 27 special missing values (.A through .Z and ._). Unfortunately, a numerical column may contain different types of missing values at the same time, and missing numeric values and missing strings are handled completely differently.
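
The following minimal sketch (with made-up data) shows two special missing values being read and sorted. All of them satisfy IS MISSING, yet they are distinct values with their own sort order:

data miss_types ;
  missing A B ;          /* declare A and B as special missing values on input */
  input ID x ;
datalines;
1 .
2 A
3 B
4 7
;
run ;
 
proc sql ;
  select ID, x,
         case when x is missing then 'missing' else 'nonmissing' end as status
  from miss_types
  order by x ;           /* sort order: ._ < . < .A < ... < .Z < nonmissing */
quit ;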

Even if your data contain only one type of numeric or character missing value, you may be in for a big surprise. In the following example, the data sets ONE and TWO are identical except for the column names in TWO. The second and third rows of each have missing values in the key; the following example highlights their effect.

 

data ONE ;
   input ID A B C ;
datalines;
01 1   2  3
.  1   2  3
.  99 99 99
05 1   2  3
;
run ;
data TWO ;
   input ID D E F;
datalines;
01 1   2  3
.  1   2  3
.  99 99 99
05 1   2  3
;
run ;

 

Example: Join

proc sql ; 
create table FJ as select
a.ID, A,B,C,D,E,F
from ONE a full join TWO b
on a.ID=b.ID
order by ID ;
quit ;

 

Output

ID     A     B     C     D     E     F
 .     1     2     3    99    99    99
 .    99    99    99     1     2     3
 .    99    99    99    99    99    99
 .     1     2     3     1     2     3
 1     1     2     3     1     2     3
 5     1     2     3     1     2     3

The missing values in the keys of ONE and TWO generate a Cartesian product: each missing key value of ONE is combined with each missing key value of TWO (the same effect applies to duplicate nonmissing key values as well). With 2 missing values in each table, the join produces 2 × 2 = 4 rows, so even two missing values per table are enough to multiply the associated data rows. Please note also the limited effect of ORDER BY in the presence of missing values. Beware if your keys may contain more than one missing value when you are about to join tables.
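
If the rows with missing keys carry no information for the join, one defensive sketch is to filter them out in inline views before joining. Note that this drops those rows entirely, so use it only when that is acceptable (FJ2 is just an illustrative table name):

proc sql ; 
create table FJ2 as select
a.ID, A,B,C,D,E,F
from (select * from ONE where ID is not missing) a
     full join
     (select * from TWO where ID is not missing) b
on a.ID=b.ID
order by ID ;
quit ;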

Have you had unexpected results from missing values in your data when using PROC SQL? You can learn more about effects caused by missing values in Chapter 2 of my book Advanced SQL with SAS.

Unexpected results from missing values with PROC SQL was published on SAS Users.

October 28, 2022
 

Payment fraud continues to be a challenge for banks. With the increasing number of digital payment types and the ever-growing volumes of real-time payments, real-time fraud detection and prevention are vital.

At the same time, customers are demanding a frictionless customer experience, so fraud detection methods need to be sophisticated enough to maintain the required balance.

While rules-based detection can be a good start, it can be difficult to keep up with fraud shifts and maintain efficacy. Rules tend to be built reactively, based on individual fraud cases rather than on full customer data and behavior.

Machine learning models, behavior profiling, and rules are best deployed together as part of a layered fraud prevention approach. Machine learning models use advanced methodologies and statistical techniques to identify risky payments. They are highly predictive, not only in identifying fraud but also in identifying genuine customer transactions, which ensures the low false positive ratios necessary in today's environment are achievable.

Supervised vs Unsupervised learning

Supervised learning is when the model is trained on labeled data. In the case of a fraud model, this means having accurately tagged fraudulent transactions within the training data set. The model learns from this data to help predict outcomes in the future. Supervised methods include methodologies like linear regression, logistic regression, decision trees and random forest.

In unsupervised learning, the model is not trained on labeled data and instead works to gain its own insights from the data. Unsupervised learning uses clustering and association techniques.

Supervised models generally have better performance and are more predictable than unsupervised ones. However, supervised machine learning techniques require an upfront investment to ensure that the data is tagged correctly to achieve the best results. This may not be possible for organizations whose fraud reporting is not well-controlled, or that, in the case of real-time payments, have few historical fraud events. In these cases, semi-supervised models may be ideal: supervised learning methods are first used to derive a preliminary set of features or models, followed by an application of unsupervised methods to refine the results.

SAS models

SAS provides industry-leading predictive models using a range of machine learning techniques. These are generally built with the bank's own data to give the bank optimal fraud detection, unique to its customers and to the fraud patterns it experiences.

To support behavior profiling, each SAS model is built including SAS’s patented Signature technology, which is a method of storing an entity’s historical transactional information and allows the model to determine someone’s typical habits. SAS models are able to support multi-entity Signatures; these might include customer, account, beneficiary or device, for instance. When scoring a transaction, the model accounts for not only the current transaction, but also the historical behavioral activity of all relevant entities captured by the Signatures. It is widely recognized that historical behavior is predictive of fraud. Typically, there are regular patterns of usage or spending; therefore, deviations from these established patterns may indicate suspicious activity and fraud.

SAS uses the information in these Signatures to derive hundreds of statistical variables that target unusual activity and collectively serve as inputs for the fraud detection model.

Examples of behavioral variables that can be derived from Signatures include, but are not limited to, the following:

  • Average number of transactions in a given period
  • Typical spending amounts
  • Typical bill pay usage
  • Geographical variance
  • Spend velocity

Geographic location variables are very useful for fraud detection; however, customer locations on file aren't always reliable. Sometimes that information is unavailable, so using a fixed home location to calculate distance from home may not be appropriate. SAS therefore developed a proprietary dynamic home, which allows the model to infer the current location of the customer. It uses the locations of clustered transactions to help reduce the false positives that occur when the customer's home location on file is not representative of their true home or the area in which they usually transact.

In addition to behavioral variables, the fraud model relies on risk variables that capture relative fraud risk of various aspects of transactions based on historical data. These risk variables are complementary to behavioral characteristics described above and play a particularly important role in scoring customers and users without regular patterns of spending activity.

SAS begins with thousands of possible model features derived from various input fields. During the modelling process, the number of variables is reduced through a combination of techniques, including applying transformations, Kolmogorov–Smirnov tests, correlation with the target, linear/logistic regression, linear interdependency checks and missing value imputation.

For each of the steps above, SAS follows a threshold-based approach. For example, SAS picks a threshold for the correlation between target and variable. All variables that fall below this threshold are eliminated from modelling. The exact value of the threshold is based on the SAS modelling teams' previous experience, so that the number of candidate variables after filtering remains sufficient to proceed with the subsequent steps.
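
By way of illustration (a generic sketch, not SAS's internal implementation), a correlation-with-target filter of this kind could look as follows in SAS; the data set TRAIN, the target FRAUD_FLAG, the candidate variables VAR1-VAR100 and the 0.02 threshold are all hypothetical:

/* Correlations of each candidate variable with the target */
proc corr data=train outp=corrout noprint ;
   var var1-var100 ;
   with fraud_flag ;
run ;
 
/* Keep only variables whose absolute correlation meets the threshold */
data keep_list ;
   set corrout(where=(_TYPE_='CORR')) ;
   array vars {*} var1-var100 ;
   length name $32 ;
   do i = 1 to dim(vars) ;
      if abs(vars{i}) >= 0.02 then do ;
         name = vname(vars{i}) ;
         output ;
      end ;
   end ;
   keep name ;
run ;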

Finally, SAS also performs several checks to filter out unstable variables.

This reduction invariably results in the selection of the variables that contribute most to model performance. Among the final set, different variables become important as different scenarios are encountered.

SAS models can also benefit from behavior segmentation. An example of behavior segmentation is to separate business and consumer customers early in the modelling process and essentially build separate models with the variables that are most relevant to each segment.

Different modelling techniques such as gradient boosting and neural networks are considered to determine the best technique. In fact, a combination of techniques is often used to achieve the best performance through an ensemble model.

Model Processing graphic

SAS models output a score from 0 to 999 to indicate the likelihood of the transaction being fraudulent. SAS has developed proprietary reason code-generation technology that allows the model output to also include a list of risk reason codes. These reason codes are designed to give end users insight into the relevant risks associated with the model output score. Unlike traditional techniques that use individual variables as reasons, the SAS methodology first groups variables that are correlated and share a similar concept into risk factors. Each of these risk factors is represented by a reason code. The model produces three such risk reason codes that indicate the top three highest-priority reasons why a transaction is likely to be fraudulent (if it is indeed at risk at the time of the transaction).

Case Study 1:

Client A required a model to score real-time payments transactions.

The modelling period was based on 18 months of historical data, which included non-monetary transactions such as logins and detail changes, payment and deposit transactions, and fraud data. This date range ensures coverage of seasonality and major recurring socio-economic events such as tax payments. As fraud reporting is usually delayed, additional months of fraud data were also included to ensure that all the fraud that occurred during the 18-month period was captured.

Various data issues were identified by SAS; the modelling impact was determined, and the issues were rectified in the data at source where possible or documented for consideration in modelling.

The main categories of fraud seen in the data were Account Takeover cases, where the customer's credentials were stolen, and customer-involved cases, where the customer was complicit in giving the fraudster their money or credentials, such as with an investment scam, romance scam, phone phishing, remote access fraud or an email hack.

The Account Takeover cases made up the majority of the fraud transactions, but the customer-involved cases contributed most of the fraud money lost.

Signature entities at the Online user ID and customer level were used to build up a behavioral history across the various channels and transaction types to identify unusual behavior. Variables were developed which looked at the login events and financial transaction patterns. Additionally, SAS’s proprietary dynamic home idea was used to infer the typical locale of the customer to calculate a more accurate distance from home variable.

SAS tried various modelling methods and found that in this case, the gradient boosting method proved to be the most effective at detecting the highly complex and nonlinear patterns in this fraud detection problem.

The model evaluation indicated around a 17% rise in the transaction detection rate and an 11% rise in the value detection rate over the existing payments model.

Case Study 2:

Client B required a model to score payment transactions made from their mobile, personal and business internet banking channels.

The modelling period was based on 17 months of historical data which include payments and fraud data. As with Client A, the data range ensured seasonality was considered and that the fraud during the period was captured.

A customer account Signature and a separate beneficiary account Signature were developed in this case, allowing the model to consider the holistic picture of the sender account and the receiving account independently. Variables were then able to be developed based around the patterns of these entities considering things such as transaction amount, time of day, sender and receiver relationships and maturity. Also, although only a portion of the IP address was provided in the data to maintain PII standards, SAS was able to create geolocation centroids from this which allowed SAS’s proprietary dynamic home idea to be used to inform the model as well.

A strong fraud pattern was detected that could not be identified solely on Signature information where accounts that only received money for a long time suddenly decided to send money. In order to catch such patterns in real-time via the model, SAS designed a new batch job feature to inform the model.

Behavior segmentation was implemented which split the transactions into two segments based on transaction types. The best model performance was achieved by using a gradient boosting trees model for one segment and a neural network model for the other.

During the evaluation period, the model performed very well, with a transaction detection rate of 79.1% and a value detection rate of 75% at an 11.4:1 false positive rate.

Conclusion

Between the dynamic payments landscape and the ever-adapting fraudsters who continue to find new ways to exploit technology and customers, identifying fraud is only becoming more complex. Standard fraud prevention approaches are no longer as effective as they once were. SAS’ advanced machine learning models, which can analyze behaviors and detect suspicious patterns in real-time, can go a long way in helping organizations to better prevent fraud and protect customers.

Learn more

Machine learning models for payment fraud was published on SAS Users.

October 21, 2022
 

Introduction

The ODS Excel destination is popular because it generates presentation-ready output directly from your SAS® software. This destination provides options for the Microsoft Excel worksheet and workbook that automate setting up custom features you would otherwise add manually. It also includes other features such as the ability to add graphics generated by both SAS/GRAPH® and ODS Graphics, the ability to add formulas, and the ability to apply SAS formats that are directly converted to the appropriate Excel formatting. This post demonstrates various formatting issues that can occur when using ODS Excel. It also describes the following:

  • handling those issues using formatting
  • formatting values after you export to Excel

Getting the desired formatting for your values

The ODS Excel destination and the Microsoft Excel application use the Excel General format for all numbers that do not have an Excel format applied. In ODS Excel, you can apply Excel formats either by using a SAS format, which is internally converted to the comparable Excel format, or by applying the Excel format directly using the TAGATTR= style attribute. When you do not apply an Excel format, the Excel General format is used, which makes a best guess about how the output should be formatted. Here are some issues to consider when the Excel General format is used along with other Excel or ODS Excel defaults:

  • Leading zeros are omitted when displaying values in the spreadsheet
  • Numbers greater than 11 digits might be displayed in scientific notation
  • Numbers greater than 15 digits are padded with zeros
  • Character variables that contain commas or other punctuation have that punctuation stripped

Retaining the leading zero

A number with a leading zero does not get preserved when it is exported to a worksheet using ODS Excel. This behavior occurs because the defaults in the Excel General format do not display numbers with a leading zero; this is true whether the value is defined as character or numeric in SAS. The absence of the leading zero matters if this is an actual account number. Output 1 demonstrates the absence of the leading zero for both acct_num and acct_char in record 1.

Output 1. Exported output from ODS Excel does not include the leading zeros

To display the leading zero for variables that are defined as numeric in SAS, apply the Z. SAS format, as demonstrated in Example 1 below for the numeric variable acct_num. Character variables that contain all digits also use the Excel General format by default and require the character format $ to preserve the leading zero. You can also add Excel formats using the TAGATTR= style attribute, which is used here to apply the 0000000 format to the numeric field. For the character variable, the TYPE:STRING parameter of the TAGATTR= style attribute stores the value as a string, which retains the leading zero in Excel. Output 2 shows the formatting that results from running the code in Example 1.

Example 1

data example_1;
input acct_num acct_char $;
cards;
0912556 0912556 
2330202 2330202 
;
run;
 
 
/* Using SAS formatting to preserve the leading zero. */
 
ods excel;
proc print data=example_1;
format acct_num z7.;
format acct_char $7.;
run;
ods excel close;
 
/* Using the TAGATTR= style attribute to preserve the leading zero. */
 
ods excel;
proc print data=example_1;
var acct_num /   style(data)={tagattr="format:0000000"};
var acct_char /  style(data)={tagattr="type:String"};
run;
ods excel close;

Output 2. The leading zero is retained

Preventing formatting with scientific notation

When numbers with values greater than 11 digits are exported using ODS Excel, the cells might be displayed in scientific notation in the worksheet. This issue occurs regardless of how the number was defined in SAS unless an Excel format has been applied. Output 3 illustrates the issue.

Output 3. Outputs long numbers using scientific notation

To correct the formatting, add a SAS format such as the BEST. format for the numeric variable acct_num and the character format $ for the character variable acct_char. You can also apply a custom Excel format directly using a custom number format for the numeric variable and the TYPE:STRING parameter within the TAGATTR= attribute, which forces the type to be a string in Excel. Output 4 shows the formatting that results from running the code in Example 2.

Example 2.

data sample_2;
 input acct_num acct_char $12.;
 cards;
 123456789012 123456789012 
 456789012345 456789012345 
 ;
 run;
 
 
/* Using SAS formatting to display the full value. */
 
 ods excel file="c:\output\temp.xlsx";
 
 proc print data=sample_2;
 format acct_num best12.;
 format acct_char $12.;
 run;
 
 ods excel close;
 
/* Using the TAGATTR= style attribute to display the full value. */
 
 ods excel file="c:\output\temp.xlsx";
 
 proc print data=sample_2;
 var acct_num  / style(data)={tagattr="format:#####"};
 var acct_char / style(data)={tagattr="type:String"}; 
 run;
 
 ods excel close;

Output 4. Displays the full value

Excel pads values with zeros

Excel can store a number only up to 15 significant digits. After the 15th digit, Excel begins to pad the value with zeros, as shown in Output 5.

Output 5. Values are incompletely displayed

If you have a meaningful number that is greater than 15 digits, you must make sure that all digits are displayed. Since Excel cannot display more than 15 digits, this number must be treated as a string, which can be accomplished by using the TYPE:STRING parameter within the TAGATTR= attribute. However, you first must use the BEST. format to obtain all of the digits before storing this value as a string. Output 6 illustrates the formatting that you get after running the code from Example 3.

Example 3.

data sample_3;
 input acct_num acct_char $16.;
 cards;
 1234567890123456 1234567890123456 
 4567890123456789 4567890123456789 
 ;
 run;
 
 ods excel file="c:\output\temp.xlsx";
 
 proc print data=sample_3;
 var acct_num / style(data)={tagattr="type:String"};
 var acct_char; 
 format acct_num best16.;
 format acct_char $16.;
 
 run;
 
 ods excel close;

Output 6. The complete value is displayed

Character variables are stripped of punctuation

If a number contains punctuation such as a comma, the punctuation is stripped when the value is exported to Excel using ODS Excel, as shown in Output 7.

Output 7. Output from the spreadsheet

If the expected behavior is to format the number with the punctuation in Excel, you can format it using both SAS formatting and Excel formatting. Example 4 uses the COMMA SAS format to format the value with commas. This example uses the character format $ to maintain the embedded punctuation within the character variable. Excel formatting is also applied to add the comma to both variables using the TAGATTR= style override. Output 8 shows the formatting that results from running the code in Example 4.

Example 4.

data example_4;
input acct_num comma5. +1 acct_char $5.;
cards;
3,000 4,000 
1,000 6,000 
;
run;
 
/* Using SAS Formatting to add or retain punctuation. */
 
ods excel;
 
proc print data=example_4;
format acct_num comma5.; 
format acct_char $5.;
var acct_num ;
var acct_char;
run;
 
ods excel close; 
/* Using the TAGATTR= style attribute to display the punctuation. */
ods excel;
proc print data=example_4;
var acct_num / style(data)={tagattr="format:#,###"};
var acct_char / style(data)={tagattr="format:#,###"};
run;
ods excel close;

Output 8. Punctuation is added or retained

Conclusion

This blog highlighted some of the potential formatting issues that you could encounter when exporting to Excel using the ODS Excel destination. ODS Excel does an amazing job at limiting some of these issues that exist with Excel in general. Hopefully the tips provided here will help you if you encounter any of these issues.

LEARN MORE | SAS® Output Delivery System (ODS)

Formatting your Microsoft Excel output using ODS Excel was published on SAS Users.

October 18, 2022
 

Chameleons are examples of adaptability to environments, for they change their colors to better blend into their surroundings.

The SAS applications development process typically includes the following three phases of the software development life cycle (SDLC): Development (Dev), Testing (Test) and Production (Prod).

In order to protect the Production (or Operation) environment from unforeseen disruptions, these stages are usually implemented in their own separate environments. Sometimes, Development and Testing environments are combined into a single Development environment, which just treats Testing as an integral part of the Development stage.

(For brevity of the further narrative, we will use the two-stage process/environment paradigm, which can be easily expanded as needed.)

During the software development life cycle, when our programming modules are designed and built (developed), tested, validated and have proved to reliably produce the desired results, we promote (transition) them from Dev to the Prod environment.

The software promotions are often managed by change control procedures adopted by your organization.

Software promotion limitations

However, no matter how strict the change control procedures are or how scrupulously we conduct testing and promotion from Dev to Prod, we must recognize the limitations of this paradigm. The alternative would be a fallacy of dreams in “wishful thinking” territory.

Let’s look into two highly desirable, but rarely attainable, “dreams”.

Dream 1

If software is validated and reliably works in Dev, it is guaranteed to work just fine in Prod, and therefore no testing in Prod is required.

Reality 1

Although thorough testing in Dev will keep failures in Prod to a minimum, it cannot completely eliminate them or guarantee your code's success in Prod. Even if a promoted module does not require a single change and you just copy it from Dev to Prod, there are still many ways it can fail in Prod. Consider potential environmental differences such as data sources (structures and access), file system permissions, and SAS metadata groups and permissions.

You will need to account for these environmental differences and still conduct some testing in Prod to make sure that the environment itself does not break your code/software.

Dream 2

When software is developed, tested and validated in Dev, promoting it to Prod is just a matter of copying it from Dev to Prod without any changes/adjustments.

Reality 2

In actuality, environmental differences may again require adjusting your code for the new environment. For example, data sources and data targets may have different library names or library references, or different directory or file names. If your application automatically sends out an email to a distribution list of your application users, you will most likely have different distribution lists in Dev vs. Prod.

Of course, you may capture all these differences into a configuration file, which will keep all the differences between Dev and Prod environments out of your code. However, in this case, your application becomes data-driven (with the configuration file being a data driver) and, as such, the configuration file itself becomes part of your software. Obviously, these configuration files simply cannot be copied from Dev to Prod during the promotion since they are inherently different.
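
For instance, the Dev configuration file in this approach might contain nothing but environment-specific assignments. This is a minimal sketch, with a hypothetical file name and values borrowed from the adaptive example later in this post:

/* config_dev.sas - the Development version of the configuration file */
%let metasrv = DEVMETASRV;
%let lname   = SQL CORP DEV;
%let outsrv  = DEVOUTSRV;
%let tolist  = "developers@yourdomain.com";

The Production copy would assign the same macro variables with Prod-specific values, which is exactly why the two files can never be interchanged.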

Therefore, you cannot say that for promotion from Dev to Prod you can just copy your software. At least that is not true for that configuration file.

This reality makes the promotion process quite sensitive and susceptible to errors. Besides, we must maintain two versions of the application (Dev and Prod) and take care not to mix them up or overwrite one with the other.

Adaptive programming

Adaptability is the ability to adjust to different environments, much like a chameleon changes its color to better blend into its surroundings.

If we can develop a program that automatically adjusts itself to run in either the Development or Production environment, it would mean that we have found a solution making Dream 2 a reality.

You can copy such a program back and forth between the environments without changing a thing in it, because it is identical in both Dev and Prod.

You can even store a single version of this adaptive program in a centralized shared location and access it to run by either the Development or Production server. This holds true even though these environments may use different data sources, different data targets, different email distribution lists…

The key for such an implementation is the program’s self-awareness when it knows which environment it is running in, and based on that knowledge adapts to the environment by adjusting its configuration variables.

But these are just words of make-believe promises. Let’s get to the proof.

Identifying the environment your SAS program runs in

Suppose you have SAS software installed in the following two environments:

  • Development, where the SAS application server runs on computer DEVSRV.YOURDOMAIN.COM
  • Production, where the SAS application server runs on computer PRODSRV.YOURDOMAIN.COM

If you have SAS® Enterprise BI Server or SAS® BI Server installed you may have your metadata servers either on the same device/computer/machine/server as the application servers or on separate machines. Let’s say these are separate machines: DEVMETASRV.YOURDOMAIN.COM and PRODMETASRV.YOURDOMAIN.COM.

When we run a program in a particular environment (server), we usually know in which environment we run this program. However, in order to build an adaptive program we need to make sure the program itself knows which environment it is running in.

The key here is the SYSHOSTNAME automatic macro variable that the SAS System conveniently makes available within any SAS session. It contains the host name of the computer that is running the SAS program. In our adaptive program, &SYSHOSTNAME macro variable reference will be equal to either DEVSRV or PRODSRV depending on which environment our program runs in.
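
As a quick sanity check, you can write the host name to the SAS log before relying on it:

%put NOTE: This program is running on host &syshostname;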

Making a SAS program adapt to the Development or Production environment

Now, when our program knows its running environment, we can make it adjust to that environment.

Let’s consider a simplified scenario where our program needs to do the following different things depending on the running environment.

Adaptability use case

1. Read a source data table SALES from a metadata-bound library:

  • In DEV: library named ‘SQL CORP DEV’;
  • In PROD: library named ‘SQL CORP’.

2. Produce a target Excel file, report.xlsx, and save it in a specified directory on a non-SAS machine:

  • In DEV: \\DEVOUTSRV\outpath
  • In PROD: \\PRODOUTSRV\outpath

3. Send out an email to a distribution list (contact group) of users when the report is ready:

  • In DEV: a developers-only distribution list
  • In PROD: a distribution list of business users and developers

Adaptive code implementation

The following SAS code illustrates this adaptive technique:

/* ----- Configuration section ----- */
 
%if "&syshostname"="DEVSRV" %then
%do; /* Development server */
   %let metasrv = DEVMETASRV;
   %let lname = SQL CORP DEV;
   %let outsrv = DEVOUTSRV;
   %let tolist = "developers@yourdomain.com";
   options symbolgen mprint mlogic fullstimer;
%end;
%else
%if "&syshostname"="PRODSRV" %then
%do; /* Production server */
   %let metasrv = PRODMETASRV;
   %let lname = SQL CORP;
   %let outsrv = PRODOUTSRV;
   %let tolist = "businessusers@yourdomain.com" "developers@yourdomain.com";
   options nosymbolgen nomprint nomlogic nofullstimer;
 
   /* adjust email distribution list based on test_trigger_YYYYMMDD.txt file existence */
   /* (&projpath is assumed to be defined elsewhere in the application) */
   %let yyyymmdd = %sysfunc(date(),yymmddn8.);
   %if %sysfunc(fileexist(&projpath\test_trigger_&yyyymmdd..txt)) %then
      %let tolist = "testers@yourdomain.com";
%end;
%else
%do; /* Unauthorized server */
   %put ERROR: This program is not designed to run on &syshostname server;
   %put ERROR: SAS session has been terminated.;
   endsas;
%end;
 
options 
   metaserver="&metasrv" metaport=8561 metarepository=Foundation metaprotocol=bridge
   metauser='service-account-ID' metapass="encrypted-sas-password";
 
libname SRCLIB meta "&lname";
 
%let outpath = \\&outsrv\outpath;
%let outname = Report.xlsx;
 
/* ----- End of Configuration section ----- */
 
/* Produce Excel report */
ods excel file="&outpath\&outname";
proc print data=SRCLIB.SOMETABLE;
   where Product='Loan';
run;
ods excel close;
 
/* Send out email */
filename fm email to=(&tolist) from='sender@mydomain.com' subject='Your subject';
 
data _null_;
   file fm;
   put '*** THIS IS AUTOMATICALLY GENERATED EMAIL ***' //
       "Loan report job completed successfully on &syshostname server." /
       "The following Excel file has been generated: &outpath\&outname" //
       'Sincerely,' / 'Your automation team';
run;

Code highlights

As you can see, in the configuration section we conditionally set various macro variables (including &tolist distribution list) and global options depending on where this code runs, in Dev or Prod, determined by the &SYSHOSTNAME macro variable. For unauthorized &SYSHOSTNAME values, we write a relevant message to the SAS log and terminate the SAS session.

Then we establish connection to the metadata server, assign source data library SRCLIB as well as output directory (&outpath) and output file name (&outname).

Based on these conditionally defined macro variables and libref, in the remaining sections, we:

  • produce the Excel report (ODS EXCEL destination with PROC PRINT);
  • send out the notification email (FILENAME EMAIL engine with a DATA _NULL_ step).

You can wrap this configuration section into a macro and store it in a separate file; then the main program would just invoke this macro, as sketched below. In either case, the code achieves total environmental awareness and naturally adapts to the environment in which we designed it to function.
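
Here is a minimal sketch of that arrangement; the shared path and file name are hypothetical. Note that macro variables created inside a macro are local by default, so they must be declared global first:

/* set_environment.sas - stored in a shared location */
%macro set_environment;
   %global metasrv lname outsrv tolist;
   %if "&syshostname"="DEVSRV" %then
   %do; /* Development settings */
      %let metasrv = DEVMETASRV;
      %let lname = SQL CORP DEV;
      %let outsrv = DEVOUTSRV;
      %let tolist = "developers@yourdomain.com";
   %end;
   %else
   %if "&syshostname"="PRODSRV" %then
   %do; /* Production settings */
      %let metasrv = PRODMETASRV;
      %let lname = SQL CORP;
      %let outsrv = PRODOUTSRV;
      %let tolist = "businessusers@yourdomain.com" "developers@yourdomain.com";
   %end;
%mend set_environment;
 
/* Main program, identical in Dev and Prod */
%include "\\SHAREDSRV\sas\common\set_environment.sas";
%set_environment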

This adaptive coding approach ensures 100% equality between Dev and Prod code versions. In fact, the two versions of the code are identical. They are self-contained, self-reliant, and do not require any adjustments during promotion throughout the development life cycle. You can copy them from one environment to another (and vice versa) without fear of accidentally wiping out and replacing the right file with the wrong one.

Can we say that this implementation makes Dream 2 come true?

Testing in the Production environment

Notice the little section dubbed /* adjust email distribution list based on test_trigger_YYYYMMDD.txt file existence */ within the Production server logic. It allows for triggering a test run, in this case by limiting the scope of email distribution in Prod. Just create a file named test_trigger_YYYYMMDD.txt in the specified location &projpath and run your code on the desired test date YYYYMMDD. You can delete this file afterwards if a full-scale run is scheduled for the same date, or otherwise keep it for future reference (it becomes "harmless" and irrelevant for any other dates). You can use the same trigger-file tactic to modify other parts of your code as needed in both Prod and Dev.

Even though this feature does not make Dream 1 come true, it does alleviate its Reality.

Questions? Thoughts? Comments?

Do you find this blog post useful? Do you have questions, concerns, suggestions, or comments? Please share with us below in the Comments section.

Additional Resources

Adaptive SAS programming for the Software Development Life Cycle was published on SAS Users.

October 15, 2022
 

In my previous blog Programmatically export a Visual Analytics report to PDF - SAS Users, I used the SAS Visual Analytics SDK to export a report to PDF, which is quite simple if you have basic knowledge of JavaScript programming. It works for both the latest version of SAS Viya and version 3.5. The latest version of SAS Viya offers improvements, including the option to export a VA report to PDF using the REST API, with no JavaScript programming needed. This is what I'll discuss in this post.

The API under the Visual Analytics category in the latest SAS Viya provides the ability to export a report, or a report object, to a PDF file. It also provides the ability to create and run a job to do the exporting. In fact, we can export a report to PDF, an image, a package, and data using the APIs. All are quite straightforward. In this article, I will show how to export a report or report object to a PDF file directly, and how to create and run a job to export to a PDF.

Get all the API links of Visual Analytics

The API under Visual Analytics provides the ability to retrieve all the API links via the HTTP 'GET' method. Be sure to set "Accept" = "application/vnd.sas.api+json" in the HEADERS statement of PROC HTTP. Below is my sample code snippet; I define a JSON library so we can inspect the output of PROC HTTP visually.

%let BASE_URI=%sysfunc(getoption(SERVICESBASEURL));
FILENAME vaJason TEMP ENCODING='UTF-8';
FILENAME hdrout TEMP ENCODING='UTF-8';
PROC HTTP METHOD="GET" oauth_bearer=sas_services out=vaJason headerout=hdrout
    URL = "&BASE_URI/visualAnalytics/";
    HEADERS "Accept" = "application/vnd.sas.api+json";
RUN;
LIBNAME vaJason json;

If we see the message of ‘200 OK’ returned (something like below), we know the PROC runs successfully.

Now in SAS Studio, if I go to the 'Libraries' tab and double-click the LINKS table in the VAJASON library, all the API links of Visual Analytics are listed in the 'href' column as shown below. We see the support of exporting the report PDF, image, package, and data with the corresponding method and href.

Export a report or report object to PDF

Now, let me export a report to PDF directly. Below is the code snippet I used. With the FILENAME statement, the exported PDF will be saved in a physical location (I save it as the rpt.pdf file in the /tmp/ folder). Be sure to set "Accept" = "application/pdf" in the HEADERS statement of PROC HTTP. In my example, I export a report with two report objects: a bar chart and a forecasting object.

%let BASE_URI=%sysfunc(getoption(SERVICESBASEURL));
FILENAME rptFile "/tmp/rpt.pdf"; 
PROC HTTP METHOD="GET" oauth_bearer=sas_services OUT=rptFile headerout=hdrout 
    URL = "&BASE_URI/visualAnalytics/reports/d940126c-f917-4a13-8e1a-51b6729f50ec/pdf";
    HEADERS "Accept" = "application/pdf"
            "Accept-Language" = "*"
            "Accept-Locale" = "en-US";
RUN;

Run the code, and if we see the message of ‘200 OK’ returned, we know the export succeeded. We can go to the /tmp/ folder and check the rpt.pdf file there.

Next, let me export a report object to PDF. If you are not familiar with the object composition of a VA report, refer to my earlier post Discover Visual Analytics Report Paths with REST APIs. Unlike exporting a report, I need to set the 'reportObjects' parameter for the exported object. With the 'GET' method in PROC HTTP, I use the QUERY option to set all the parameters I want to use for the object. For example, I set some cover page text. Below is the code snippet for report object exporting.

%let BASE_URI=%sysfunc(getoption(SERVICESBASEURL));
FILENAME rptFile "/tmp/rpt.pdf"; 
PROC HTTP METHOD="GET" oauth_bearer=sas_services OUT=rptFile headerout=hdrout 
    URL = "&BASE_URI/visualAnalytics/reports/d940126c-f917-4a13-8e1a-51b6729f50ec/pdf"  
    QUERY = ("reportObjects"="ve58" "includeCoverPage"=true "coverPageText"="This is cover page for a report object.");
    HEADERS "Accept" = "application/pdf"
            "Accept-Language" = "*"
            "Accept-Locale" = "en-US";
RUN;

Similarly, if we see the message of ‘200 OK’ returned, we know the export runs successfully. The following screenshots show the exported report PDF and the exported report object PDF, respectively.

Create and run a job to export a PDF

Besides exporting a report or report object directly, the API under Visual Analytics provides the ability to asynchronously execute the export job. The differences between direct export and job export are:

  • The ‘POST’ method is used for the job export action.
  • To export a report or report object by running a job, we need to apply the rendering option values in the request object, as well as options for the creation of the PDF.
  • Job export saves the exported PDF file to a SAS Viya Content server folder, not a physical disk location. The PDF file can then be downloaded to local disk from SAS Studio or SAS Drive.

Below is the code snippet for creating a job to export a report PDF. Be sure to set "Accept" = "application/vnd.sas.visual.analytics.report.export.pdf.job+json" and "Content-Type" = "application/vnd.sas.visual.analytics.report.export.pdf.request+json" in the HEADERS statement of PROC HTTP.

%let BASE_URI=%sysfunc(getoption(SERVICESBASEURL));
FILENAME hdrout TEMP ENCODING='UTF-8';
 
PROC HTTP METHOD="POST" oauth_bearer=sas_services headerout=hdrout 
    URL = "&BASE_URI/visualAnalytics/reports/d940126c-f917-4a13-8e1a-51b6729f50ec/exportPdf"  
     IN = '{
            "resultFolder": "/folders/folders/9d78f045-e7d9-4e82-b4aa-c7220cb85558",
            "resultFilename": "Exported PDF File.pdf",
            "nameConflict": "replace",
            "wait": 30,
            "timeout": 60,
            "options": {
                "orientation": "landscape",
                "paperSize": "A4",
                "showPageNumbers": true,
                "includeCoverPage": true,
                "coverPageText": "This is cover page for export pdf job."
                },
            "version": 1
            }'
      ;
    HEADERS "Accept" = "application/vnd.sas.visual.analytics.report.export.pdf.job+json"
            "Content-Type" = "application/vnd.sas.visual.analytics.report.export.pdf.request+json" 
            "Accept-Language" = "*"
            "Accept-Locale" = "en-US";
RUN;

If we see the message of ‘201 Created’ returned as shown below, we know the export job runs successfully.

The screenshot below shows the exported report PDF.

Finally

In summary, for the latest version of SAS Viya, the REST API under the Visual Analytics category provides an easy way to export a report or a report object to a PDF file, either directly or by a job asynchronously. We can also easily export a report object to an image, the report data to CSV, TSV, or XLSX, and the report resources to a package. You are encouraged to find more at Visualization – Visualization API Reference (sas.com).

Export a Visual Analytics report using REST APIs was published on SAS Users.

October 14, 2022
 

Welcome to the continuation of my series Getting Started with Python Integration to SAS Viya. In previous posts, I discussed how to connect to the CAS server, how to execute CAS actions, and how to summarize columns. Now it's time to focus on how to rename columns in CAS tables.

Load and explore data

In SAS Viya we can use the Samples caslib for demonstration data. I'll load the data into the distributed CAS server using my CAS connection object conn, followed by the loadTable action. The loadTable action loads a server-side file into memory. Then I'll reference the CAS table in the tbl object and view the column information using the columnInfo CAS action. Remember, the data in this example is small for training purposes. Processing data in the CAS server's massively parallel processing environment is typically reserved for larger data.

conn.loadTable(path = 'WATER_CLUSTER.sashdat', caslib = 'samples',
                            casOut = dict(caslib = 'casuser'))
 
tbl = conn.CASTable('water_cluster', caslib='casuser')
 
tbl.columnInfo()

Rename CAS table columns

To rename a column in a CAS table you will need to use the alterTable CAS action. The alterTable action can rename columns, but it can also rename tables, add labels to tables, drop or keep columns, and modify column attributes like column labels. This post will only focus on how to use the rename feature of the action.

As of SWAT package version 1.11.0, there is no Pandas equivalent for renaming columns. Techniques like the rename method or explicitly assigning a list of new column names to the CASTable object are not available.

Rename a single column

For example, what if you want to rename the Year column to Year_Value? To rename columns with the alterTable action, you specify the columns parameter with a list of dictionaries; each dictionary specifies a column to modify. In this example, specify the columns parameter and add a list with a single dictionary. The dictionary requires the name key to specify the column name, and the rename key specifies the new column name. Lastly, execute the columnInfo action to view the updated column names.

tbl.alterTable(columns = [
                {'name':'Year', 'rename':'Year_Value'}
              ])
tbl.columnInfo()

 

Notice in the results the Year column has been renamed to Year_Value.

Rename multiple columns

To rename multiple columns, simply add additional dictionaries to the list. For example, if you want to rename the Month column to Month_Value and the Day column to Day_Value, add two dictionaries to the list. Here I'll create the list of dictionaries in the variable renameColumns, then use it in the alterTable action.

## Columns to rename
renameColumns = [
    {'name': 'Month', 'rename':'Month_Value'},
    {'name': 'Day', 'rename':'Day_Value'}
]
 
## Rename the CAS table columns
tbl.alterTable(columns = renameColumns)
 
## View the column information of the CAS table
tbl.columnInfo()

In the results notice that the Month column has been renamed to Month_Value, and the Day column has been renamed to Day_Value.

Dynamically rename columns

What if you want to dynamically rename columns? For example, what if you want to remove all underscores in the column names and lowercase all characters? You could manually create a dictionary for each column to rename, but that's a lot of manual work. Instead, you can use a Python list comprehension to create a list of dictionaries with the columns to rename. I'll store the results of the list comprehension in the variable newColumnNames.

newColumnNames = [{'name':colName, 'rename':colName.replace("_","").lower()} for colName in tbl.columns]
display(newColumnNames)
# and the results
[{'name': 'Year_Value', 'rename': 'yearvalue'},
 {'name': 'Month', 'rename': 'month'},
 {'name': 'Day', 'rename': 'day'},
 {'name': 'Date', 'rename': 'date'},
 {'name': 'Serial', 'rename': 'serial'},
 {'name': 'Property', 'rename': 'property'},
 {'name': 'Address', 'rename': 'address'},
 {'name': 'City', 'rename': 'city'},
 {'name': 'Zip', 'rename': 'zip'},
 {'name': 'Lat', 'rename': 'lat'},
 {'name': 'Long', 'rename': 'long'},
 {'name': 'Property_type', 'rename': 'propertytype'},
 {'name': 'Meter_Location', 'rename': 'meterlocation'},
 {'name': 'Clli', 'rename': 'clli'},
 {'name': 'DMA', 'rename': 'dma'},
 {'name': 'Weekday', 'rename': 'weekday'},
 {'name': 'Weekend', 'rename': 'weekend'},
 {'name': 'Daily_W_C_M3', 'rename': 'dailywcm3'},
 {'name': 'Week', 'rename': 'week'},
 {'name': 'US Holiday', 'rename': 'us holiday'},
 {'name': 'CLUSTER', 'rename': 'cluster'}]

Notice it creates a list of dictionaries, one for each column to rename. The list comprehension removes any underscores in a column name, then lowercases all of its characters.
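One thing to note: the comprehension only strips underscores, so the space in US Holiday survives as us holiday. If you would rather convert spaces to underscores as well, a small extension of the comprehension (a sketch) handles both.

## a variant that also converts spaces to underscores,
## so 'US Holiday' becomes 'us_holiday' instead of 'us holiday'
newColumnNames = [{'name':colName, 'rename':colName.replace("_","").replace(" ","_").lower()}
                  for colName in tbl.columns]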

Now that the list of dictionaries is created, add it to the alterTable action's columns parameter.

"tbl.alterTable(columns = newColumnNames)
tbl.columnInfo()

Summary

The SWAT package blends the world of Pandas and SAS, making many familiar Pandas techniques available for working with your distributed CAS tables. However, the familiar Pandas techniques for renaming columns are currently not available in the SWAT package. Instead, you can use the alterTable action to easily rename columns.

Additional and related resources

Getting Started with Python Integration to SAS® Viya® - Part 11 - Rename Columns was published on SAS Users.

October 6, 2022
 

Group and aggregate CAS tables

Welcome to the continuation of my series Getting Started with Python Integration to SAS Viya. In previous posts, I discussed how to connect to the CAS server, how to execute CAS actions, and how to summarize columns. Now it's time to focus on how to group and aggregate CAS tables.

Load and explore data

First, let's load some data. In SAS Viya we can use the Samples caslib for demonstration data. I'll load the data into the distributed CAS server using my CAS connection object conn, followed by the loadTable action. The loadTable action loads a server-side file into memory. Then I'll reference the CAS table in the tbl object and preview 5 rows using the SWAT head method. Remember, the data in this example is small for training purposes. Processing data in the CAS server's massively parallel processing environment is typically reserved for larger data.

conn.loadTable(path = 'WATER_CLUSTER.sashdat', caslib = 'samples',
               casOut = dict(caslib = 'casuser'))
 
tbl = conn.CASTable('water_cluster', caslib='casuser')
 
tbl.head()

 

The water_cluster CAS table contains daily water usage for a variety of properties. The Daily_W_C_M3 column displays the water used per location in cubic meters (m3), the Serial column identifies each property, and the Weekend column identifies whether the reading occurred on a weekend.

Using the SWAT groupby method

The SWAT package contains the groupby method. It is defined to match the pandas.DataFrame.groupby() method. For example, I'll call the groupby method on the CAS table object tbl, specify the Serial column, and enclose the call in the type function to view the type of object returned.

type(tbl.groupby('Serial')) 
# and the results
swat.cas.table.CASTableGroupBy

Notice the results return a CASTableGroupBy object. This works similarly to the pandas DataFrameGroupBy object. This enables you to add aggregation methods to aggregate by groups. For example, what if you want to see the total water consumption for each Serial value?

First, specify the CASTable object tbl, followed by the groupby method to aggregate by each Serial group. Then specify the column to aggregate, Daily_W_C_M3, and then add the sum aggregation method. Here the results are stored in the variable df_serial and displayed.

df_serial = (tbl                    ## CAS table reference
             .groupby('Serial')     ## Group the CAS table
             .Daily_W_C_M3          ## Specify the CAS table column to aggregate
             .sum()                 ## Specify the aggregation
)
 
display(df_serial)

partial results

The code executes in the distributed CAS server and returns a Series object to the client. Once you have a Series on the client you can work with it as you normally would in Pandas.
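For example, because df_serial is an ordinary Pandas Series on the client, the usual Series methods apply; here's a quick look at the five largest total consumers.

## client-side Pandas on the returned Series: five largest total consumers
top5 = df_serial.sort_values(ascending = False).head(5)
display(top5)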

Let's answer another question: is water consumption higher on weekends or weekdays? Using the same technique from above, let's view the mean water consumption on weekends and weekdays. I'll begin by specifying the CASTable object followed by the groupby method to group by the Weekend column. The Weekend column indicates whether a reading occurred on a weekend, 1, or a weekday, 0. Then I'll specify the CAS column Daily_W_C_M3 and the mean method. Lastly, after the CAS server processes the data in parallel, it returns a Series to the client. On the client-side Series object I'll chain the rename method to rename the values 0 and 1 to Weekday and Weekend respectively.

(tbl                      ## CAS table reference          
 .groupby('Weekend')      ## Group the CAS table
 .Daily_W_C_M3            ## Specify the CAS table column to aggregate
 .mean()                  ## Specify the aggregation
 .rename({0:'Weekday',    ## Rename the values in the Series object returned from the CAS server on the client
          1:'Weekend'})
)

The results show that water consumption seems to be higher on weekends.
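If you want a quick visual check on the client, the returned Series supports the usual Pandas plotting; this sketch assumes matplotlib is installed.

## a client-side bar chart of the grouped means (assumes matplotlib is available)
import matplotlib.pyplot as plt
 
weekend_means = (tbl
                 .groupby('Weekend')
                 .Daily_W_C_M3
                 .mean()
                 .rename({0:'Weekday', 1:'Weekend'}))
 
weekend_means.plot(kind = 'bar', title = 'Mean daily water consumption (m3)')
plt.show()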

Using the CASTable groupby parameter

You can also achieve the same results using the CASTable object with the groupby parameter. For example, what if I wanted to answer the same question as the previous example about water consumption on weekends vs weekdays?

First, let's display the parameters of the tbl object using the params attribute.

tbl.params
# and the results
{'name': 'water_cluster', 'caslib': 'casuser'}

The results show the CASTable object has the name and caslib parameters. The name and caslib parameters simply reference the CAS table.

Next, add the groupby parameter to the tbl CASTable object with the column or columns to group by. Here I'll specify the Weekend column.

tbl.groupby = 'Weekend'
tbl
# and the results
CASTable('water_cluster', caslib='casuser', groupby='Weekend')

Notice that the CASTable object now contains the groupby parameter with the value Weekend. I'll confirm that the groupby parameter was added to the CASTable object by checking the object's parameters again.

tbl.params
# and the results
{'name': 'water_cluster', 'caslib': 'casuser', 'groupby': 'Weekend'}

The results show the groupby parameter was added to the CASTable object. Now I can simply use the CASTable object to aggregate by the group. Specify the CASTable object followed by the CAS table column to aggregate. Then add the mean aggregation method to aggregate the data by groups on the CAS server. Lastly, the rename method renames the values in the Series object returned from the CAS server to the client.

(tbl                      ## CAS table reference with the groupby parameter
 .Daily_W_C_M3            ## Specify the CAS table column to aggregate
 .mean()                  ## Specify the aggregation
 .rename({0:'Weekday',    ## Rename the values in the Series object returned from the CAS server on the client
          1:'Weekend'})
)


Notice the results are the same.
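One caveat: the groupby parameter stays on the tbl object for every subsequent action until you remove it. Removing it with del tbl.groupby may work (an assumption worth verifying against your SWAT version); creating a fresh CASTable reference always does.

## drop the grouping by creating a fresh reference to the CAS table
tbl = conn.CASTable('water_cluster', caslib = 'casuser')
tbl.params
# and the results
{'name': 'water_cluster', 'caslib': 'casuser'}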

Summary

The SWAT package blends the world of Pandas and SAS, making available many familiar Pandas techniques to work with your distributed CAS tables. This enables you to utilize the massively parallel processing power of the CAS server in SAS Viya using Python. Here we learned about using the familiar Pandas groupby method in the SWAT package and the CAS table groupby parameter to aggregate CAS tables by groups.

Additional and related resources

Getting Started with Python Integration to SAS® Viya® - Part 10 - Group and Aggregate CAS Tables was published on SAS Users.