The CAS physical data model, i.e.what features CAS offers for data storage, and how to use them to maximize performance in CAS (and consequently SAS Visual Analytics 8.1 too).
So, specifically let’s answer the question:
What CAS physical table storage features can we use to get better performance in CAS and SAS Visual Analytics/CAS?
CAS Physical Table Storage Features
The following data storage features affect how CAS tables are physically structured:
- Repeated Tables
- Extended Data Types (Varchar)
- User Defined Formats
Compression — the Storage Option that Degrades Performance
data public.MegaCorp (compress=yes); set baselib.MegaCorp; run;
Partitioning and Sorting
When performing analyses/processing, CAS first groups the data into the required BY-groups. Pre-partitioning on commonly-used BY-groups means CAS can skip this step, vastly improving performance.
Within partitions, tables can be sorted by non-partition-key variables. Pre-sorting by natural ordering variables (e.g. time) allows CAS to skip the ordering step in many cases just like partitioning allows CAS to skip the grouping step.
For a full use-case, consider a line graph that groups sales by region and plots by date. This graph object would benefit greatly from a CAS table that is pre-partitioned by region and pre-sorted by date.
Partitioning can also support join operations since both the CAS FedSQL Merge Join algorithm utilize BY-GROUP operations to support their processing.
Pre-partitioning tables in anticipation of joins will greatly improve join performance. A good use case is partitioning both a large transaction table and an equally large reference table (e.g. an enormous Customer table) by the common field, customerID. When a DATA Step MERGE or a FedSQL join is performed between the two tables on that field, the join/merge will take advantage of partitioning for the BY-GROUP operation resulting in something similar to a partition-wise join.
Like Compression, partitioning and sorting can be implemented via CAS actions as well as data set options. Using the data set options is demonstrated below:
data mycas.bigOrderTable (partition=(region division) orderby=(year quarter month)); set CASorBase.bigOrderTable; run;
By default, in distributed CAS Server deployments, CAS divides incoming tables into blocks and distributes those blocks among its DUPLICATE data set option or the Repeated Tables have two main use-cases in CAS:
1. Join Optimization
2. Small Table Operation Optimization
For join operations, the default data distribution scheme can result in significant network traffic as matching records from the two tables travel between worker nodes to meet. If one of the two tables was created with the DUPLICATE/REPEAT option, then every possible record from that table is available on every node to the other table. There is no need for any network traffic.
Small Table Operation Optimization
For small tables, even single table operations can perform better with repeated instead of divided distribution. LASR actually implemented the “High Volume Access to Smaller Tables” feature for the same reason. When a table is repeated, CAS runs any required operation on a single worker node against the full copy of the table that resides there, instead of distributing the work.
As stated, repeated tables can be implemented with the DUPLICATE data set option, it can also be implemented with the REPEAT option on the PROC CASUTIL LOAD statement. The CASUTIL method is shown below:
proc casutil ; load data=sashelp.prdsale outcaslib=”caspath” casout=”prdsale” replace REPEAT ; quit ;
Extended Data Types (VARCHAR)
With Viya 3.2 comes SAS’ first widespread implementation of variable length character fields. While Base SAS offers variable length character fields through compression, Viya 3.2 is the first major SAS release to include a save storage space, it also improves performance by reducing the size of the record being processed. CAS, like any other processing engine, will process narrower records more quickly than wide records.
User Defined Formats
User defined formats (UDFs) exist in CAS in much the same way they do in Base SAS. Their primary function, of course, is to provide display formatting for raw data values. Think about a format for direction. The raw data might be: “E”, “W”, “N”, “S” while the corresponding format values might be “East”, “West”, “North”, “South.”
So how might user defined formats improve performance in CAS? The same way they do in Base SAS, and the same way that VARCHAR does, by reducing the size of the record that CAS has to process. Imagine replacing multiple 200 byte description fields with 1 byte codes. If you had 10 such fields, the record length would decrease 1990 bytes ((10 X 200) – 10). This is an extreme example but it illustrates the point: User defined formats can reduce the amount of data that CAS has to process and, consequently, will lead to performance gains.