Stephen Foerster

June 19, 2018
 

Cloud Analytic Services (CAS) is really exciting. It’s open. It’s multi-threaded. It’s distributed. And, best of all for SAS programmers, it’s SAS. It looks like SAS. It feels like SAS. In fact, you can even run DATA Step in CAS. But, how does DATA Step work in a multi-threaded, distributed context? What’s new? What’s different? If I’m a SAS programming wizard, am I automatically a CAS programming wizard?

Creating a Unique ID

SAS DATA Step

While there are certainly other ways of doing it, a unique record ID is usually created in SAS with the _n_ automatic variable, as shown below:

DATA tableWithUniqueID;
   SET tableWithOutUniqueID;
   uniqueID = _n_;
run;

CAS DATA Step

Creating a unique ID in CAS DATA Step is a bit more complicated. Each thread maintains its own _n_, so if we just use _n_, we’ll get duplicate IDs. Each thread will produce uniqueID values of 1, 2, 3, and so on. When the thread output is combined, we’ll have a bunch of records with a uniqueID of 1, a bunch with a uniqueID of 2, and so on. This is not useful.

To produce a truly unique ID, you need to augment _n_ with something else. The _threadID_ automatic variable can help us get our unique ID, as shown below:

DATA tableWithUniqueID;
   SET tableWithOutUniqueID;
   uniqueID = put(_threadid_, 8.) || '_' || put(_n_, 8.);
run;

While there are surely other ways of doing it, concatenating _threadID_ with _n_ ensures uniqueness because the _threadID_ uniquely identifies a single thread and _n_ uniquely identifies a single row output by that thread.
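One refinement: put() with the 8. format left-pads its result with blanks, so the concatenated ID will contain embedded spaces. A minimal sketch of a compact alternative (my own variant, not from the original post) uses catx to strip the blanks:

DATA tableWithUniqueID;
   SET tableWithOutUniqueID;
   /* catx strips leading and trailing blanks from each argument
      and inserts the '_' separator between them */
   length uniqueID $ 20;
   uniqueID = catx('_', put(_threadid_, 8.), put(_n_, 8.));
run;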

Aggregation with DATA Step

Now, let’s look at “whole table” aggregation (no BY Groups).

SAS DATA Step

Aggregating an entire table in SAS DATA Step usually looks something like below. We create an aggregator field (totSalesAmt) and then add the detail records’ amount field (SaleAmt) to it as we process each record. Finally, when there are no more records (eof), we output the single aggregate row.

DATA aggregatedTable;
   SET detailedTable end=eof;
   retain totSalesAmt 0;
   totSalesAmt = totSalesAmt + SaleAmt;
   keep totSalesAmt;
   if eof then output;
run;

CAS DATA Step

While the above code returns one row in single-engine SAS, the same code returns multiple rows in CAS — one per thread. When I ran this code against a table in my environment, I got 28 rows (because CAS used 28 threads in this example).

As with the unique ID logic, producing a total aggregate is just a little more complicated in CAS. To make it work in CAS, we need a post-process step to bring the results together. So, our code would look like this:

DATA aggregatedTable;
   SET detailedTable end=eof;
   retain threadSalesAmt 0;
   threadSalesAmt = threadSalesAmt + SaleAmt;
   keep threadSalesAmt;
   if eof then output;
run;

DATA aggregatedTable / single=yes;
   SET aggregatedTable end=eof;
   retain totSalesAmt 0;
   totSalesAmt = totSalesAmt + threadSalesAmt;
   if eof then output;
run;

In the first DATA Step in the example above, we run basically the same code as in the SAS DATA Step example and let CAS do its distributed, multi-threaded processing because our table is large; spreading the work over multiple threads makes the aggregation much quicker. We then execute a second DATA Step in which we force CAS to use only one thread with the single=yes option, which guarantees a single output row. Using one thread here is optimal because there are only a few input records (one per thread from the previous step).

BY-GROUP Aggregation

With BY-Group processing, CAS distributes whole BY-Groups among its threads; individual threads are then assigned to individual BY-Groups. Since each BY-Group is processed by one and only one thread, when we aggregate, we won’t see multiple output rows for a BY-Group. So, there shouldn’t be a need to consolidate the thread results like there was with “whole table” aggregation above.

Consequently, BY-Group aggregation DATA Step code should look exactly the same in CAS and SAS (at least for the basic stuff).
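For illustration, here is a minimal BY-Group aggregation sketch (my own example; the region BY variable is hypothetical, SaleAmt comes from the earlier examples, and the input is assumed to be sorted, or in CAS grouped, by region):

DATA salesByRegion;
   SET detailedTable;
   BY region;
   retain regionSalesAmt 0;
   /* reset the accumulator at the start of each BY-Group */
   if first.region then regionSalesAmt = 0;
   regionSalesAmt = regionSalesAmt + SaleAmt;
   keep region regionSalesAmt;
   /* output one row per BY-Group */
   if last.region then output;
run;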

Concluding Thoughts

Coding DATA Step in CAS is very similar to coding DATA Step in SAS. If you’re a wizard in one, you’re likely a wizard in the other. The major difference is accounting for CAS’ massively parallel processing capabilities (which manifest as threads). For more insight into data processing with CAS, check out the SAS Global Forum paper.

Threads and CAS DATA Step was published on SAS Users.

November 20, 2017
 

Many SAS users have inquired about SAS Cloud Analytic Services’ (CAS) Distributed Network File System (DNFS) capability. (Learn more about CAS.)

The “NFS” in “DNFS”

Let’s start at the beginning. The “NFS” in DNFS stands for “Network File System” and refers to the ability to share files across a network. As the picture below illustrates, a network file system lets numerous remote hosts access another host’s files.

[Figure: NFS, with multiple remote hosts accessing a single host's files over the network]

There are numerous network file system protocols that can be used for file sharing – e.g. CIFS, Ceph, Lustre – but the most common on Linux is NFS. While NFS is the dominant file-sharing protocol, the “NFS” part of DNFS does not refer to the NFS protocol specifically. Currently, all of the DNFS-supported file systems are based on NFS, but DNFS may support file systems based on other protocols in the future. So, it’s best to think of the “NFS” part of “DNFS” as a generic network (clustered) file system and not the specific NFS protocol.

The “D” in “DNFS”

The “D” in DNFS stands for “Distributed” but it does not refer to the network file system. By definition, that is already distributed since the file system is external to the machines accessing it. The “Distributed” in DNFS refers to CAS’ ability to use a network file system in a massively parallel way. With a supported file system mounted identically on each CAS node, CAS can access (both read and write) the file system’s CSV and SASHDAT files from every worker node in parallel.

This parallel file access is not an attribute of the file system, it is a capability of CAS. By definition, network file systems facilitate access at the file level, not the block level. With DNFS, CAS actively manages parallel block level I/O to the network file system, making sure file seek locations and block I/O operations do not overlap, etc.
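As a hedged sketch (path, caslib, and table names are hypothetical, and an active CAS session is assumed), a DNFS caslib can be defined and a SASHDAT file loaded from it like this:

/* The mount point must be identical on every CAS controller and worker node */
caslib dnfsData datasource=(srctype="dnfs") path="/mnt/cas_dnfs/data";

proc casutil incaslib="dnfsData" outcaslib="dnfsData";
   /* CAS reads the SASHDAT blocks in parallel across the worker nodes */
   load casdata="orders.sashdat" casout="orders";
quit;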

[Figure: DNFS, with CAS worker nodes performing parallel I/O against a shared network file system]

DNFS as CAS Backing Store

Not only can CAS perform multi-machine parallel I/O from network file systems, it can also memory-map NFS SASHDAT files directly into CAS. Thus, SASHDAT files on DNFS act as both the CASlib data source and the virtual memory “backing store,” often negating the need for CAS to utilize CAS_DISK_CACHE.

Note 1: Data transformations on load, such as row filtering and field selection, as well as encryption can trigger CAS_DISK_CACHE usage. Since the data must be transformed (subset and/or decrypted), CAS copies the transformed data into CAS_DISK_CACHE to support CAS processing.

Note 2: It is possible to use DNFS atop an encrypting file system or hardware storage device. Here, the HDAT blocks are stored encrypted but transmitted to the CAS I/O layer decrypted. Assuming no other transformations, CAS_DISK_CACHE will not be used in this scenario.

[Figure: DNFS memory mapping of SASHDAT files]

Performance Considerations

DNFS-based CAS loading will only be as fast as the slowest component involved. The chosen NFS architecture (hardware and CAS connectivity) should support I/O throughput commensurate with the CAS installation and in line with the implementation’s service level agreements. The diagram below depicts a NetApp ONTAP clustering architecture; a different file system technology might look a little different, but the same basic ideas apply.

[Figure: DNFS with a multi-machine clustered file system (NetApp ONTAP example)]

As described earlier, CAS manages the parallel I/O operations. Requests from CAS are sent to the appliance and handled by the NFS metadata server. The storage device implementing the NFS protocol points CAS DNFS to the proper file and block locations on the NFS data servers which pass the data to the CAS worker nodes directly.

Understanding DNFS was published on SAS Users.

August 15, 2017
 

In this post, I’ll discuss the CAS physical data model, i.e. what features CAS offers for data storage, and how to use them to maximize performance in CAS (and consequently SAS Visual Analytics 8.1, too).

So, specifically let’s answer the question:

What CAS physical table storage features can we use to get better performance in CAS and SAS Visual Analytics/CAS?

CAS Physical Table Storage Features

The following data storage features affect how CAS tables are physically structured:

  • Compression
  • Partitioning
  • Sorting
  • Repeated Tables
  • Extended Data Types (Varchar)
  • User Defined Formats

Compression — the Storage Option that Degrades Performance

Compression saves memory at the cost of CPU: compressed rows must be decompressed every time they are processed, so a compressed table is smaller but slower to work with. It is implemented with a simple data set option:

data public.MegaCorp (compress=yes);
   set baselib.MegaCorp;
run;

Partitioning and Sorting

Partitioning is a powerful tool for improving BY-Group processing performance. Many CAS analyses and SAS Visual Analytics report objects (e.g. Bar Charts, Decision Trees, Linear Regressions) provide grouping as well as classification functionality, and that grouping translates into BY-Group processing in CAS.

When performing analyses/processing, CAS first groups the data into the required BY-groups. Pre-partitioning on commonly-used BY-groups means CAS can skip this step, vastly improving performance.

Within partitions, tables can be sorted by non-partition-key variables. Pre-sorting by natural ordering variables (e.g. time) allows CAS to skip the ordering step in many cases just like partitioning allows CAS to skip the grouping step.

For a full use-case, consider a line graph that groups sales by region and plots by date. This graph object would benefit greatly from a CAS table that is pre-partitioned by region and pre-sorted by date.
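A hedged sketch of preparing such a table (table and variable names are hypothetical) might look like this:

data mycas.salesPrepped (partition=(region) orderby=(saleDate));
   set mycas.salesDetail;
run;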

Join Optimization

Partitioning can also support join operations, since both DATA Step MERGE and the CAS FedSQL merge-join algorithm utilize BY-GROUP operations to support their processing.

Pre-partitioning tables in anticipation of joins will greatly improve join performance. A good use case is partitioning both a large transaction table and an equally large reference table (e.g. an enormous Customer table) by the common field, customerID. When a DATA Step MERGE or a FedSQL join is performed between the two tables on that field, the join/merge will take advantage of partitioning for the BY-GROUP operation resulting in something similar to a partition-wise join.

Like Compression, partitioning and sorting can be implemented via CAS actions as well as data set options. Using the data set options is demonstrated below:

data mycas.bigOrderTable (partition=(region division) orderby=(year quarter month));
   set CASorBase.bigOrderTable;
run;
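To illustrate the join use-case described above, here is a hedged sketch (libref, column, and session names are hypothetical, and the mycas libref is assumed to point at the session's active caslib) that partitions both tables by customerID and then joins them with FedSQL:

data mycas.transactions (partition=(customerID));
   set CASorBase.transactions;
run;

data mycas.customers (partition=(customerID));
   set CASorBase.customers;
run;

proc fedsql sessref=casSession;
   /* the matching partitions let the merge join run without reshuffling rows */
   create table txnWithCust as
   select t.*, c.customerName
   from transactions t inner join customers c
        on t.customerID = c.customerID;
quit;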

Repeated Tables

By default, in distributed CAS Server deployments, CAS divides incoming tables into blocks and distributes those blocks among its worker nodes. Alternatively, an entire table can be copied to every worker node (a “repeated” table) via the DUPLICATE data set option or the REPEAT option on the PROC CASUTIL LOAD statement. Repeated tables have two main use-cases in CAS:

1. Join Optimization
2. Small Table Operation Optimization

Join Optimization

For join operations, the default data distribution scheme can result in significant network traffic as matching records from the two tables travel between worker nodes to meet. If one of the two tables was created with the DUPLICATE/REPEAT option, then every possible record from that table is available on every node to the other table. There is no need for any network traffic.

Small Table Operation Optimization

For small tables, even single table operations can perform better with repeated instead of divided distribution. LASR actually implemented the “High Volume Access to Smaller Tables” feature for the same reason. When a table is repeated, CAS runs any required operation on a single worker node against the full copy of the table that resides there, instead of distributing the work.

As stated, repeated tables can be implemented with the DUPLICATE data set option; they can also be implemented with the REPEAT option on the PROC CASUTIL LOAD statement. The CASUTIL method is shown below:

proc casutil;
   load data=sashelp.prdsale outcaslib="caspath"
        casout="prdsale" replace repeat;
quit;
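For reference, a hedged sketch of the data set option route (assuming DUPLICATE accepts a YES value, in line with the other CAS data set options) would be:

data mycas.prdsale (duplicate=yes);
   set sashelp.prdsale;
run;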

Extended Data Types (VARCHAR)

With Viya 3.2 comes SAS’ first widespread implementation of variable-length character fields. While Base SAS offers variable-length character fields through compression, Viya 3.2 is the first major SAS release to include a true VARCHAR data type. Not only does VARCHAR save storage space, it also improves performance by reducing the size of the record being processed. CAS, like any other processing engine, will process narrower records more quickly than wide records.
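A minimal sketch (table and column names are hypothetical) of converting a wide fixed-width field to VARCHAR in a CAS DATA Step:

data mycas.comments;
   /* varchar stores only the bytes actually used, not a fixed 5000-byte width */
   length shortComment varchar(5000);
   set mycas.rawComments;
   shortComment = strip(longComment);   /* longComment: hypothetical CHAR(5000) column */
   drop longComment;
run;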

User Defined Formats

User defined formats (UDFs) exist in CAS in much the same way they do in Base SAS. Their primary function, of course, is to provide display formatting for raw data values. Think about a format for direction. The raw data might be: “E”, “W”, “N”, “S” while the corresponding format values might be “East”, “West”, “North”, “South.”

So how might user defined formats improve performance in CAS? The same way they do in Base SAS, and the same way that VARCHAR does, by reducing the size of the record that CAS has to process. Imagine replacing multiple 200 byte description fields with 1 byte codes. If you had 10 such fields, the record length would decrease 1990 bytes ((10 X 200) – 10). This is an extreme example but it illustrates the point: User defined formats can reduce the amount of data that CAS has to process and, consequently, will lead to performance gains.
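As a hedged illustration (format, caslib, and table names are hypothetical), the direction format could be created and made available to CAS like this:

/* casfmtlib= makes the format available to the CAS session as well as to SAS */
proc format casfmtlib="casFormats";
   value $dircd
      'E' = 'East'
      'W' = 'West'
      'N' = 'North'
      'S' = 'South';
run;

data mycas.trips;
   set mycas.rawTrips;
   /* only the 1-byte code is stored; the format supplies the display value */
   format direction $dircd.;
run;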

CAS data modeling for performance was published on SAS Users.

October 30, 2015
 

Is your LASR implementation running short on memory? Since LASR tables are stored in memory, it can become scarce. So what can we do to minimize LASR table size and still get LASR’s legendary performance? Here are a few strategies on how to shrink LASR tables:

Compression: When compression was first introduced, many people thought it would be the best and only solution for reducing LASR table sizes. However, it is actually just one of many approaches. While LASR compression can reduce table sizes dramatically (94% in a test I did on the MegaCorp sample table), it can affect performance. Compression is easily implemented, so it should be the first strategy you try.

Pivoting: Pivoting transforms rows into columns to save space. See the example below for reference:

[Figure: ShrinkingLASR1, a simple pivoting example]

Even in this simple example, we were able to cut the number of table cells in half (from 12 to 6). However, space savings from pivoting can be even more dramatic than that. Imagine pivoting by a simple DIRECTION field (East, West, North, South): this would cut the number of rows by three fourths (75%), since all of the direction amounts would now be stored in a single row. Now imagine pivoting by multiple fields and you can see how to cut tables down by 90% or more. Pivoting changes the structure of the table, so analysis logic will need to change. However, as an analytic engine, LASR likes wide data. This, combined with the reduced table size, means that pivoting actually increases LASR performance.
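As a hedged sketch of the DIRECTION example (table and variable names are hypothetical; the input must be sorted by id), PROC TRANSPOSE can do the pivoting before the table is loaded into LASR:

/* one input row per id/direction pair becomes one output row per id,
   with East, West, North and South as columns */
proc transpose data=work.directionAmounts out=work.pivoted (drop=_name_);
   by id;
   id direction;
   var amount;
run;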

Star Schema Views: LASR Star Schema Views are implemented similarly to other data management systems’ star schemas and offer the same benefits. While Star Schema views carry a modest performance impact, space savings can be dramatic (70%+) if you have many low-cardinality categorical fields that can be moved to dimension tables.

Computed Columns: LASR Computed Columns / Calculated Data Items are non-materialized columns defined by expressions. They take up no memory for storage. As compared to a materialized version of the same column, they save 100% of the column space. Computed columns incur no performance penalty and should be used in place of ETL where possible. Some computed column examples are shown below. Note that both numeric and character columns are supported along with conditional logic via SAS DATA Step If-Then-Else syntax.

proc imstat nopreparse;
   table lasr1.mega_corp;
   compute productName "productname=scan(productdescription,1)";
   compute NetProfit   "netprofit=Profit - ExpensesStaffing";
   compute UnitConditn "if unitlifespan > unitage then unitconditn='Past Due'; else unitconditn='Good';";
quit;

Custom Formats: Custom formats had some performance problems in early releases. Those problems are gone now, and custom formats can be used to save space by replacing long categorical variables with short formatted code fields. The graphic below shows a 9+ character REGION field being replaced by a formatted one-character Region_Cd field. This saves 8+ bytes per row.

[Figure: ShrinkingLASR2, a REGION field replaced by a one-character Region_Cd field with a custom format]

SASHDAT Memory Mapping: SASHDAT memory mapping extends LASR memory by mapping SASHDAT blocks to virtual memory where the OS can swap them in and out of RAM as necessary. This allows for better memory utilization and a more dynamic environment since tables that aren’t being used will be moved out of memory in favor of tables that are being accessed.

 

tags: data management, LASR, SAS Professional Services

How to shrink LASR tables was published on SAS Users.

August 11, 2015
 

The SASIOLA libname engine is now delivered with most SAS software orders so that we can load LASR environments remotely from separate SAS environments. For example, maybe you have an existing SAS “ETL” machine that you use to access your source environments and load your target environments. With the SASIOLA libname engine, you can now use this machine to load your distributed LASR environment as well.

In addition to SASIOLA, another technology comes to mind when we talk about reading and writing between separate SAS environments, SAS/CONNECT. The natural question is, “Can I utilize SAS/CONNECT to load LASR from separate SAS environments?” Yes, you can.

Remote SAS datasets can be accessed via SAS/CONNECT Remote Library Services (RLS), but RLS is considered inefficient for large data especially when any transformation is required. A better approach is to utilize SAS/CONNECT’s Compute Services to transform the data and Data Transfer Services to move the data into LASR via the LASR SAS server machine. This is a particularly effective strategy with data that is too large for the LASR SAS server’s local disk (while small enough to fit when distributed across the entire LASR hardware array).

Below is an example using DI Studio. DI Studio’s ability to control SAS processing on multiple hosts (by generating SAS/CONNECT code) is actually one of my favorite features in the SAS stack of products.

LoadingData

As the picture shows, the final job only contains two transformations — one to perform the join on the remote SAS server and one to transfer the output of the join into LASR on the SAS VA Server.

To make this all work, both machines — the remote SAS server as well as the SAS VA server — need SAS/CONNECT installed, licensed and configured. Additionally, the SAS metadata environment that DI Studio is using needs to have both SAS machines (and their libraries and tables) registered in metadata. In the picture below, we can see such a metadata environment with a VA SAS machine named SASAppVA (sasserver01) and a remote SAS machine named SASAppEBI (sasserver05).

LoadingData2

And if you don’t have DI Studio, the code is pretty easy to write too.

OPTION SET=GRIDINSTALLLOC="/opt/TKGrid";
OPTION SET=GRIDHOST="sasserver01";

libname lasr1 sasiola port=10010 tag='hps';

/* Connect to the remote server */
  options comamid=TCP;
  %let local = remotesassrv 7551;
  data _null_;
  signon local user=xxxxx password=XXXXXX
  noscript;
  run;

/* Transform the data on the remote server */
  rsubmit local sysrputsync = yes ;
    libname basedir "/sas94/data";
    proc sql noprint;
      create table Complete_Cars as
      select *
        from basedir.Cars1 a
             inner join basedir.Cars2 b
             on a.car_id = b.car_id;
    quit;

/* Transfer the data to LASR on the VA SAS Server */
    proc download data=Complete_Cars
                  out=lasr1.Cars_LASR;
    run;
  endrsubmit; 

Note that all transformation is done on the remote host, not the LASR SAS Server. Also note that the data is loaded from the remote location straight into LASR; there is no staging on the LASR SAS Server.
tags: Loading data in LASR, SAS LASR Analytic Engine

Loading LASR Remotely with DI Studio (SAS/Connect) was published on SAS Users.

April 30, 2015
 

SAS Event Stream Processing that is! The latest release of SAS Event Stream Processing will launch May 12, and numerous customers around the globe are already using it. So what’s the big deal?

Why event streams are important to business

SAS Event Stream Processing allows organizations to react to events virtually instantaneously. Consider the following scenarios. Imagine if:

  • An online retailer creates custom offers as customers click around the web
  • An oil company automatically vents a pipeline to a reservoir when sensors detect an increase in pressure
  • A financial regulator reverses predatory trading immediately after it occurs

These are all extremely high value scenarios, and the reason is the rapid reaction time. If the retailer markets to the customer after he or she has already ended their session, it’s less effective. If the oil pipeline breaks, it is a disaster. If the predatory trading isn’t reversed for months, markets and people’s lives are negatively affected.

The point is that the value of information decreases dramatically over time. While it’s often helpful to analyze historical data, the faster you can use that analysis to react to actual events, the more valuable it can be to the bottom line. As the graphic below shows, the quicker you take action following a business event, the more that action will be worth in money earned or saved.

[Figure: ESP_XML_capture.png]

Developing event stream models

The recommended path for executing SAS Event Stream Processing models is through the XML Factory Server. To support faster model development, SAS recently added a visual development environment called the SAS Event Stream Processing Studio. Using this graphical interface, model designers can drag and drop windows onto the workspace area to create the appropriate data-centric flow and apply processing rules to address any kind of business need. Simple and intuitive.

SAS Event Stream Processing Studio showing code and associated process flow.

Once the project is designed, it can be tested directly within SAS Event Stream Processing Studio and, if everything is fine, the XML model that’s automatically generated from the interface can be published to the appropriate server for execution.

SAS Event Stream Studio supports faster analysis and detection of events with:

  • an intuitive environment for developing and testing projects (aka models)
  • a palette of windows and connectors that can be used to design even the most complex event streaming models
  • a definition and testing environment that reduces the need for programming in XML or C++
  • ability to easily instantiate visually-defined models to the XML factory server, connecting to live data streams for model validation
  • full visibility into the automatically generated XML code, which can be further customized with edits and additions

For more information

tags: event stream processing, SAS Event Stream Processing, SAS Professional Services, SAS Programmers

The post You react so quickly! Do you have ESP? appeared first on SAS Users.

January 23, 2015
 

Everyone is always looking for test data. Business analysts want it for demos and prototypes. Software developers want it for development and unit testing. Testers want it for system and integration testing.

I’ve written many programs to generate test data over the years, as have many other SAS users. Generated data can be great, but it’s always nice to have real data when you can get it. It demos better. It highlights issues better (Believe me, real people will enter things you never dreamed of including in your generated data). It even helps you be a better software designer since you can actually look at real data and react to what you’re seeing.

So how do you get real data when individuals, businesses and organizations are increasingly protective of it?

Well, actually, in some areas of business that seems to be changing. In an effort to increase transparency and visibility, many organizations make some data available for use by outsiders. For example, Amazon, Google Shopping, and BestBuy all make their product and review data available for integration into other web sites.

This product data is meant for web integration so PROC HTTP is the key to downloading, and it could hardly be easier!

Let’s look at getting some BestBuy product review data. Looking at the documentation for the Reviews API, we are told to format our request like this:

http://api.remix.bestbuy.com/v1/reviews(sku=1780275)?apiKey=YourAPIKey&s

Here are the key parameters:

  • sku parameter denotes the particular product being reviewed
  • apiKey parameter denotes your authentication key (you must register to get it)
  • show parameter denotes which columns to return

From there, downloading the data is just a matter of putting the request into PROC HTTP and changing the parameters to meet your needs. Here’s some code I ran to get iPad 2 Air reviews:

filename out temp;

PROC HTTP out=out url='http://api.remix.bestbuy.com/v1/reviews(sku=3315023)?apiKey=your-authentication-key&page=1&pageSize=100&show=id,sku,comment,rating'
method="get";
run;

libname out xmlv2;

proc append base=iPadReviews data=out.review; run;

BestBuy places a limit of 100 records (the pageSize parameter) per request, so I wrote the code so it could be used iteratively. To do so, simply create a SAS macro from this example and increment the page parameter via macro substitution to read through all the reviews.
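A hedged sketch of that macro (the macro, fileref, and parameter names are mine; the %nrstr calls keep the ampersands in the query string from being read as macro triggers) might look like this:

%macro getReviews(sku=, pages=);
   %local p;
   %do p = 1 %to &pages;
      filename resp temp;
      proc http out=resp method="get"
         url="http://api.remix.bestbuy.com/v1/reviews(sku=&sku)?apiKey=your-authentication-key%nrstr(&pageSize)=100%nrstr(&show)=id,sku,comment,rating%nrstr(&page)=&p";
      run;

      libname resp xmlv2;
      /* FORCE tolerates minor structural differences between pages */
      proc append base=iPadReviews data=resp.review force;
      run;

      libname resp clear;
      filename resp clear;
   %end;
%mend getReviews;

%getReviews(sku=3315023, pages=5);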

Now, let’s look at our new iPad 2 Air product review test data!

[Figure: BestBuy product review data]

tags: PROC HTTP, SAS Professional Services, SAS Programmers, Web data