3月 282017
 

I have been using the SAS Viya environment for just over six months now and I absolutely love it.  As a long-time SAS coder and data scientist I’m thrilled with the speed and greater accuracy I’m getting out of a lot of the same statistical techniques I once used in SAS9.  So why would a data scientist want to switch over to the new SAS Viya platform? The simple response is “better, faster answers.”  There are some features that are endemic to the SAS Viya architecture that provide advantages, and there are also benefits specific to different products as well.  So, let me try to distinguish between these.

SAS Viya Platform Advantages

To begin, I want to talk about the SAS Viya platform advantages.  For data processing, SAS Viya uses something called the CAS (Cloud Analytic Services) server – which takes the place of the SAS9 workspace server.  You can still use your SAS9 installation, as SAS has made it easy to work between SAS9 and SAS Viya using SAS/CONNECT, a feature that will be automated later in 2017.

Parallel Data Loads

One thing I immediately noticed was the speed with which large data sets are loaded into SAS Viya memory.  Using Hadoop, we can stage input files in either HDFS or Hive, and CAS will lift that data in parallel into its pooled memory area.  The same data conversion is occurring, like what happened in SAS9, but now all available processors can be applied to load the input data simultaneously.  And speaking of RAM, not all of the data needs to fit exactly into memory as it did with the LASR and HPA procedures, so much larger data sets can be processed in SAS Viya than you might have been able to handle before.

Multi-threaded DATA step

After initially loading data into SAS Viya, I was pleased to learn that the SAS DATA step is multi-threaded.  Most of your SAS9 programs will run ‘as is,’ however the multi-processing really only kicks in when the system finds explicit BY statements or partition statements in the DATA step code.  Surprisingly, you no longer need to sort your data before using BY statements in Procs or DATA steps.  That’s because there is no Proc Sort anymore – sorting is a thing of the past and certainly takes some getting used to in SAS Viya.  So for all of those times where I had to first sort data before I could use it, and then execute one or more DATA steps, that all transforms into a more simplified code stream.   Steven Sober has some excellent code examples of the DATA step running in full-distributed mode in his recent article.

Open API’s

While all of SAS Viya’s graphical user interfaces are designed with consistency of look and feel in mind, the R&D teams have designed it to allow almost any front-end or REST service submit commands and receive results from either CAS or its corresponding micro-service architecture.  Something new I had to learn was the concept of a CAS action set.  CAS action sets are comprised of a number of separate actions which can be executed singly or with other actions belonging to the same set.  The cool thing about CAS actions is that there is one for almost any task you can think about doing (kind of like a blend between functions and Procs in SAS9).  In fact, all of the visual interfaces SAS deploys utilize CAS actions behind the scenes and most GUI’s will automatically generate code for you if you do not want to write it.

But the real beauty of CAS actions is that you can submit them through different coding interfaces using the open Application Programming Interface’s (API’s) that SAS has written to support external languages like Python, Java, Lua and R (check out Github on this topic).  The standardization aspect of using the same CAS action within any type of external interface looks like it will pay huge dividends to anyone investing in this approach.

Write it once, re-use it elsewhere

I think another feature that old and new users alike will adore is the “write-it-once, re-use it” paradigm that CAS actions support.  Here’s an example of code that was used in Proc CAS, and then used in Jupyter notebook using Python, followed by a R/REST example.

Proc CAS

proc cas;
dnnTrain / table={name  = 'iris_with_folds'
                   where = '_fold_ ne 19'}
 modelWeights = {name='dl1_weights', replace=true}
 target = "species"
 hiddens = {10, 10} acts={'tanh', 'tanh'}
 sgdopts = {miniBatchSize=5, learningRate=0.1, 
                  maxEpochs=10};
run;

 

Python API

s.dnntrain(table = {‘name’: 'iris_with_folds’,
                                  ‘where’: '_fold_ ne 19},
   modelweights = {‘name’: 'dl1_weights', ‘replace’: True}
   target  = "species"
   hiddens  = [10, 10], acts=['tanh', ‘tanh']
   sgdopts  = {‘miniBatchSize’: 5, ‘learningRate’: 0.1, 
                      ‘maxEpochs’: 10})

 

R API

cas.deepNeural.dnnTrain(s,
  table = list(name = 'iris_with_folds’
                   where = '_fold_ ne 19’),
  modelweights = list({name='dl1_weights', replace=T),
  target = "species"
  hiddens = c(10, 10), acts=c('tanh', ‘tanh‘)
  sgdopts = list(miniBatchSize = 5, learningRate = 0.1,
                   maxEpochs=10))

 

See how nearly identical each of these three are to one another?  That is the beauty of SAS Viya.  Using a coding approach like this means that I do not need to rely exclusively on finding SAS coding talent anymore.  Younger coders who usually know several open source languages take one look at this, understand it, and can easily incorporate it into what they are already doing.  In other words, they can stay in coding environments that are familiar to them, whilst learning a few new SAS Viya objects that dramatically extend and improve their work.

Analytics Procedure Advantages

Auto-tuning

Next, I want address some of the advantages in the newer analytics procedures.  One really great new capability that has been added is the auto-tuning feature for some machine learning modeling techniques, specifically (extreme) gradient boosting, decision tree, random forest, support vector machine, factorization machine and neural network.  This capability is something that is hard to find in the open source community, namely the automatic tuning of major option settings required by most iterative machine learning techniques.  Called ‘hyperspace parameters’ among data scientists, SAS has built-in optimizing routines that try different settings and pick the best ones for you (in parallel!!!).  The process takes longer to run initially, but, wow, the increase in accuracy without going through the normal model build trial-and-error process is worth it for this amazing feature!

Extreme Gradient Boosting, plus other popular ML techniques

Admittedly, xgboost has been in the open source community for a couple of years already, but SAS Viya has its own extreme[1] gradient boosting CAS action (‘gbtreetrain’) and accompanying procedure (Gradboost).  Both are very close to what Chen (2015, 2016) originally developed, yet have some nice enhancements sprinkled throughout.  One huge bonus is the auto-tuning feature I mentioned above.  Another set of enhancements include: 1) a more flexible tree-splitting methodology that is not limited to CART (binary tree-splitting), and 2) the handling of nominal input variables is done automatically for you, versus ‘one-hot-encoding’ you need to perform in most open source tools.  Plus, lots of other detailed option settings for fine tuning and control.

In SAS Viya, all of the popular machine learning techniques are there as well, and SAS makes it easy for you to explore your data, create your own model tournaments, and generate score code that is easy to deploy.  Model management is currently done through SAS9 (at least until the next SAS Viya release later this year), but good, solid links are provided between SAS Viya and SAS9 to make transferring tasks and output fairly seamless.  Check out the full list of SAS Viya analytics available as of March 2017.

In-memory forecasting

It is hard to beat SAS9 Forecast Server with its unique 12 patents for automatic diagnosing and generating forecasts, but now all of those industry-leading innovations are also available in SAS Viya’s distributed in-memory environment. And by leveraging SAS Viya’s optimized data shuffling routines, time series data does not need to be sorted, yet it is quickly and efficiently distributed across the shared memory array. The new architecture also has given us a set of new object packages to make more efficient use of the data and run faster than anything witnessed before. For example, we have seen 1.5 million weekly time series with three years of history take 130 hours (running single-machine and single-threaded) and reduce that down to run in 5 minutes on a 40 core networked array with 10 threads per core. Accurately forecasting 870 Gigabytes of information, in 5 minutes?!? That truly is amazing!

Conclusions

Though I first ventured into SAS Viya with some trepidation, it soon became clear that the new platform would fundamentally change how I build models and analytics.  In fact, the jumps in performance and the reductions in time spent to do routine work has been so compelling for me, I am having a hard time thinking about going back to a pure SAS9 environment.  For me it’s all about getting “better, faster answers,” and SAS Viya allows me to do just that.   Multi-threaded processing is the way of the future and I want to be part of that, not only for my own personal development, but also because it will help me achieve things for my customers they may not have thought possible before.  If you have not done so already, I highly recommend you to go out and sign up for a free trial and check out the benefits of SAS Viya for yourself.


[1] The definition of ‘extreme’ refers only to the distributed, multi-threaded aspect of any boosting technique.

References

Chen , Tianqi and Carlos Guestrin , “XGBoost: Reliable Large-scale Tree Boosting System”, 2015

Chen , Tianqi and Carlos Guestrin, “XGBoost: A Scalable Tree Boosting System”, 2016

Using the Robust Analytics Environment of SAS Viya was published on SAS Users.

 Leave a Reply

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

(required)

(required)