Global Technology Practice

April 26, 2017
 

Oracle databases from SAS Viya

SAS Data Connector to Oracle lets you easily load data from your Oracle database into SAS Viya for advanced analytics. SAS/ACCESS Interface to Oracle (on SAS Viya) provides the required SAS Data Connector to Oracle, which must be deployed to your CAS environment. Once the configuration steps described below are completed, SAS Studio and SAS Visual Data Builder can directly access Oracle tables and load them into the CAS engine.

SAS Data Connector to Oracle requires Oracle client components (release 12c or later) to be installed and configured, and the resulting configuration deployed to your CAS server. The following guide walks you through installing the Oracle client on a Linux server and configuring SAS Data Connector to Oracle on SAS Viya 3.1 (or 3.2):

Step 1: Get the Oracle Instant Client software (release 12c or later)

To find the Oracle client software packages, open a web browser and navigate to the Oracle download page at:
http://www.oracle.com/technetwork/topics/linuxx86-64soft-092277.html

Download the following two packages to your CAS controller server:

  • oracle-instantclient12.1-basic-12.1.0.2.0-1.x86_64.rpm
  • oracle-instantclient12.1-sqlplus-12.1.0.2.0-1.x86_64.rpm (optional for testing only)

Step 2: Install and configure Oracle instant client

On your CAS controller server, execute the following commands to install the Oracle instant client and SQLPlus utilities.

rpm -ivh oracle-instantclient12.1-basic-12.1.0.2.0-1.x86_64.rpm
rpm -ivh oracle-instantclient12.1-sqlplus-12.1.0.2.0-1.x86_64.rpm

The Oracle client should be installed to /usr/lib/oracle/12.1/client64.
To configure the Oracle client, create a file called tnsnames.ora, for example, in the /etc/ directory. Paste the following lines with the appropriate connection parameters of your Oracle DB into the tnsnames.ora file. (Replace "your_tnsname", "your_oracle_host", "your_oracle_port" and "your_oracle_db_service_name" with parameters according to your Oracle DB implementation)

your_tnsname =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = your_oracle_host)(PORT = your_oracle_port))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = your_oracle_db_service_name)
    )
  )

Next you need to set environment variables for the Oracle client to work:

LD_LIBRARY_PATH: Specifies the directory of your Oracle instant client libs
PATH: Must include your Oracle instant client bin directory
TNS_ADMIN: Directory of your tnsnames.ora file
ORACLE_HOME: Location of your Oracle instant client installation

In a console window on your CAS controller Linux server, issue the following commands to set environment variables for the Oracle client: (replace the directory locations if needed)

export LD_LIBRARY_PATH=/usr/lib/oracle/12.1/client64/lib:$LD_LIBRARY_PATH
export PATH=/usr/lib/oracle/12.1/client64/bin:$PATH 
export TNS_ADMIN=/etc 
export ORACLE_HOME=/usr/lib/oracle/12.1/client64

If you installed the SQLPlus package from Oracle, you can test connectivity with the following command: (replace "your_oracle_user" and "your_tnsname" with a valid oracle user and the tnsname configured previously)

sqlplus your_oracle_user@your_tnsname

When prompted for a password, use your Oracle DB password to log on.
You should see a “SQL>” prompt and you can issue SQL queries against the Oracle DB tables to verify your DB connection. This test indicates if the Oracle instant client is successfully installed and configured.

Step 3: Configure SAS Data Connector to Oracle on SAS Viya

Next you need to configure the CAS server to use the instant client.

The recommended way is to edit the vars.yml file in your Ansible playbook and deploy the required changes to your SAS Viya cluster.

Locate the vars.yml file in your cluster deployment and change the CAS_SETTINGS section to set the environment variables that CAS needs to use the Oracle Instant Client. To do so, uncomment the lines for ORACLE_HOME and LD_LIBRARY_PATH and insert the path of your Oracle Instant Client installation, as shown below.

#### CAS Specific ####
# Anything in this list will end up in the cas.settings file
CAS_SETTINGS:
   1: ORACLE_HOME=/usr/lib/oracle/12.1/client64
   #3: ODBCHOME=ODBC home directory
   #4: JAVA_HOME=/usr/lib/jvm/jre-1.8.0
   5: LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$ORACLE_HOME/lib

Run the ansible-playbook command to deploy the changes to your CAS server.
After Ansible finishes the update, your cas.settings file should contain the following lines:

export ORACLE_HOME=/usr/lib/oracle/12.1/client64
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$ORACLE_HOME/lib

Now you are ready to use SAS/ACCESS Interface to Oracle in SAS Viya.

Step 4: Test SAS/ACCESS Interface to Oracle in SAS Studio

Log on to SAS Studio to load data from your Oracle DB into CAS.
Execute the following SAS Code example in SAS Studio to connect to your Oracle DB and load data into CAS. Change the parameters starting with "your_" in the SAS code below according to your Oracle DB implementation.

/************************************************************************/
/*  Start a CAS session   */
/************************************************************************/
cas;
/************************************************************************/
/*  Create a Caslib for the Oracle connection   */
/************************************************************************/
caslib ORACLE datasource=(                                           
    srctype="oracle",
    uid="your_oracle_user_ID",
    pwd="your_oracle_password",
    path="//your_db_hostname:your_db_port/your_oracle_service_name",
    schema="your_oracle_schema_name" );
 
/************************************************************************/
/*  Load a table from Oracle into CAS   */
/************************************************************************/
proc casutil;
   list files incaslib="ORACLE"; 
   load casdata="your_oracle_table_name" incaslib="ORACLE" outcaslib="casuser" casout="your_cas_table_name";                   
   list tables incaslib="casuser";
quit;
/************************************************************************/
/*  Assign SAS librefs to the caslibs so they appear in SAS Studio   */
/************************************************************************/
 caslib _all_ assign;

The previous example is a simple SAS program to test access to Oracle and load data from an Oracle table into CAS memory. As a result, the program creates a table in your CAS library populated with data from your Oracle database.

How to configure Oracle client for successful access to Oracle databases from SAS Viya was published on SAS Users.

March 28, 2017
 

I have been using the SAS Viya environment for just over six months now and I absolutely love it.  As a long-time SAS coder and data scientist I’m thrilled with the speed and greater accuracy I’m getting out of a lot of the same statistical techniques I once used in SAS9.  So why would a data scientist want to switch over to the new SAS Viya platform? The simple response is “better, faster answers.”  There are some features that are endemic to the SAS Viya architecture that provide advantages, and there are also benefits specific to different products as well.  So, let me try to distinguish between these.

SAS Viya Platform Advantages

To begin, I want to talk about the SAS Viya platform advantages.  For data processing, SAS Viya uses something called the CAS (Cloud Analytic Services) server – which takes the place of the SAS9 workspace server.  You can still use your SAS9 installation, as SAS has made it easy to work between SAS9 and SAS Viya using SAS/CONNECT, a feature that will be automated later in 2017.

Parallel Data Loads

One thing I immediately noticed was the speed with which large data sets are loaded into SAS Viya memory.  Using Hadoop, we can stage input files in either HDFS or Hive, and CAS will lift that data in parallel into its pooled memory area.  The same data conversion occurs as it did in SAS9, but now all available processors can be applied to load the input data simultaneously.  And speaking of RAM, not all of the data needs to fit exactly into memory as it did with the LASR and HPA procedures, so much larger data sets can be processed in SAS Viya than you might have been able to handle before.
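As a hedged sketch of such a parallel lift (the caslib name, HDFS path, and table names here are placeholders, and a CAS session is assumed to be running), the load might look like this:

/* Define a caslib over co-located HDFS and load a table into CAS in parallel */
caslib hdfsdata datasource=(srctype="hdfs") path="/data/retail";

proc casutil;
   load casdata="transactions.sashdat" incaslib="hdfsdata"
        outcaslib="casuser" casout="transactions";
quit;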

Multi-threaded DATA step

After initially loading data into SAS Viya, I was pleased to learn that the SAS DATA step is multi-threaded.  Most of your SAS9 programs will run 'as is'; however, the multi-processing really only kicks in when the system finds explicit BY statements or partition statements in the DATA step code.  Surprisingly, you no longer need to sort your data before using BY statements in procs or DATA steps.  That's because there is no PROC SORT anymore: sorting is a thing of the past, and it certainly takes some getting used to in SAS Viya.  So all of those times where I had to first sort data before I could use it, and then execute one or more DATA steps, now collapse into a more simplified code stream.   Steven Sober has some excellent code examples of the DATA step running in full-distributed mode in his recent article.
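For instance, a minimal sketch of a BY-group DATA step running in CAS might look like this (the MYCAS libref, the SALES table, and its columns are hypothetical, not from the original post):

/* BY-group processing in a CAS-enabled DATA step; no PROC SORT required,
   because CAS groups the rows by the BY variable across threads. */
data mycas.region_totals;
   set mycas.sales;
   by region;
   if first.region then total_amount = 0;
   total_amount + amount;            /* sum statement retains within the group */
   if last.region then output;
run;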

Open APIs

While all of SAS Viya’s graphical user interfaces are designed with consistency of look and feel in mind, the R&D teams have designed the platform to allow almost any front end or REST service to submit commands and receive results from either CAS or its corresponding microservices architecture.  Something new I had to learn was the concept of a CAS action set.  CAS action sets comprise a number of separate actions which can be executed singly or with other actions belonging to the same set.  The cool thing about CAS actions is that there is one for almost any task you can think of doing (kind of like a blend between functions and procs in SAS9).  In fact, all of the visual interfaces SAS deploys utilize CAS actions behind the scenes, and most GUIs will automatically generate code for you if you do not want to write it.

But the real beauty of CAS actions is that you can submit them through different coding interfaces using the open Application Programming Interfaces (APIs) that SAS has written to support external languages like Python, Java, Lua and R (check out GitHub on this topic).  The standardization of using the same CAS action within any type of external interface looks like it will pay huge dividends to anyone investing in this approach.

Write it once, re-use it elsewhere

I think another feature that old and new users alike will adore is the “write it once, re-use it elsewhere” paradigm that CAS actions support.  Here’s an example of code that was used in PROC CAS, then in a Jupyter notebook using Python, followed by an R/REST example.

Proc CAS

proc cas;
dnnTrain / table={name  = 'iris_with_folds'
                   where = '_fold_ ne 19'}
 modelWeights = {name='dl1_weights', replace=true}
 target = "species"
 hiddens = {10, 10} acts={'tanh', 'tanh'}
 sgdopts = {miniBatchSize=5, learningRate=0.1, 
                  maxEpochs=10};
run;

 

Python API

s.dnntrain(table = {'name': 'iris_with_folds',
                    'where': '_fold_ ne 19'},
           modelweights = {'name': 'dl1_weights', 'replace': True},
           target = "species",
           hiddens = [10, 10], acts = ['tanh', 'tanh'],
           sgdopts = {'miniBatchSize': 5, 'learningRate': 0.1,
                      'maxEpochs': 10})

 

R API

cas.deepNeural.dnnTrain(s,
  table = list(name = 'iris_with_folds',
               where = '_fold_ ne 19'),
  modelweights = list(name = 'dl1_weights', replace = TRUE),
  target = "species",
  hiddens = c(10, 10), acts = c('tanh', 'tanh'),
  sgdopts = list(miniBatchSize = 5, learningRate = 0.1,
                 maxEpochs = 10))

 

See how nearly identical each of these three are to one another?  That is the beauty of SAS Viya.  Using a coding approach like this means that I do not need to rely exclusively on finding SAS coding talent anymore.  Younger coders who usually know several open source languages take one look at this, understand it, and can easily incorporate it into what they are already doing.  In other words, they can stay in coding environments that are familiar to them, whilst learning a few new SAS Viya objects that dramatically extend and improve their work.

Analytics Procedure Advantages

Auto-tuning

Next, I want to address some of the advantages of the newer analytics procedures.  One really great new capability is the auto-tuning feature for several machine learning modeling techniques, specifically (extreme) gradient boosting, decision trees, random forests, support vector machines, factorization machines and neural networks.  This capability is hard to find in the open source community: the automatic tuning of the major option settings required by most iterative machine learning techniques.  These settings, known as hyperparameters among data scientists, are searched by built-in optimization routines that try different values and pick the best ones for you (in parallel!).  The process takes longer to run initially, but, wow, the increase in accuracy without going through the normal trial-and-error model building process is worth it for this amazing feature!

Extreme Gradient Boosting, plus other popular ML techniques

Admittedly, xgboost has been in the open source community for a couple of years already, but SAS Viya has its own extreme[1] gradient boosting CAS action ('gbtreeTrain') and accompanying procedure (PROC GRADBOOST).  Both are very close to what Chen (2015, 2016) originally developed, yet have some nice enhancements sprinkled throughout.  One huge bonus is the auto-tuning feature I mentioned above.  Other enhancements include: 1) a more flexible tree-splitting methodology that is not limited to CART (binary tree-splitting), and 2) automatic handling of nominal input variables, versus the 'one-hot encoding' you need to perform in most open source tools.  Plus, there are lots of other detailed option settings for fine tuning and control.
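As a hedged sketch of what that looks like in code (the MYCAS libref and the HMEQ columns below are my own assumptions, not taken from the post), PROC GRADBOOST with autotuning can be invoked along these lines:

/* Extreme gradient boosting with autotuning in SAS Viya;
   assumes HMEQ has been loaded into a caslib bound to MYCAS */
proc gradboost data=mycas.hmeq;
   target bad / level=nominal;
   input loan mortdue value debtinc / level=interval;
   input reason job / level=nominal;   /* nominal inputs handled automatically */
   autotune;                           /* search the hyperparameter space in parallel */
run;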

In SAS Viya, all of the popular machine learning techniques are there as well, and SAS makes it easy for you to explore your data, create your own model tournaments, and generate score code that is easy to deploy.  Model management is currently done through SAS9 (at least until the next SAS Viya release later this year), but good, solid links are provided between SAS Viya and SAS9 to make transferring tasks and output fairly seamless.  Check out the full list of SAS Viya analytics available as of March 2017.

In-memory forecasting

It is hard to beat SAS9 Forecast Server, with its unique 12 patents for automatically diagnosing and generating forecasts, but now all of those industry-leading innovations are also available in SAS Viya’s distributed in-memory environment. And by leveraging SAS Viya’s optimized data shuffling routines, time series data does not need to be sorted, yet it is quickly and efficiently distributed across the shared memory array. The new architecture also gives us a set of new object packages that make more efficient use of the data and run faster than anything witnessed before. For example, we have seen a job covering 1.5 million weekly time series with three years of history that took 130 hours running single-machine and single-threaded reduced to 5 minutes on a 40-core networked array with 10 threads per core. Accurately forecasting 870 gigabytes of information in 5 minutes?!? That truly is amazing!

Conclusions

Though I first ventured into SAS Viya with some trepidation, it soon became clear that the new platform would fundamentally change how I build models and analytics.  In fact, the jumps in performance and the reductions in time spent on routine work have been so compelling that I am having a hard time thinking about going back to a pure SAS9 environment.  For me it’s all about getting “better, faster answers,” and SAS Viya allows me to do just that.   Multi-threaded processing is the way of the future and I want to be part of that, not only for my own personal development, but also because it will help me achieve things for my customers they may not have thought possible before.  If you have not done so already, I highly recommend you go out and sign up for a free trial and check out the benefits of SAS Viya for yourself.


[1] The definition of ‘extreme’ refers only to the distributed, multi-threaded aspect of any boosting technique.

References

Chen, Tianqi, and Carlos Guestrin. “XGBoost: Reliable Large-scale Tree Boosting System.” 2015.

Chen, Tianqi, and Carlos Guestrin. “XGBoost: A Scalable Tree Boosting System.” 2016.

Using the Robust Analytics Environment of SAS Viya was published on SAS Users.

March 17, 2017
 

Editor’s note: This is the second in a series of articles to help current SAS programmers add SAS Viya to their analytics skillset. In this post, Advisory Solutions Architect Steven Sober explores how to accomplish distributed data management using SAS Viya. Read additional posts in the series.

This article in the SAS Viya series will explore how to accomplish distributed data management using SAS Viya. In my next article, we will discuss how SAS programmers can collaborate with their open source colleagues to leverage SAS Viya for distributed data management.

Distributed Data Management

SAS Viya provides a robust, scalable, cloud-ready distributed data management platform. This platform provides multiple techniques for data management that run distributed, i.e., using all cores on all compute nodes defined to the SAS Viya platform. The four techniques we will explore here are the DATA step, PROC DS2, PROC FEDSQL and PROC TRANSPOSE. With these four techniques, SAS programmers and open source programmers can quickly apply complex business rules that stage data for downstream consumption, i.e., analytics, visualizations, and reporting.

The rule for getting your code to run distributed is to ensure all source and target tables reside in the in-memory component of SAS Viya, i.e., Cloud Analytic Services (CAS).

Starting CAS

The following statement is an example of starting a new CAS session. In the coding examples that follow, we will reference this session using the keyword MYSESS. Also note that this CAS session uses one of the default CAS libraries, CASUSER.
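A minimal sketch of such a statement (host and port details are assumed to come from your site configuration):

/* Start a CAS session named MYSESS */
cas mysess;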

Binding a LIBNAME to a CAS session

Now that we have started a CAS session, we can bind a LIBNAME to that session using the following syntax:
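A minimal sketch of that syntax (the libref name MYCAS is an arbitrary choice):

/* Bind the libref MYCAS to the CASUSER caslib in the MYSESS session */
libname mycas cas sessref=mysess caslib=casuser;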

Note: CASUSER is one of the default CAS libraries created when you start a CAS session. In the following coding examples we will utilize CASUSER for our source and target tables that reside in CAS.

To list all default and end-user CAS libraries, use the following statement:
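A minimal sketch of that statement (run while the MYSESS session is active):

caslib _all_ list;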


THREAD program

For PROC DS2 to run distributed in CAS, the rules are:

  • The PROC DS2 DATA program must declare a THREAD program
  • The source and target tables must reside in CAS
  • Unlike the DATA step, with PROC DS2 you use the SESSREF= option to identify the CAS session in which the source and target tables reside (see the sketch after this list)
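A hedged sketch of such a PROC DS2 program (the CASUSER tables, columns, and business rule are hypothetical; MYSESS is the session started earlier):

proc ds2 sessref=mysess;
   /* Thread program: runs on each CAS thread against its slice of the data */
   thread compute_t / overwrite=yes;
      method run();
         set casuser.cars;               /* source table resides in CAS */
         msrp_discounted = msrp * 0.9;   /* example business rule       */
      end;
   endthread;
   run;

   /* Data program: declares the thread and writes the CAS target table */
   data casuser.cars_scored (overwrite=yes);
      dcl thread compute_t t;
      method run();
         set from t;                     /* execute the thread in parallel */
      end;
   enddata;
   run;
quit;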

For PROC TRANSPOSE to run in CAS, the rules are:

    1. All source and target tables must reside in CAS
       a. Like the DATA step, you use a two-level name to reference these tables (see the sketch after this list)
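A hedged sketch (the CASUSER table and its variables are hypothetical):

/* PROC TRANSPOSE runs in CAS because both the source and the target use
   two-level caslib.table names; no pre-sort is needed. */
proc transpose data=casuser.sales out=casuser.sales_wide;
   by product;
   id month;
   var amount;
run;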

    Collaborative distributed data management using SAS Viya was published on SAS Users.

March 11, 2017
     

Ensemble models have been used extensively in credit scoring applications and other areas because they are considered to be more stable and, more importantly, to predict better than single classifiers (see Lessmann et al., 2015). They are also known to reduce model bias and variance (Myoung-Jong et al., 2006; Tsai et al., 2011). The objective of this article is to compare, on four distinct datasets, the predictive accuracy of two ensemble classifiers (Gradient Boosting (GB) and Random Forest (RF)) and two single classifiers (Logistic Regression (LR) and Neural Network (NN)) to determine whether ensemble models are always better. My analysis did not look into optimizing any of these algorithms or into feature engineering, which are the building blocks of arriving at a good predictive model. I also decided to base my analysis on these four algorithms because they are the most widely used methods.

    What is the difference between a single and an ensemble classifier?

    Single classifier

Individual classifiers pursue different objectives to develop a (single) classification model. Statistical methods either estimate the posterior probability P(+|x) directly (e.g., logistic regression), or estimate class-conditional probabilities P(x|class), which they then convert into posterior probabilities using Bayes' rule (e.g., discriminant analysis). Semi-parametric methods, such as NN or SVM, operate in a similar manner, but support different functional forms and require the modeller to select one specification a priori. The parameters of the resulting model are estimated using nonlinear optimization. Tree-based methods recursively partition a data set so as to separate good and bad loans through a sequence of tests (e.g., is loan amount > threshold). This produces a set of rules that facilitate assessing new loan applications. The specific covariates and threshold values to branch a node follow from minimizing indicators of node impurity such as the Gini coefficient or information gain (Baesens, et al., 2003).

    Ensemble classifier

    Ensemble classifiers pool the predictions of multiple base models. Much empirical and theoretical evidence has shown that model combination increases predictive accuracy (Finlay, 2011; Paleologo, et al., 2010). Ensemble learners create the base models in an independent or dependent manner. For example, the bagging algorithm derives independent base models from bootstrap samples of the original data (Breiman, 1996). Boosting algorithms, on the other hand, grow an ensemble in a dependent fashion. They iteratively add base models that are trained to avoid the errors of the current ensemble (Freund & Schapire, 1996). Several extensions of bagging and boosting have been proposed in the literature (Breiman, 2001; Friedman, 2002; Rodriguez, et al., 2006). The common denominator of homogeneous ensembles is that they develop the base models using the same classification algorithm (Lessmann et al., 2015).


    Figure 1: Workflow of single v. ensemble classifiers: derived from the work of Utami, et al., 2014

    Experiment set-up

    Datasets

Before modelling, I partitioned each dataset into a 70% training and a 30% validation set.

    Table 1: Summary of dataset used for model comparisons

    I used SAS Enterprise Miner as a modelling tool.

    Figure 2: Model flow using Enterprise Miner

    Results

Table 2: Results showing misclassification rates for all datasets

    Conclusion

Using misclassification rate as the measure of model performance, RF was the best model on Cardata, Organics_Data and HMEQ, followed closely by NN. NN was the best model on Time_series_data and performed better than the GB ensemble model on Organics_Data and Cardata.

My findings partly support the hypothesis that ensemble models naturally do better than single classifiers, but not in all cases. NN, which is a single classifier, can be very powerful, unlike most classifiers (single or ensemble), which are kernel machines and data-driven. NNs can generalize to unseen data and act as universal function approximators (Zhang, et al., 1998).

    According to Kaggle CEO and Founder, Anthony Goldbloom:

    “In the history of Kaggle competitions, there are only two Machine Learning approaches that win competitions: Handcrafted & Neural Networks”.

    What are your thoughts?

    Are ensemble classifiers always better than single classifiers? was published on SAS Users.

January 26, 2017
     

SAS® Viya™ 3.1 represents the third generation of high performance computing from SAS. Our journey started a long time ago and, along the way, we have introduced a number of high performance technologies into the SAS software platform.

    Introducing Cloud Analytic Services (CAS)

SAS Viya introduces Cloud Analytic Services (CAS) and continues this story of high performance computing.  CAS is the runtime engine and microservices environment for data management and analytics in SAS Viya, and it introduces some new and interesting innovations for customers. CAS is an in-memory technology designed for scale and speed. Whilst it can be set up on a single machine, it is more commonly deployed across a number of nodes in a cluster of computers for massively parallel processing (MPP). The parallelism is further increased when we consider using all the cores within each node of the cluster for multi-threaded, analytic workload execution. In an MPP environment, just because there are a number of nodes doesn’t mean that using all of them is always the most efficient choice for analytic processing. CAS maintains node-to-node communication in the cluster and uses an internal algorithm to determine the optimal distribution and number of nodes to run a given process.

However, processing in-memory can be expensive, so what happens if your data doesn’t fit into memory? Well, CAS has that covered. CAS will automatically spill data to disk in such a way that only the data required for processing are loaded into the memory of the system. The rest of the data are memory-mapped to the filesystem in an efficient way for loading into memory when required. This way of working means that CAS can handle data that are larger than the available memory that has been assigned.

    The CAS in-memory engine is made up of a number of components - namely the CAS controller and, in an MPP distributed environment, CAS worker nodes. Depending on your deployment architecture and data sources, data can be read into CAS either in serial or parallel.

What about resilience to data loss if a node in an MPP cluster becomes unavailable? Well, CAS has that covered too. CAS maintains a replicate of the data within the environment. The number of replicates can be configured, but the default is to maintain one extra copy of the data within the environment. This is done efficiently by having the replicate data blocks cached to disk as opposed to consuming resident memory.

One of the most interesting developments with the introduction of CAS is the way that an end user can interact with SAS Viya. CAS actions are a new programming construct, and if you are a Python, Java, SAS or Lua developer you can communicate with CAS using an interactive computing environment such as a Jupyter Notebook. One of the benefits of this is that a Python developer, for example, can utilize SAS analytics on a high performance, in-memory distributed architecture, all from their Python programming interface. In addition, we have introduced open REST APIs, which means you can call native CAS actions and submit code to the CAS server directly from a web application or other programs written in any language that supports REST.
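From SAS itself, the same construct is available through PROC CAS; a minimal sketch is shown below (the IRIS table in the CASUSER caslib is an assumption on my part), and the identical simple.summary action could be invoked from Python, R, Lua, Java, or via a REST call:

proc cas;
   /* Call a CAS action natively: summary statistics for a loaded table */
   simple.summary / table={name="iris", caslib="casuser"};
run;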

Whilst CAS represents the most recent step in our high performance journey, SAS Viya does not replace SAS 9. These two platforms can co-exist, even on the same hardware, and indeed can communicate with one another to leverage the full range of technology and innovations from SAS. To find out more about CAS, take a look at the early preview trial. Or, if you would like to explore the capabilities of SAS Viya with respect to your current environment and business objectives, speak to your local SAS representative about arranging a ‘Path to SAS Viya workshop’ with SAS.

    Many thanks to Fiona McNeill, Mark Schneider and Larry LaRusso for their input and review of this article.

     

tags: Global Technology Practice, high-performance analytics, SAS Grid Manager, SAS Visual Analytics, SAS Visual Statistics, SAS Viya

    A journey of SAS high performance was published on SAS Users.

December 20, 2016
     

    Joining tables with PROC FORMAT

    The title of this post borrows from Stanley Kubrick’s 1964 comedy “Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb.” It stars the great Peter Sellers as the titular character as well as George C. Scott and Slim Pickens. The black and white film is strange and brilliant just like Kubrick was. Similarly, as I was experiencing the issue I outline below and was told of this solution, I thought two things. The first was “what a strange but brilliant solution” and the second one was “I’ll try anything as long as it works.”   Thus, a title was born. But enough about that. Why are we here?

    Problem

    You want to add a couple of columns of information to your already large dataset, but each time you try to join the tables you run out of memory!  For example, you want to append latitude and longitude values from Table B to an existing list of customer phone numbers in Table A.

    You’ve tried this and got nowhere fast:

proc sort data = demo.tablea;
   by npa nxx;
run;

proc sort data = demo.tableb;
   by npa nxx;
run;

data demo.aunionb;
   merge demo.tablea (in=a) demo.tableb (in=b);
   by npa nxx;
   if a;
run;

    And then you tried this and also got nowhere (albeit a little slower):

proc sql;
   create table demo.aunionb as
   select *
   from demo.tablea a
   left join demo.tableb b on (a.npa = b.npa) and (a.nxx = b.nxx);
quit;

    Solution - Joining tables with PROC FORMAT

    Use PROC FORMAT!

    Here’s how:

    First, take Table B and create character equivalents of the fields required in your join (assuming they aren’t characters already). In this example, NPA and NXX are the two fields that you are joining on. They will be your key once you concatenate them.  Next, create character equivalents of the fields that you want appended.

data work.tableb (keep = npa_nxx lat_c long_c);
   set demo.tableb;

   npa_c = compress(put(npa, best10.));
   nxx_c = compress(put(nxx, best10.));

   npa_nxx = catx('_', npa_c, nxx_c);

   lat_c = compress(put(latitude, best14.3));
   long_c = compress(put(longitude, best14.3));
run;

Next, make sure that you have only unique values of your key by using PROC SORT with the NODUPRECS option.

    Now, create a table that will be used as the input into PROC FORMAT. In this example, you are creating a table that will contain the formats for the latitude column.

proc sort data = work.tableb noduprecs;
   by npa_nxx;
run;

data demo.tableb_lat_fmt(keep=fmtname type start label);
   retain fmtname 'lat_f' type 'C';
   set work.tableb;

   if npa_nxx = '._.' then start = 'Other  ';
   else start = npa_nxx;
   label = lat_c;
run;

proc sort data = demo.tableb_lat_fmt;
   by start;
run;

    This step creates a table that includes the format name (lat_f), the format type (C), the key field (start) and its corresponding latitude value (label).  Sort this table by the ‘start’ column and then repeat this step for every column you wish to append, with each column getting its own unique format and table.
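For instance, the repeated step for the longitude column might look like this (a sketch that simply mirrors the latitude step above, using the long_f format name that the later steps expect):

data demo.tableb_long_fmt(keep=fmtname type start label);
   retain fmtname 'long_f' type 'C';
   set work.tableb;

   if npa_nxx = '._.' then start = 'Other  ';
   else start = npa_nxx;
   label = long_c;
run;

proc sort data = demo.tableb_long_fmt;
   by start;
run;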

    Now run PROC FORMAT using the CNTLIN option pointing to the tables that you just created in order to create your format.

    proc format cntlin=demo.tableb_lat_fmt; 
    run; 
    proc format cntlin=demo.tableb_long_fmt; 
    run;

    Now all you have to do is run your data step to create the resultant dataset with the appended values.

    data demo.aunionb (drop = npa_nxx); 
    set demo.tablea; 
     
    npa_nxx = catx('_',compress(put(npa,best10.)),compress(put(nxx, best10.)));
     
    latitude = input(put(npa_nxx, $lat_f.), BEST.); 
    longitude = input(put(npa_nxx, $long_f.), BEST.);
     
    run;

This step creates three columns: npa_nxx, latitude, and longitude. Npa_nxx is the key built from the NPA and NXX values. Latitude and longitude are then populated with the formatted value of npa_nxx, which in this case is the character equivalent of the original latitude or longitude, and the INPUT function converts that value back into a numeric field.

    The result is a clever way to add columns to a dataset, much like a VLOOKUP function works in Microsoft Excel, without the hassle of running out of memory space.

    Notes:

    1. The author realizes there are other, more boring ways of tackling this issue like indexing and using WHERE statements, but where’s the fun in that?
    2. This solution may not be right for you. See your doctor if you experience any of the following symptoms:  memory loss, headache, frustration, Cartesian rage, cranial-keyboard embedment or memory loss.
    tags: Global Technology Practice, Joining tables, PROC FORMAT, SAS Programmers, tips & techniques

    Dr. Strangeformat or: How I Learned to Stop Joining Tables and Love the PROC was published on SAS Users.

December 10, 2016
     

    During a recent customer visit, I was asked how to include a calculated variable within SAS Asset Performance Analytics’ (APA) Root Cause Analysis workflow. This is a simple request. Is there a simple approach to do this?

    To remind you, in the APA workflow, an ETL Administrator makes a Data Mart available in the solution for the APA users. They can select variables and explore, analyze and create a model based on the columns present in the Data Mart.

    But if you want to analyze a calculated column, like a difference between two variables, do you need to change the Data Mart? Yes, if you want the APA Data Selection to include this calculated column. But this takes time, and do you really need to change the Data Mart? No!

    A simpler and faster approach to adding a calculated column is modifying the APA Root Cause Analysis workflow. And, this is simple!

SAS Asset Performance Analytics is highly configurable. You can easily customize analytical workflows by modifying their underlying stored processes. Let me show you how to customize an existing analysis and add a calculation step to enhance APA’s Root Cause Analysis with a calculated column.

    Benefits

The main purpose of this customized analysis is to avoid having to use SAS Enterprise Guide. APA users are rarely SAS experts, so asking them to switch between tools depending on which functionality is available in the APA GUI isn’t recommended. The more you can do within the APA interface via wizard-guided workflows, the easier it will be.

The second benefit is keeping the Data Mart limited to only crucial variables. Instead of asking an ETL Administrator to add non-validated and/or infrequently used calculated columns to the Data Mart, allow the APA user to test and create meaningful tags to enhance workflows as needed. Once the APA user identifies new and meaningful calculated variables, they can easily be added to the Data Mart and made available to APA Explorations and APA Stability Monitoring. Limiting the Data Mart to only critical variables ensures the data size stays optimized and is adjusted only as needed.

    Root Cause Analysis Use

Using this new Root Cause Analysis is very easy: instead of selecting “Root Cause Analysis,” select “Root Cause Analysis Calculated Columns” when required.


    Figure 1: Stored Process Folder allows us to choose the appropriate Analysis

    Clicking on “Ok” triggers the same steps as the original RCA.


    Figure 2: steps triggered by the Root Cause Analysis

    After specifying a data selection, the “Filter Data” step contains six new user prompts:

    Figure 3: New “Filter Data” interface

These prompts allow the user to define what calculation is needed. Six calculation choices are currently available, and they can be further customized as needed. The calculation types currently available include:

    1. Difference between two variables if you want to study a gap between two tags
    2. Absolute difference of two variables if large gaps are suspicious regardless of the order
    3. Ratio of two variables if you want to control the proportionality
    4. Multiplication of two variables
    5. Summation of two variables
    6. None

    By default, the Calculation option is set to “None,” to perform a classical Root Cause Analysis.

    Figure 4: Dropdown list with calculation details

After choosing your variables, you can apply the variable coefficients you want to test. By default, APA performs an absolute difference, Variable 1 – Variable 2, with the coefficients set to 1. If the variables don’t have the same order of magnitude, you can apply a coefficient to put the variables on the same level. By doing this, the newly created calculated variable fluctuates around 0, which is easy to interpret.

In the example below, the goal is to compute a simple absolute difference between Variable 2 (V.Four_Press_Clap) and Variable 1 (V.Four_Press_C). By default, the newly created column is named DIFF_TAGS; you can change it to something more descriptive for your purposes. Don’t forget that the new column name must follow standard SAS column naming conventions.

Before processing the step, be careful to check that the data is not missing during the data period you’ve selected. If it is missing for the full time period, the DIFF_TAGS variable will not be created and the output will be the same as if you’d selected “None” for the calculation prompt.

    Figure 5: DIFF_TAGS variable creation corresponding to the result of Four_Press_Clap – Four_Press_C

    Click “Run and Save Step.” As a result, the newly created calculated column is added to your input RCA data and available in the “Expand and Build Statistics” step. Now you can apply minimum and maximum value bands to the calculated column similar to the original RCA workflow.

    Figure 6: minimum and maximum values to consider as atypical

    As a result, the calculated column is used like a standard input tag during the full Root Cause Analysis workflow.

    Things to keep in mind

If you have a specific need, SAS Asset Performance Analytics is easily customizable. In France, for example, we offer an unlimited and modular APA interface.

    Creating columns within the APA user interface analysis workflow has two main benefits:

    1. The calculation is intuitive for a non SAS user
    2. You don’t need to change the data mart

    Only two steps are required to implement Root Cause Analysis including Calculated Columns within APA:

1. Copy and paste the pam_rca_filterData_NEW.sas file into the SASHome/SASFoundation/9.4/pamsrmv/sasstp directory.
    2. Using the SAS Management Console, import the RCA_CC.spk file at the following location: /Products/SAS Asset Performance Analytics/Analytical Workbench in the Folders tab.

    If you have any questions about this process, feel free to add a comment below.

    tags: Global Technology Practice, SAS Asset Performance Analytics

    Enhancing SAS Asset Performance Analytics’ Root Cause Analysis with Calculated Columns was published on SAS Users.

December 2, 2016
     

Nearly every organization has to deal with big data, and that often means dealing with big data problems. For some organizations, especially government agencies, addressing these problems provides more than a competitive advantage; it helps them ensure public confidence in their work or meet standards mandated by law. In this blog I wanted to share with you how SAS worked with a government revenue collection agency to successfully manage their big data issues and seamlessly integrate with Hadoop and other technologies.

    Hadoop Security

We all know Hadoop pretty well, and if you haven’t heard of Hadoop yet, it is about time you invest some resources to learn more about this emerging de facto standard for storage and compute. The core of Apache Hadoop consists of a storage part, known as HDFS (Hadoop Distributed File System), and a processing part (MapReduce). Hadoop splits large files into large blocks and distributes them across the nodes of a cluster.

Hadoop was initially developed to solve web-scale problems like webpage search and indexing at Yahoo. However, the potential of the platform to handle big data and analytics caught the attention of a number of industries. Since the initial use of Hadoop was to count webpages and implement algorithms like PageRank, security was never considered a major requirement until it started being used by enterprises across the world.

Security incidents and massive fines have become commonplace, and financial institutions, in particular, are doing everything they can to avoid such incidents. Security should never be an afterthought; it should be considered in the initial design of the system. The five core pillars of enterprise security are administration, authentication, authorization, auditing, and data protection.

Our customer had the first four pillars, from administration to auditing, covered using tools provided by their Hadoop vendor. While there are options in the open-source community that provide data protection, in this case the organization decided to use a data security company to protect data at rest on top of Cloudera Navigator Encryption. They refer to it as “Double Encryption.”

    The challenge

SAS has multiple products around the Hadoop ecosystem to provide the best support for customers. The traditional way of working with Hadoop involves SAS/ACCESS, which can pull data from Hadoop using Hive. However, for larger installations where data movement is a concern, SAS provides Embedded Process technology, which allows you to push SAS code inside a Hadoop cluster and run it alongside the data blocks. This is a super-efficient way to access large data sets inside Hadoop, because the compute is pushed to the data.
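As a hedged sketch of the traditional SAS/ACCESS route (the server, port, schema, and table names below are placeholders, not details from this project):

/* Query Hive through SAS/ACCESS Interface to Hadoop */
libname hdp hadoop server="hive-node.example.com" port=10000 schema=default;

proc sql;
   select count(*) as row_count
   from hdp.web_logs;          /* hypothetical Hive table */
quit;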

Our customer's data security vendor’s product supports access via Hive UDFs, which means you can tokenize and detokenize data when working with SAS/ACCESS Interface to Hadoop using PROC SQL and other options, relatively out of the box. In addition, the security company’s API can be wrapped (using PROC FCMP and PROC PROTO) to add new SAS language functions for tokenizing and detokenizing data directly in Base SAS.

However, SAS Embedded Process has no default support for our customer's security vendor, so the SAS products that utilize the SAS EP (SAS Code Accelerator, SAS Scoring Accelerator, and the LASR-based products) cannot work with data tokenized by the vendor. This was a major challenge for our customer, who wanted to use SAS products like SAS Visual Analytics and SAS Visual Statistics on large volumes of data stored in Hadoop.

The challenge, then, was to make SAS Embedded Process work with their data security vendor’s software to perform detokenization before passing the data to SAS procedures.

    The possible solutions

    We considered various solutions before agreeing on a solution that satisfies all current requirements and could be extended to meet the future needs of our customer. Let’s discuss the top two solutions and the final implementation.

    Solution 1: SERDE approach

Our first approach was to create a custom Hive SerDe that wraps the data security company’s APIs. With 9.4M3, the SAS Embedded Process (EP) can read and write via SerDe APIs, with some possible constraints and limitations, including DS2’s SET/MERGE capabilities and the passing of identity credentials from SAS to the company’s APIs.


The approach had various drawbacks, and the biggest was working with the many different file formats. It was discarded because it would have meant lots of rework with every new data format released by the Hadoop community. While it is true that an organization will generally standardize on a few formats for its use cases, it is nonetheless a limiting factor.

    Solution 2: Use HDMD with Custom Input Formats

The second approach was to use HDMD with custom input formats. SAS HDMD supports custom input formats, which allow you to plug in your own input format. A high-level architectural diagram looks something like Figure 2.  This approach works with a variety of file formats, and we have tested it with Parquet, Avro and ORC with good results. The objective is to load a dataset onto Hadoop (or use an existing data set) and generate an HDMD file for the dataset. We plug our custom reader into the HDMD file, and as part of the custom reader we make a number of calls to the data security company’s API. The API calls the security vendor’s protect and unprotect procedures to protect and/or unprotect the data, depending on the requirements, and passes the results back to the client.


    What is an Input/Custom input format BTW?

Data inside Hadoop is typically stored on HDFS (Hadoop Distributed File System). The data needs to be read from the filesystem before being processed. This is achieved using an input format, which has the following responsibilities:

• Compute input splits
  • Input splits represent the portion of the data that will be processed by each map phase. A unique input split is passed to each process. At the start of a MapReduce job, the input format splits the data into multiple parts based on logical record boundaries and the HDFS block size. To get the input splits, the following method is called:
    • List<InputSplit> getSplits(JobContext ctx)
• Provide the logic to read an input split
  • Each mapper gets a unique input split to process. The input format provides the logic to read the split, which is an implementation of the RecordReader interface. The record reader reads the split and emits <key,value> pairs as input for each map function. The record reader is created using the following method:
    • RecordReader<K,V> createRecordReader(InputSplit is, TaskAttemptContext ctx)

    All the common formats will provide a way to split the data and read records. However, if you want to read a custom data set for which data parsing isn’t available out of the box with Hadoop, you are better off writing a custom input format.

    How to write a Custom Input Format?

Writing a custom input format requires Java skills (Java being the programming language in which Hadoop is written). You have the option to implement the abstract methods of the InputFormat class, or to extend one of the pre-existing input formats. In our case, we extended FileInputFormat and overrode a few critical methods, such as:

    • getSplits()
    • getRecordReader()

getSplits() creates the splits from the input data, while getRecordReader() should return an instance of a Java object that knows how to read the custom records, which in our case meant calling the security vendor’s API.

You can use one of the predefined RecordReader classes or implement your own (most likely, if you are writing a custom input format). In our case, we implemented the RecordReader interface and its next() method, which is called whenever a new record is read. This is the method where your core business logic is implemented. In our case, we wrote the integration logic by looking at the data, determining the user who is logged in (available as part of the JobConf object), and then calling the vendor’s APIs to decrypt the data. Sample code can be requested by contacting me directly.

    Integrating a custom input format with SAS

    Integrating a custom input format is fairly easy with SAS. SAS allows us to plug in custom formats, which are called before the data is processed via SAS Embedded Process using HDMD files.

    When you generate an HDMD file using PROC HDMD, you can specify your custom input format as a part of the generated XML file. Please refer to PROC HDMD documentation.

The generated HDMD file is an XML document that captures the table’s metadata along with the reference to the custom input format class.

    When loading the data from HDFS, SAS will ensure that the specified input format is called prior to any data processing taking place.

    The ultimate solution

The solution was demonstrated using data from the tax authorities and included tokenization of data via Hive UDFs, detokenization of data according to the policies set on the data security appliance, and analytics using SAS Visual Analytics. Only users with permissions on the specific policy were able to view the data, while users with no permissions had no access to decrypted data. This additional security helped the enterprise protect users’ information from inadvertent access and resulted in widespread use of big data technologies within the enterprise.

    Summary

    As you can see from the example above, SAS is open for business, and is already providing deep integration with Hadoop and other technologies using custom APIs. The sky is the limit for people willing to explore the capabilities of SAS.

    tags: Global Technology Practice, Hadoop, SAS/ACCESS Interface to Hadoop

    SAS integration with Hadoop - one success story was published on SAS Users.
