Global Technology Practice


Editor’s note: This is the second in a series of articles to help current SAS programmers add SAS Viya to their analytics skillset. In this post, Advisory Solutions Architect Steven Sober explores how to accomplish distributed data management using SAS Viya. Read additional posts in the series.

This article explores how to accomplish distributed data management using SAS Viya. In my next article, we will discuss how SAS programmers can collaborate with their open source colleagues to leverage SAS Viya for distributed data management.

Distributed Data Management

SAS Viya provides a robust, scalable, cloud-ready distributed data management platform. The platform provides multiple data management techniques that run distributed, i.e., using all cores on all compute nodes defined to the SAS Viya platform. The four techniques we will explore here are DATA Step, PROC DS2, PROC FEDSQL and PROC TRANSPOSE. With these four techniques, SAS programmers and open source programmers can quickly apply complex business rules that stage data for downstream consumption, i.e., analytics, visualizations, and reporting.

The rule for getting your code to run distributed is to ensure all source and target tables reside in the in-memory component of SAS Viya, i.e., Cloud Analytic Services (CAS).

Starting CAS

The following statement is an example of starting a new CAS session. In the coding examples that follow, we will reference this session using the keyword MYSESS. Also note that this CAS session uses one of the default CAS libraries, CASUSER.
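A minimal sketch (the session options shown are typical choices, not the article’s exact code):

cas mysess sessopts=(caslib=casuser timeout=1800 locale="en_US");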

Binding a LIBNAME to a CAS session

Now that we have started a CAS session, we can bind a LIBNAME to that session using the following syntax:
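A minimal sketch, assuming the session above (the libref name MYCAS is illustrative):

libname mycas cas sessref=mysess;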

Note: CASUSER is one of the default CAS libraries created when you start a CAS session. In the following coding examples we will utilize CASUSER for our source and target tables that reside in CAS.

To list all default and end-user CAS libraries, use the following statement:
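The statement in question is likely the CASLIB statement with the _ALL_ LIST option; a minimal sketch:

caslib _all_ list;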

More information on CAS libraries is available in the SAS Viya documentation.

PROC DS2

For PROC DS2 to run in CAS the rules are (illustrated in the sketch below):

  • The PROC DS2 DATA program must declare a THREAD program
  • The source and target tables must reside in CAS
  • Unlike DATA Step, with PROC DS2 you use the SESSREF= parameter to identify the CAS session in which the source and target tables reside
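Putting those rules together, a minimal sketch of a thread-based PROC DS2 program running in CAS (the CASUSER.SALES table and its AMOUNT column are illustrative, not from the article):

proc ds2 sessref=mysess;
   thread t_filter / overwrite=yes;
      method run();
         set casuser.sales;                /* source table resides in CAS  */
         if amount > 1000 then output;     /* keep only large transactions */
      end;
   endthread;

   data casuser.big_sales (overwrite=yes); /* target table resides in CAS  */
      dcl thread t_filter t;
      method run();
         set from t;                       /* run the thread distributed   */
      end;
   enddata;
run;
quit;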

For PROC TRANSPOSE to run in CAS the rules are:

    1. All source and target tables must reside in CAS
       a. Like DATA Step, you use a two-level name to reference these tables (see the sketch below)
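A minimal sketch (the table and variable names are illustrative):

proc transpose data=casuser.sales out=casuser.sales_t;
   by cust_id;
   id month;
   var amount;
run;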

    Collaborative distributed data management using SAS Viya was published on SAS Users.


Ensemble models have been used extensively in credit scoring applications and other areas because they are considered to be more stable and, more importantly, predict better than single classifiers (see Lessmann et al., 2015). They are also known to reduce model bias and variance (Myoung-Jong et al., 2006; Tsai et al., 2011). The objective of this article is to compare the predictive accuracy of two ensemble classifiers (gradient boosting (GB) and random forest (RF)) and two single classifiers (logistic regression (LR) and neural network (NN)) on four distinct datasets to determine whether ensemble models are, in fact, always better. My analysis did not look into optimizing any of these algorithms or feature engineering, which are the building blocks of arriving at a good predictive model. I also decided to base my analysis on these four algorithms because they are the most widely used methods.

    What is the difference between a single and an ensemble classifier?

    Single classifier

Individual classifiers pursue different objectives to develop a (single) classification model. Statistical methods either estimate the posterior probability P(+|x) directly (e.g., logistic regression), or estimate the class-conditional probabilities P(x|y), which they then convert into posterior probabilities using Bayes’ rule (e.g., discriminant analysis). Semi-parametric methods, such as NN or SVM, operate in a similar manner, but support different functional forms and require the modeller to select one specification a priori. The parameters of the resulting model are estimated using nonlinear optimization. Tree-based methods recursively partition a data set so as to separate good and bad loans through a sequence of tests (e.g., is loan amount > threshold). This produces a set of rules that facilitate assessing new loan applications. The specific covariates and threshold values to branch a node follow from minimizing indicators of node impurity such as the Gini coefficient or information gain (Baesens, et al., 2003).

    Ensemble classifier

    Ensemble classifiers pool the predictions of multiple base models. Much empirical and theoretical evidence has shown that model combination increases predictive accuracy (Finlay, 2011; Paleologo, et al., 2010). Ensemble learners create the base models in an independent or dependent manner. For example, the bagging algorithm derives independent base models from bootstrap samples of the original data (Breiman, 1996). Boosting algorithms, on the other hand, grow an ensemble in a dependent fashion. They iteratively add base models that are trained to avoid the errors of the current ensemble (Freund & Schapire, 1996). Several extensions of bagging and boosting have been proposed in the literature (Breiman, 2001; Friedman, 2002; Rodriguez, et al., 2006). The common denominator of homogeneous ensembles is that they develop the base models using the same classification algorithm (Lessmann et al., 2015).
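To make the bagging idea concrete, here is a minimal SAS sketch (not part of the original study): it draws bootstrap replicates, fits one logistic regression base model per replicate, and averages the predicted probabilities. The datasets WORK.TRAIN and WORK.VALID, the target BAD, the inputs X1-X5 and the row identifier ID are all hypothetical.

/* Draw 10 bootstrap samples (unrestricted sampling with replacement) */
proc surveyselect data=work.train out=work.boot seed=42
                  method=urs samprate=1 outhits reps=10;
run;

/* Fit one base model per bootstrap replicate and score the validation set */
proc logistic data=work.boot noprint;
   by replicate;
   model bad(event='1') = x1-x5;
   score data=work.valid out=work.scored;
run;

/* The bagged prediction is the mean of the base-model probabilities */
proc sql;
   create table work.bagged as
   select id, mean(p_1) as p_ensemble
   from work.scored
   group by id;
quit;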


Figure 1: Workflow of single vs. ensemble classifiers, derived from the work of Utami et al., 2014

    Experiment set-up

    Datasets

Before modelling, I partitioned each dataset into 70% training and 30% validation sets.

    Table 1: Summary of dataset used for model comparisons

    I used SAS Enterprise Miner as a modelling tool.

    Figure 2: Model flow using Enterprise Miner

    Results

Table 2: Results showing misclassification rates for all datasets

    Conclusion

Using misclassification rate as the measure of model performance, RF was the best model on Cardata, Organics_Data and HMEQ, followed closely by NN. NN was the best model on Time_series_data, and it performed better than the GB ensemble on Organics_Data and Cardata.

My findings partly support the hypothesis that ensemble models naturally do better than single classifiers, but not in all cases. NN, although a single classifier, can be very powerful: unlike most classifiers (single or ensemble), which are kernel machines and data-driven, NNs can generalize to unseen data and act as universal function approximators (Zhang, et al., 1998).

    According to Kaggle CEO and Founder, Anthony Goldbloom:

    “In the history of Kaggle competitions, there are only two Machine Learning approaches that win competitions: Handcrafted & Neural Networks”.

    What are your thoughts?

    Are ensemble classifiers always better than single classifiers? was published on SAS Users.


SAS® Viya™ 3.1 represents the third generation of high performance computing from SAS. Our journey started a long time ago and, along the way, we have introduced a number of high performance technologies into the SAS software platform, such as SAS Grid Manager, the SAS High-Performance Analytics products, and the in-memory engine behind SAS Visual Analytics and SAS Visual Statistics.

    Introducing Cloud Analytic Services (CAS)

SAS Viya introduces Cloud Analytic Services (CAS) and continues this story of high performance computing. CAS is the runtime engine and microservices environment for data management and analytics in SAS Viya, and it introduces some new and interesting innovations for customers. CAS is an in-memory technology designed for scale and speed. Whilst it can be set up on a single machine, it is more commonly deployed across a number of nodes in a cluster of computers for massively parallel processing (MPP). The parallelism is further increased when we consider using all the cores within each node of the cluster for multi-threaded, analytic workload execution. In an MPP environment, just because there are a number of nodes doesn’t mean that using all of them is always the most efficient approach for analytic processing. CAS maintains node-to-node communication in the cluster and uses an internal algorithm to determine the optimal distribution and number of nodes to run a given process.

However, processing in-memory can be expensive, so what happens if your data doesn’t fit into memory? Well, CAS has that covered. CAS will automatically spill data to disk in such a way that only the data required for processing are loaded into the memory of the system. The rest of the data are memory-mapped to the filesystem in an efficient way for loading into memory when required. This way of working means that CAS can handle data that are larger than the memory that has been assigned to it.

    The CAS in-memory engine is made up of a number of components - namely the CAS controller and, in an MPP distributed environment, CAS worker nodes. Depending on your deployment architecture and data sources, data can be read into CAS either in serial or parallel.

What about resilience to data loss if a node in an MPP cluster becomes unavailable? Well, CAS has that covered too. CAS maintains replicate copies of the data within the environment. The number of replicates can be configured, but the default is to maintain one extra copy of the data within the environment. This is done efficiently by caching the replicate data blocks to disk as opposed to consuming resident memory.

One of the most interesting developments with the introduction of CAS is the way an end user can interact with SAS Viya. CAS actions are a new programming construct, and if you are a Python, Java, SAS or Lua developer you can communicate with CAS from an interactive computing environment such as a Jupyter Notebook. One of the benefits of this is that a Python developer, for example, can utilize SAS analytics on a high performance, in-memory distributed architecture, all from their Python programming interface. In addition, we have introduced open REST APIs, which means you can call native CAS actions and submit code to the CAS server directly from a web application or any other program written in a language that supports REST.
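For example, a SAS programmer can invoke a CAS action directly with PROC CAS; a minimal sketch, assuming an illustrative CASUSER.CARS table:

proc cas;
   /* Run the simple.summary action to compute descriptive statistics */
   simple.summary / table={caslib="casuser", name="cars"};
run;
quit;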

    Whilst CAS represents the most recent step in our high performance journey, SAS Viya does not replace SAS 9. These two platforms can co-exist, even on the same hardware, and indeed can communicate with one another to leverage the full range of technology and innovations from SAS. To find out more about CAS, take a look at the early preview trial. Or, if you would like to explore the capabilities of SAS Viya with respect to your current environment and business objectives speak to your local SAS representative about arranging a ‘Path to SAS Viya workshop’ with SAS.

    Many thanks to Fiona McNeill, Mark Schneider and Larry LaRusso for their input and review of this article.

     

tags: Global Technology Practice, high-performance analytics, SAS Grid Manager, SAS Visual Analytics, SAS Visual Statistics, SAS Viya

    A journey of SAS high performance was published on SAS Users.

December 20, 2016

    Joining tables with PROC FORMAT

    The title of this post borrows from Stanley Kubrick’s 1964 comedy “Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb.” It stars the great Peter Sellers as the titular character as well as George C. Scott and Slim Pickens. The black and white film is strange and brilliant just like Kubrick was. Similarly, as I was experiencing the issue I outline below and was told of this solution, I thought two things. The first was “what a strange but brilliant solution” and the second one was “I’ll try anything as long as it works.”   Thus, a title was born. But enough about that. Why are we here?

    Problem

    You want to add a couple of columns of information to your already large dataset, but each time you try to join the tables you run out of memory!  For example, you want to append latitude and longitude values from Table B to an existing list of customer phone numbers in Table A.

    You’ve tried this and got nowhere fast:

proc sort data = demo.tablea;
by npa nxx;
run;
proc sort data = demo.tableb;
by npa nxx;
run;
     
    data demo.aunionb;
    merge demo.tablea (in=a) demo.tableb (in=b);
    by npa nxx;
    if a;
    run;

    And then you tried this and also got nowhere (albeit a little slower):

proc sql;
   create table demo.aunionb as
   select *
   from demo.tablea a
   left join demo.tableb b on (a.npa = b.npa) and (a.nxx = b.nxx);
quit;

    Solution - Joining tables with PROC FORMAT

    Use PROC FORMAT!

    Here’s how:

    First, take Table B and create character equivalents of the fields required in your join (assuming they aren’t characters already). In this example, NPA and NXX are the two fields that you are joining on. They will be your key once you concatenate them.  Next, create character equivalents of the fields that you want appended.

data work.tableb (keep = npa_nxx lat_c long_c);
set demo.tableb;
 
npa_c = compress(put(npa, best10.));
nxx_c = compress(put(nxx, best10.));
 
npa_nxx = catx('_', npa_c, nxx_c);
 
lat_c = compress(put(latitude, best14.3));
long_c = compress(put(longitude, best14.3));
run;

Next, make sure that you have only unique values of your key. Use PROC SORT with the NODUPRECS option.

    Now, create a table that will be used as the input into PROC FORMAT. In this example, you are creating a table that will contain the formats for the latitude column.

proc sort data = work.tableb noduprecs;
by npa_nxx;
run;
 
data demo.tableb_lat_fmt(keep=fmtname type start label hlo);
retain fmtname 'lat_f' type 'C';
set work.tableb;
 
if npa_nxx = '._.' then do;
   start = 'Other  ';
   hlo = 'O';   /* mark this entry as the OTHER range for PROC FORMAT */
end;
else start = npa_nxx;
label = lat_c;
run;
proc sort data = demo.tableb_lat_fmt;
by start;
run;

This step creates a table that includes the format name (lat_f), the format type (C), the key field (start), its corresponding latitude value (label) and an HLO flag that marks the catch-all OTHER entry. Sort this table by the ‘start’ column and then repeat this step for every column you wish to append, with each column getting its own unique format and table; the matching longitude step is sketched below.
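A hedged sketch of that longitude step, mirroring the latitude code above:

data demo.tableb_long_fmt(keep=fmtname type start label hlo);
retain fmtname 'long_f' type 'C';
set work.tableb;
 
if npa_nxx = '._.' then do;
   start = 'Other  ';
   hlo = 'O';
end;
else start = npa_nxx;
label = long_c;
run;
proc sort data = demo.tableb_long_fmt;
by start;
run;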

    Now run PROC FORMAT using the CNTLIN option pointing to the tables that you just created in order to create your format.

    proc format cntlin=demo.tableb_lat_fmt; 
    run; 
    proc format cntlin=demo.tableb_long_fmt; 
    run;

    Now all you have to do is run your data step to create the resultant dataset with the appended values.

    data demo.aunionb (drop = npa_nxx); 
    set demo.tablea; 
     
    npa_nxx = catx('_',compress(put(npa,best10.)),compress(put(nxx, best10.)));
     
    latitude = input(put(npa_nxx, $lat_f.), BEST.); 
    longitude = input(put(npa_nxx, $long_f.), BEST.);
     
    run;

This step creates three columns: npa_nxx, latitude, and longitude. Npa_nxx is the key built from the NPA and NXX values. Latitude and longitude are then populated with the formatted value of npa_nxx, which in this case is the character equivalent of the original latitude or longitude; the INPUT function then converts that value back into a numeric field.

    The result is a clever way to add columns to a dataset, much like a VLOOKUP function works in Microsoft Excel, without the hassle of running out of memory space.

    Notes:

    1. The author realizes there are other, more boring ways of tackling this issue like indexing and using WHERE statements, but where’s the fun in that?
    2. This solution may not be right for you. See your doctor if you experience any of the following symptoms:  memory loss, headache, frustration, Cartesian rage, cranial-keyboard embedment or memory loss.
    tags: Global Technology Practice, Joining tables, PROC FORMAT, SAS Programmers, tips & techniques

    Dr. Strangeformat or: How I Learned to Stop Joining Tables and Love the PROC was published on SAS Users.

December 10, 2016

    During a recent customer visit, I was asked how to include a calculated variable within SAS Asset Performance Analytics’ (APA) Root Cause Analysis workflow. This is a simple request. Is there a simple approach to do this?

    To remind you, in the APA workflow, an ETL Administrator makes a Data Mart available in the solution for the APA users. They can select variables and explore, analyze and create a model based on the columns present in the Data Mart.

    But if you want to analyze a calculated column, like a difference between two variables, do you need to change the Data Mart? Yes, if you want the APA Data Selection to include this calculated column. But this takes time, and do you really need to change the Data Mart? No!

    A simpler and faster approach to adding a calculated column is modifying the APA Root Cause Analysis workflow. And, this is simple!

    SAS Asset Performance Analytics is easily and highly configurable. You can easily customize analytical workflows by modifying its underlying stored process. Let me show you how to customize an existing analysis and add a calculation step to enhance APA’s Root Cause Analysis with a calculated column.

    Benefits

The main purpose of this customized analysis is to avoid requiring SAS Enterprise Guide. APA users are rarely SAS experts, so asking them to switch between tools depending on which functionality is available in the APA GUI isn’t recommended. The more you can do within the APA interface via wizard-guided workflows, the easier it will be.

The second benefit is keeping the Data Mart limited to only crucial variables. Instead of asking an ETL Administrator to add non-validated and/or infrequently used calculated columns to the Data Mart, allow the APA user to test and create meaningful tags to enhance workflows as needed. Once the APA user identifies new and meaningful calculated variables, they can easily be added to the Data Mart and made available to APA Explorations and APA Stability Monitoring. Limiting the Data Mart to critical variables ensures the data size stays optimized and is adjusted only as needed.

    Root Cause Analysis Use

The use of this new Root Cause Analysis is very easy: instead of selecting “Root Cause Analysis,” select “Root Cause Analysis Calculated Columns” when required.


    Figure 1: Stored Process Folder allows us to choose the appropriate Analysis

Clicking “OK” triggers the same steps as the original RCA.


Figure 2: Steps triggered by the Root Cause Analysis

    After specifying a data selection, the “Filter Data” step contains six new user prompts:

Figure 3: New “Filter Data” interface

These prompts allow the user to define what calculation is needed. Six calculation choices are currently available, and they can be further customized as needed. The calculation types currently available are:

    1. Difference between two variables if you want to study a gap between two tags
    2. Absolute difference of two variables if large gaps are suspicious regardless of the order
    3. Ratio of two variables if you want to control the proportionality
    4. Multiplication of two variables
    5. Summation of two variables
    6. None

By default, the Calculation option is set to “None,” which performs a classical Root Cause Analysis.

Figure 4: Dropdown list with calculation details

After choosing your variables, you can apply the variable coefficients you want to test. By default, APA performs an absolute difference, Variable 1 – Variable 2, with the coefficients set to 1. If the variables don’t have the same order of magnitude, you can apply a coefficient to put the variables on the same level. By doing this, the newly created calculated variable fluctuates around 0, which is easily interpretable. The sketch below illustrates the kind of data step logic involved.
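Purely as an illustration, the underlying stored process might implement the six choices with data step logic along these lines (the prompt macro variables and the column names are hypothetical, not the shipped code):

/* &var1, &var2   : the two tags selected in the prompts (hypothetical)  */
/* &coef1, &coef2 : their coefficients, defaulting to 1                  */
/* &newcol        : the new column name, e.g. DIFF_TAGS                  */
data rca_input;
   set rca_input;
   select ("&calc_type");
      when ("DIFF")     &newcol = &coef1 * &var1 - &coef2 * &var2;
      when ("ABS_DIFF") &newcol = abs(&coef1 * &var1 - &coef2 * &var2);
      when ("RATIO")    &newcol = (&coef1 * &var1) / (&coef2 * &var2);
      when ("MULT")     &newcol = (&coef1 * &var1) * (&coef2 * &var2);
      when ("SUM")      &newcol = &coef1 * &var1 + &coef2 * &var2;
      otherwise;        /* "None": leave the data unchanged */
   end;
run;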

In the example below, the goal is to compute a simple absolute difference between Variable 2 (V.Four_Press_Clap) and Variable 1 (V.Four_Press_C). By default, the newly created column is named DIFF_TAGS. You can change it to something more descriptive for your purposes. Don’t forget that the new column name must follow standard SAS column naming conventions.

Before processing the step, check that the data is not missing during the data period you’ve selected. If it’s missing for the full time period, the DIFF_TAGS variable will not be created and the output will be the same as if you’d selected “None” for the calculation prompt.

Figure 5: DIFF_TAGS variable creation corresponding to the result of Four_Press_Clap – Four_Press_C

    Click “Run and Save Step.” As a result, the newly created calculated column is added to your input RCA data and available in the “Expand and Build Statistics” step. Now you can apply minimum and maximum value bands to the calculated column similar to the original RCA workflow.

Figure 6: Minimum and maximum values to consider as atypical

    As a result, the calculated column is used like a standard input tag during the full Root Cause Analysis workflow.

    Things to keep in mind

If you have a specific need, SAS Asset Performance Analytics is easily customizable. In France, for example, we offer an unlimited and modular APA interface.

    Creating columns within the APA user interface analysis workflow has two main benefits:

1. The calculation is intuitive for a non-SAS user
2. You don’t need to change the Data Mart

    Only two steps are required to implement Root Cause Analysis including Calculated Columns within APA:

1. Copy and paste the pam_rca_filterData_NEW.sas file in the SASHome/SASFoundation/9.4/pamsrmv/sasstp directory.
    2. Using the SAS Management Console, import the RCA_CC.spk file at the following location: /Products/SAS Asset Performance Analytics/Analytical Workbench in the Folders tab.

    If you have any questions about this process, feel free to add a comment below.

    tags: Global Technology Practice, SAS Asset Performance Analytics

    Enhancing SAS Asset Performance Analytics’ Root Cause Analysis with Calculated Columns was published on SAS Users.

December 2, 2016

Nearly every organization has to deal with big data, and that often means dealing with big data problems. For some organizations, especially government agencies, addressing these problems provides more than a competitive advantage; it helps them ensure public confidence in their work or meet standards mandated by law. In this blog I wanted to share with you how SAS worked with a government revenue collection agency to successfully manage their big data issues and seamlessly integrate with Hadoop and other technologies.

    Hadoop Security

We all know Hadoop pretty well, and if you haven’t heard of Hadoop yet, it is about time you invest some resources in learning more about this emerging de facto standard for storage and compute. The core of Apache Hadoop consists of a storage part, known as HDFS (Hadoop Distributed File System), and a processing part (MapReduce). Hadoop splits large files into large blocks and distributes them across the nodes of a cluster.

Hadoop was initially developed to solve web-scale problems like webpage search and indexing at Yahoo. However, the potential of the platform to handle big data and analytics caught the attention of a number of industries. Since the initial use of Hadoop was to count webpages and implement algorithms like PageRank, security was never considered a major requirement until Hadoop started being used by enterprises across the world.

    Security incidents and massive fines have become commonplace and financial institutions, in particular, are doing everything to avoid such incidents. Security should never be an afterthought and should be considered in the initial design of the system. The five core pillars of Enterprise Security are as follows:

• Administration
• Authentication
• Authorization
• Auditing
• Data Protection

Our customer had the first four pillars, from Administration to Auditing, covered using tools provided by their Hadoop vendor. While there are options in the open-source community that provide data protection, in this case the organization decided to use a data security company to protect data at rest on top of Cloudera Navigator Encryption. They refer to it as “Double Encryption.”

    The challenge

SAS has multiple products around the Hadoop ecosystem to provide the best support for customers. The traditional way of working with Hadoop involves SAS/ACCESS, which can involve pulling the data from Hadoop using Hive. However, for larger installations where data movement is a concern, SAS provides Embedded Process technology, which allows you to push SAS code inside a Hadoop cluster and run it alongside the data blocks. This is a super-efficient way to access large data sets inside Hadoop by pushing the compute to the data.

Our customer’s data security vendor’s product supports access via Hive UDFs, which means you can tokenize and detokenize data when working with SAS/ACCESS Interface to Hadoop using PROC SQL and other options, relatively out of the box. In addition, new SAS language functions for tokenizing and detokenizing data inside Base SAS can be created with the security company’s API together with PROC FCMP and PROC PROTO.
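For instance, here is a hedged sketch of the UDF route via explicit pass-through in PROC SQL; the connection options, the REVENUE.CUSTOMERS table and the vendor_unprotect() UDF name are hypothetical placeholders, not the vendor’s actual functions:

proc sql;
   connect to hadoop (server="hivenode01" port=10000 schema=revenue);
   /* The Hive UDF runs inside the cluster; SAS receives detokenized rows */
   create table work.clear as
   select * from connection to hadoop
      ( select id, vendor_unprotect(ssn_token) as ssn
        from customers );
   disconnect from hadoop;
quit;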

However, SAS Embedded Process has no default support for our customer’s security vendor, so the SAS products that utilize the SAS EP (SAS Code Accelerator, SAS Scoring Accelerator and LASR-based products) cannot work with data tokenized by the vendor. This was a major challenge for our customer, who wanted to use SAS products like SAS Visual Analytics and SAS Visual Statistics on large volumes of data stored in Hadoop.

The challenge, then, was to make SAS Embedded Process work with the data security vendor’s software to perform detokenization before passing the data to SAS procedures.

    The possible solutions

    We considered various solutions before agreeing on a solution that satisfies all current requirements and could be extended to meet the future needs of our customer. Let’s discuss the top two solutions and the final implementation.

Solution 1: SerDe approach

Our first approach was to create a custom Hive SerDe that wraps the data security company’s APIs. With SAS 9.4M3, the SAS Embedded Process (EP) can read and write via SerDe APIs, with some possible constraints and limitations, including DS2’s SET/MERGE capabilities and how identity credentials are passed from SAS to the company’s APIs.

Figure 1: The custom Hive SerDe approach

The approach had various drawbacks, the biggest being support for the many Hadoop file formats. It was discarded because it would have meant significant rework with every new data format released by the Hadoop community. While it is true that an organization would generally standardize on a few formats for its use cases, this is nonetheless a limiting factor.

    Solution 2: Use HDMD with Custom Input Formats

The second approach was to use HDMD with custom input formats. SAS HDMD supports custom input formats, allowing you to plug in your own. A high-level architectural diagram looks something like Figure 2. This approach works with a variety of file formats; we have tested it with Parquet, Avro and ORC with good results. The objective is to load a dataset onto Hadoop, or use an existing dataset, and generate an HDMD file for it. We plug our custom reader into the HDMD file, and as part of the custom reader we make a number of calls to the data security company’s API. The API calls the security vendor’s specific protect and unprotect procedures to protect and/or unprotect the data, depending on the requirements, and passes the results back to the client.

Figure 2: High-level architecture of the HDMD custom input format approach

    What is an Input/Custom input format BTW?

    Data inside Hadoop is typically stored on HDFS (Hadoop Distributed File System). The data needs to be read from the filesystem before being processed. This is achieved using Input Format, which has the following responsibilities.

    • Compute input splits
      • Input splits represent the part of the data that will be processed by each Map phase. A unique input split is passed to the process. At the start of a Map Reduce job, input format will split the data into multiple parts based on logical record boundaries and HDFS block size. To get the input splits, the following method is called:
        • List<InputSplit> getSplits(JobContext ctx)
    • Provide a logic to read the input split
      • Each mapper gets a unique input split to process the data. Input format provides a logic to read the split, which is an implementation of the RecordReader interface. The record reader will read the split and emit <key,value> pairs as an input for each map function. The record reader is created using the following method:
        • RecordReader<K,V> createRecordReader(InputSplit is, TaskAttemptContext ctx)

    All the common formats will provide a way to split the data and read records. However, if you want to read a custom data set for which data parsing isn’t available out of the box with Hadoop, you are better off writing a custom input format.

    How to write a Custom Input Format?

Writing a custom input format requires Java skills (the programming language in which Hadoop is written). You have the option to implement the abstract methods of the InputFormat class, or to extend one of the pre-existing input formats. In our case, we extended FileInputFormat and overrode a few critical methods:

    • getSplits()
    • getRecordReader()

getSplits() creates the splits from the input data, while getRecordReader() returns an instance of a Java object that knows how to read custom records, which in our case meant calling the security vendor’s API.

You can use one of the predefined RecordReader classes or implement your own (the more likely case if you are writing a custom input format). In our case, we implemented the RecordReader interface and its next() method, which is called whenever a new record is read. This is the method where your core business logic is implemented. In our case, we wrote the integration logic by looking at the data, identifying the logged-in user (available as part of the JobConf object), and then calling the vendor’s APIs to decrypt the data. Sample code can be requested by contacting me directly.

    Integrating a custom input format with SAS

Integrating a custom input format is fairly easy with SAS. SAS allows us to plug in custom input formats, which are called before the data is processed via the SAS Embedded Process, using HDMD files.

When you generate an HDMD file using PROC HDMD, you can specify your custom input format as part of the generated XML file. Please refer to the PROC HDMD documentation.

The generated HDMD file is an XML document that records the table’s metadata along with the custom input format class to invoke.


    When loading the data from HDFS, SAS will ensure that the specified input format is called prior to any data processing taking place.

    The ultimate solution

The solution was demonstrated using data from the tax authorities and included tokenization of data via Hive UDFs, detokenization of data according to the policies set on the data security appliance, and analytics using SAS Visual Analytics. Only users with permissions on the specific policy were able to view the detokenized data, while users without permissions saw only the protected, tokenized values. This additional security helped the enterprise protect users’ information from inadvertent access and resulted in widespread use of Big Data technologies within the Enterprise.

    Summary

    As you can see from the example above, SAS is open for business, and is already providing deep integration with Hadoop and other technologies using custom APIs. The sky is the limit for people willing to explore the capabilities of SAS.

    tags: Global Technology Practice, Hadoop, SAS/ACCESS Interface to Hadoop

    SAS integration with Hadoop - one success story was published on SAS Users.
