SAS Viya

October 22, 2019
 

PROC SQL

PROC SQL is a very powerful ANSI 92 compliant version of SQL that also allows us to leverage many unique SAS capabilities. Recently I was asked if the PROC SQL in Figure 1 could be refactored into PROC FedSQL so it could run faster by leveraging SAS Viya’s in-memory engine CAS (SAS® Cloud Analytic Services). I was struggling to find a way to refactor this into PROC FedSQL, so I reached out to the SAS Jedi (aka Mark Jordan) for help.

/* Original SQL Statements */
proc sql; create table BenchMark as
     select count(*) as ItemCount
     , sum( abs( nhits - nruns ) < 0.1*natbat )   as DIFF_10
     from sashelp.baseball;
run;

Figure 1. Original PROC SQL

In Figure 2, we can review the SAS Log of our PROC SQL code.

  • It is line 77 that we want to refactor into PROC FedSQL so we can benefit from performance improvements by running that code distributed in CAS.
  • On line 77, we use the alias DIFF_10 to create the new column name that is calculated by the two SAS functions SUM and ABS.
  • The expression on line 77 will cause SQL to return a value of 1 if the condition is true and a value of 0 if the condition is false.
  • The alias DIFF_10 will contain the summation of the value returned by the condition (i.e. 0 or 1) for all rows in our data set SASHELP.BASEBALL.

In Figure 5, we can review the results of our PROC SQL statement.

Figure 2. SAS Log of PROC SQL

PROC FedSQL

PROC FedSQL is ANSI 99 compliant without any of the unique SAS capabilities that PROC SQL contains, but PROC FedSQL is CAS enabled, which allows us to leverage SAS Viya’s distributed computing capabilities to improve run-times. Figure 3 is the refactored PROC FedSQL code that the SAS Jedi came up with.

/* PROC FedSQL code */
proc fedsql sessref=casauto; 
   create table BenchMark as
     select count(*) as ItemCount
     , sum(case 
           when (abs (nhits - nruns ) < (0.1*natbat)
                ) is true then 1 end 
          ) as DIFF_10
     from baseball;
quit;

Figure 3. CAS-enabled PROC FedSQL
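Note that the FedSQL code in Figure 3 reads the table BASEBALL from CAS rather than from the SASHELP library, so the data must already be loaded into the CAS session. The original post does not show that step; the following is a minimal sketch of one way to do it, assuming a CAS session named CASAUTO (matching the SESSREF= option in Figure 3) and the default active caslib CASUSER.

/* Sketch only (not from the original post): start a CAS session named   */
/* CASAUTO and load SASHELP.BASEBALL into it so the FedSQL query in      */
/* Figure 3 can reference the in-memory table BASEBALL.                  */
cas casauto;                       /* start the CAS session               */
caslib _all_ assign;               /* assign librefs for existing caslibs */

proc casutil sessref=casauto;
   load data=sashelp.baseball outcaslib="casuser" casout="baseball" replace;
run;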

Figure 4 contains the SAS Log of our CAS enabled PROC FedSQL.

  • Notice on line 77 we added a CASE expression inside the SUM function for our alias DIFF_10.
  • On lines 78-79, the WHEN clause returns a value of 1 when the condition is true; because there is no ELSE clause, rows where the condition is false contribute a null value, which the SUM function ignores.
  • The alias DIFF_10 will contain the summation of the values returned by the CASE expression for all rows in our CAS table BASEBALL.

In Figure 5 we can review the results of our PROC FedSQL statement.

Figure 4. SAS log of PROC FedSQL code

Figure 5. Validation that the values from the refactoring of PROC SQL into PROC FedSQL match

Conclusion

As we adopt SAS Viya, a goal is to leverage CAS to speed up the processing of routines written in PROC SQL. To accomplish this, refactor PROC SQL code into PROC FedSQL code. For PROC SQL that cannot be refactored, simply run that PROC SQL code as-is in SAS Viya’s SAS Programming Run-time Environment (SPRE).

SAS® Viya®: How to Emulate PROC SQL Using CAS-Enabled PROC FedSQL was published on SAS Users.

October 21, 2019
 

Image by rawpixel from Pixabay

When my younger son grabs a book or a toy from his older siblings without permission, his line of defense is always the same: “Sharing is caring!” Our kids' schools teach and reinforce this philosophy. Likewise, our family has rules to ensure peaceful, orderly sharing.

Similarly, many organizations value collaboration. They encourage researchers, data owners, data scientists and business analysts to share work products and ideas, and they try to facilitate that sharing among their teams. As with families, they often find it easier said than done.

A big part of my job is to meet customers and advise them on how SAS technology can help solve their business challenges. A recurring topic has been SAS Viya, the set of analytics capabilities collectively known as the SAS® Platform. I emphasize how SAS Viya seamlessly enables collaboration across diverse users and teams.

SAS Viya collaboration use case with Commitments of Traders data

How does it work in real life? Here is an example to demonstrate how SAS coders and business analysts can easily collaborate on SAS Viya. I am using a publicly available data set known as Commitments of Traders (COT) that the U.S. Commodity Futures Trading Commission (CFTC) publishes on its website (https://www.cftc.gov/MarketReports/CommitmentsofTraders/index.htm). Traders and researchers closely watch and analyze this data set for trends and price movements in the commodities market.

A SAS programmer readies the data

Figure 1: Drop-down menu

First, I need to bring the original COT file, saved in ‘.txt’ format, to my enterprise SAS environment. On SAS Viya, I have a choice of using a programmatic or graphical user interface (GUI) approach to import data and perform data wrangling/preparation. Both interfaces are easily accessible from a drop-down menu on SAS Drive, a web-based central hub for SAS Viya applications (see Figure 1). I choose the programmatic approach by clicking on Develop SAS Code from the drop-down menu.

SAS® Studio, the programming interface for SAS Viya, is a web-based development environment that includes code autocomplete, a library of frequently used code snippets, pre-built GUI wizards for numerous analytical routines, etc.

In SAS Studio, I prepare the data with five steps: 1) Import the raw text file; 2) Reduce the number of variables; 3) Compute the traders’ net position; 4) Import the mapping table; and 5) Add the commodity category variable from the mapping table. For all of these steps, I use SAS DATA Step and PROC statements that have been around for 40+ years – old school SAS.
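The program itself is not reproduced in this post. As a rough, hypothetical sketch, the first five steps might look something like the code below; the file paths and variable names (for example, long_positions and short_positions) are placeholders rather than the actual COT column names.

/* Hypothetical sketch of steps 1-5; paths and variable names are placeholders */

/* 1) Import the raw text file */
proc import datafile="/data/cot/deacot_current.txt" out=work.cot_raw
     dbms=dlm replace;
   delimiter=',';
   getnames=yes;
run;

/* 2) Keep only the variables of interest and 3) compute the traders' net position */
data work.cot_net;
   set work.cot_raw(keep=market_name report_date long_positions short_positions);
   net_position = long_positions - short_positions;
run;

/* 4) Import the mapping table */
proc import datafile="/data/cot/commodity_map.xlsx" out=work.cot_map
     dbms=xlsx replace;
run;

/* 5) Add the commodity category variable from the mapping table */
proc sql;
   create table work.cot_final as
   select a.*, b.commodity_category
   from work.cot_net a
        left join work.cot_map b
        on a.market_name = b.market_name;
quit;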

The last two steps in my program, Step 6 and Step 7, are unique to SAS Viya (see Figure 2 below). Those lines of code start the SAS Cloud Analytic Services (CAS) session and load the final curated table into CAS, an in-memory distributed analytical engine that I consider the “heart and soul” of SAS Viya. Once the data is in CAS, an authorized user of my enterprise SAS Viya environment can easily locate the data in a CAS library (caslib) for data exploration, modeling, or developing business intelligence content.

Figure 2: CAS
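The code in Figure 2 is not reproduced as text here. A minimal sketch of what steps 6 and 7 might look like follows, assuming the curated table from the previous steps is WORK.COT_FINAL and that the default Public caslib is available for sharing; the session and table names are illustrative.

/* Step 6 (sketch): start a CAS session and assign librefs for existing caslibs */
cas mysess;
caslib _all_ assign;

/* Step 7 (sketch): load the curated table into CAS. PROMOTE=YES gives the   */
/* table global scope so other authorized users in the environment can find  */
/* and use it for exploration, modeling, or reporting.                       */
data public.cot_curated(promote=yes);
   set work.cot_final;
run;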

A business analyst creates and shares an interactive dashboard

Now it's the business analyst's turn. In this example, my teammate wants to build an interactive dashboard using the curated COT data that I (the programmer) loaded into CAS. The analyst will access SAS Drive, select the Explore and Visualize Data option from the menu, and be directed to SAS Visual Analytics, a web-based application that allows you to explore data and build point-and-click, interactive visualizations with no coding skills required.

Figure 3 and Figure 4 below show examples of the types of drillable dashboards and reports you can easily develop in SAS Visual Analytics with a few clicks. You can share the report internally via a web link, view it in MS Office products, or publish it on an intranet or external-facing website.

Figure 3: COT Dashboard

 

Figure 4: Monthly Trend

Collaboration bridges diverse skill sets, fosters successful projects

Studies show that the success of any analytical project requires multidisciplinary teams that include database administrators, data scientists, analysts, subject matter experts, management and IT support. SAS Viya helps them capitalize on their strengths to promote frictionless collaboration in a secure and controlled environment.

This post, focused on collaboration between SAS coders and business analysts, only scratches the surface of SAS Viya's collaboration and knowledge sharing capabilities. Likewise, open source coders (R and Python) and business analysts can collaborate on SAS Viya too.

Free trials below give programmers and business analysts a taste of what's possible with SAS Viya. Try one and tell us about it in the comments.

SAS® Visual Analytics on SAS® Viya® | Try it as a business analyst! SAS® Visual Analytics on SAS® Viya® | Try it as a programmer!

How SAS® Viya fosters collaboration was published on SAS Users.

October 15, 2019
 

In a previous post, I discussed using logs to troubleshoot problems in your Viya environment. In this post, I will look at some additional ways to troubleshoot using some of the tools provided by the Viya Operations Infrastructure. With applications, servers and numerous micro-services all working together and generating their own logs in Viya, it can be difficult to find relevant logs. In order to manage the large number of logs and to enable you to locate messages of interest, the operations infrastructure provides components to collect and store log messages.

The collection process is illustrated in the diagram below.

Co-ordinated by the operations infrastructure:

  • sas-watch log continuously collects and sends log messages to the RabbitMQ exchange
  • sas-stream pulls the messages from RabbitMQ and writes them to disk as a tab-separated value (TSV) file
  • Every five minutes, the sas-ops-agentsrv runs the DatamartEtl task to extract log messages from the TSV file and load them into the VIYALOGS CAS-indexed search table

SAS Environment Manager uses the information in the VIYALOGS and VIYALOGS_SOURCES tables to display log messages and graphs showing message frequency and trends. The SAS Environment Manager Logs interface makes it easy to search and analyze log messages: you can view, subset and search logs. Filtering capabilities are on the left-hand side and messages are displayed on the right. By default, the filter is set to display all messages from all applications and services from the last 30 minutes.

You can modify the filter to extend or shorten the timeframe, subset the level of messages displayed or the source (service/application) that the messages are coming from. You can also search for any text within a message.

Many administrators would prefer a command-line interface, and the good news is there is one.

sas-ops is a command-line interface that allows you to monitor the operational infrastructure in a SAS Viya deployment.

I have found the sas-ops logs command very useful for troubleshooting problems. The sas-ops logs command can be used to stream log messages that are generated by SAS Viya applications and services. The messages can be streamed to a terminal window or piped to a file. The sas-ops logs command is located in /opt/sas/viya/home/bin and can be run from any machine in a Viya environment that is included in the CommandLine host group.

When would you use sas-ops logs to stream log messages? Some potential scenarios are to:

  • troubleshoot a poorly performing report or analysis
  • debug problems in the environment such as logon issues
  • monitor access to resources

In these cases, using sas-ops logs you can stream the log messages from all services to a single file or terminal.

In its simplest form, the command live streams all log messages from a Viya environment to the terminal. Pressing CTRL+C stops the streaming.

./sas-ops logs

Partial output from the stream is shown below.

If you want to save the output, you can redirect the stream to a file.

./sas-ops logs > /tmp/mylog.log

You can get more creative and achieve more complex tasks. You can change the format of the message output using --format. For example, to create a JSON file that could be read by another process, use:

./sas-ops logs --format pretty > mylogs.json

You can also:

  • stream messages for just a specific Viya service
  • filter log messages by matching text with a regular expression
  • stream for a specific duration

The duration is specified using the format 0h0m0s0ms, but you can also use individual parts of the specification, for example 30s for 30 seconds or 5m for 5 minutes.

Consider the situation where we want to monitor access to a particular CAS table over a specific period of time. The command below will output to a file all messages that contain the table name HR_SUMMARY for a period of 5 minutes.

./sas-ops logs --match HR_SUMMARY --timeout 5m > /tmp/hr_summary_access.log

The output shows all the CAS actions that were performed on the table during the time period.

You can subset the stream to one service.

Consider a case where a user is having an issue logging in and you suspect a problem with the LDAP setup. To investigate, you can first enable DEBUG logging on com.sas.identities and then stream the log messages from the identities service.

./sas-ops logs --format pretty --source identities > logonerrors.json

Viewing the output shows that there is something wrong with the LDAP query.

I think you will agree that sas-ops logs is a very useful tool for monitoring and troubleshooting issues in a Viya environment. For more information, check out the following resources:

I would like to thank Bryan Ellington for his helpful input with this post.

Capturing log messages from Viya deployments was published on SAS Users.

September 9, 2019
 

Editor's Note: This article was translated and edited by SAS USA and was originally written by Makoto Unemi. The original text is here.

SAS previously provided SAS Scripting Wrapper for Analytics Transfer (SWAT), a package for using SAS Viya functions from various general-purpose programming languages such as Python.

In addition to SWAT, SAS launched Deep Learning Python (DLPy), a higher-level API package for Python, making it possible to use SAS Viya functions more efficiently from Python. In this article, I outline what DLPy is and how to implement it.

About DLPy

DLPy is a high-level Python API package created for the deep learning and image action sets available in SAS Viya 3.3 and later. DLPy provides an API similar to Keras to make deep learning and image processing code more efficient to write. With just a little rewriting of existing Keras code, it is possible to execute the processing on SAS Viya.

For example, below is an example of a Convolutional Neural Network (CNN) layer definition; you can see that it is very similar to Keras.

The layers supported by DLPy are: InputLayer, Conv2d, Pooling, Dense, Recurrent, BN, Res, Proj, and OutputLayer. The following is an example of training.

DLPy functions

Here I introduce some of DLPy's functions (a partial excerpt), using as an example a CNN trained on dolphin and giraffe images and then applied to test images.

Implementation of major deep learning networks

DLPy offers the following pre-built deep learning models: VGG11/13/16/19, ResNet34/50/101/152, wide_resnet, and dense_net.

The following models also offer pre-trained weights using ImageNet data (these weights can be used for your own tasks through transfer learning): VGG16, VGG19, ResNet50, ResNet101, and ResNet152. The following is an example of transferring ResNet50 pre-trained weights.

Visualizing the basis for CNN judgments

Using the heat_map_analysis() method, you can output a color heat map and check which regions of the image the model focused on.

In addition, the get_feature_maps() method retrieves the feature maps of each CNN layer, and the feature_maps.display() method lets you specify a layer and display its feature maps for inspection.

The following is the output for the layer 1 feature maps.

The following is the output for the layer 18 feature maps.

Support functions for deep learning and image processing tasks

  • resize(): resizes image data
  • as_patches(): expands image data by generating patches from the original images
  • two_way_split(): splits data into training and test sets
  • plot_network(): draws the structure of the defined deep learning layers (network) as a graphical diagram
  • plot_training_history(): displays the iterative training history
  • predict(): displays prediction (scoring) results
  • plot_predict_res(): displays classification results

And of course, you can use DLPy to get data from a SAS Viya in-memory session, pass it to your local client, and convert it to common data formats like numpy arrays and Pandas DataFrames. The converted data can be smoothly supplied to models of other open source packages such as scikit-learn.

Regarding image classification using DLPy, videos are also available in the Deep Learning with Python (DLPy) Demo Series section of the DLPy product page.

SAS Viya: Package for Python API for deep learning and image processing: DLPy was published on SAS Users.

September 4, 2019
 

Editor’s note: This article is a continuation of the series by Conor Hogan, a Solutions Architect at SAS, on SAS and database and storage options on cloud technologies. Access all the articles in the series here.

In a previous article in this series, Accessing Databases in the Cloud – SAS Data Connectors and Amazon Web Services, I covered SAS and database as a service (DBaaS) and storage offerings from Amazon Web Services (AWS). Today, I cover the various storage options available on AWS and how to connect to and interact with them from SAS.

Object Storage

Amazon Simple Storage Service (S3) is low-cost, scalable cloud object storage for any type of data in its native format. Individual Amazon S3 objects can range in size from 1 byte all the way to 5 terabytes (TB). Amazon S3 organizes these objects into buckets, and each bucket name is globally unique. You access a bucket directly through an API from anywhere in the world, if granted permissions; by default, a bucket grants least access. Amazon advertises 11 nines, or 99.999999999%, of durability, meaning data loss is extremely unlikely. Data replicates automatically across availability zones to achieve this durability. You can reduce the number of replicas or use one of the various tiers of archive services to reduce your object storage cost. Costs are calculated based on terabytes of storage per month, with added costs for requests and transfers of data.

SAS and S3

Support for Amazon Web Services S3 as a Caslib data source for SAS Cloud Analytic Services (CAS) was added in SAS Viya 3.4. This data source enables you to access SASHDAT files and CSV files in S3. You can use the CASLIB statement or the table.addCaslib action to add a Caslib for S3. SAS is currently exploring native object storage integration with AWS S3 for more file types. For other file types you can copy the data from S3 and then use a SAS Data Connector to load the data into memory. For example, if I had Excel data in S3, I could use PROC S3 to copy the data locally and then load the data into CAS using the SAS Data Connector to PC Files.
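As a rough illustration (not from the original article), defining an S3 caslib and loading a SASHDAT file might look like the sketch below. The bucket, credentials, region, and file names are placeholders, and the exact data source options should be verified against the CASLIB documentation for your SAS Viya release.

/* Hypothetical sketch: define an S3 caslib, then load a SASHDAT file into CAS */
caslib s3lib datasource=(srctype="s3",
                         accessKeyId="your-access-key-id",
                         secretAccessKey="your-secret-access-key",
                         region="US_East",
                         bucket="my-company-bucket",
                         objectPath="/sas-data/");

proc casutil;
   load casdata="sales.sashdat" incaslib="s3lib"
        outcaslib="casuser" casout="sales" replace;
run;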

Block Storage

Amazon Elastic Block Store (EBS) is the block storage service designed for use with Amazon Elastic Compute Cloud (EC2). The storage is accessible only when it is attached to an operating system, and storage volumes can be treated as independent disk drives controlled by a server operating system. You mount an EBS volume to an operating system as if it were a physical disk. EBS volumes are valuable because they are the storage that persists when you terminate your compute instance. You can choose from four different volume types that supply performance levels at corresponding costs.

SAS and EBS

EBS is used as the permanent SAS data storage and persists through a restart of your SAS environment. The choices you make when selecting from the different EBS volume types have a direct impact on the performance you get from SAS. One thing to consider is using compute instances that have enhanced EBS performance or dedicated solid-state drive instance storage. For example, the SAS Viya on AWS QuickStart uses Storage Optimized and Memory Optimized compute instances with local NVMe-based SSDs that are physically connected to the host server and whose lifetime is coupled to that of the instance. This is beneficial for performance.

SAS Cloud Analytic Services (CAS) is an in-memory server that relies on the CAS Disk Cache as its virtual memory storage backend. This is especially important if you are reading data from a database. In that case, make sure you have enough block storage, in the form of EBS volumes, for use as the CAS Disk Cache.

File Storage

Amazon Elastic File System (EFS) provides access to data through a shared file system. EFS is an elastic network file system that grows and shrinks as you add or remove files, so you only pay for the storage you consume. Users create, delete, modify, read, and write files organized logically in a directory structure for intuitive access. This allows simultaneous access for multiple users to a common set of file data managed with user and group permissions. Amazon FSx for Lustre is the high-performance file system service.

SAS and EFS

EFS shared file system storage can be a powerful tool if utilizing a SAS Grid architecture. If you have a requirement in your SAS architecture for a shared location that any node in a group can access and write to, then EFS could meet your requirement. To access the data stored in your network file system you will have to mount the EFS file system. You can mount your Amazon EFS file systems to any EC2 instance, or any on-premises server connected to your Amazon VPC.

BONUS: Serverless

Amazon Athena is a query service for Amazon S3. This service makes it easy to submit queries against the objects stored in S3. You can run analysis on this data using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries you run. Amazon Athena uses Presto with ANSI SQL support and works with a variety of standard data formats, including CSV, JSON, ORC, Avro, and Parquet.

SAS and Athena

Amazon Athena is ODBC/JDBC compliant which means I can use SAS/ACCESS Interface to ODBC or SAS/ACCESS Interface to JDBC to connect using SAS. Download an Amazon Athena ODBC driver and submit code from SAS just like you would any ODBC data source. Athena is a great tool if you want to use the serverless computing power of Amazon to query data in S3.
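For illustration, a connection through a preconfigured ODBC data source name (here assumed to be AthenaDSN, set up with the Amazon Athena ODBC driver and your AWS credentials) might look like this sketch; the DSN and table names are placeholders.

/* Hypothetical sketch: query Athena through an ODBC DSN configured on the SAS host */
libname athena odbc datasrc="AthenaDSN";

proc sql;
   select count(*) as row_count
   from athena.my_s3_table;   /* an Athena table defined over objects in S3 */
quit;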

Finally

Many times, we do not have a choice of technologies we use and infrastructures on which they sit. Luckily, if you use AWS, integration with SAS is not a concern. I’ve now covered databases and storage for AWS. In future articles, I’ll cover the same topics for Microsoft Azure and Google Cloud Platform.

Additional Resources

Storage in the Cloud – SAS and Amazon Web Services was published on SAS Users.

August 28, 2019
 

This article is not a tutorial on Hadoop, Spark, or big data. At the same time, no prerequisite knowledge of these technologies is required for understanding. We’ll give you enough background prior to diving into the details. In simplest terms, the Hadoop framework maintains the data and Spark controls and directs data processing. As an analogy, think of Hadoop as a train, big data as the payload, and Spark as the crew driving the train and organizing and distributing the goods.

Big data

I recently read that data volumes are doubling each year. Not that long ago we talked in terms of gigabytes. This quickly turned into terabytes and we’re now in the age of petabytes. The type of data is also changing. Data used to fit neatly into rows and columns. Now, nearly eighty percent of data is unstructured. All these trends and facts have led us to deal with massive amounts of data, aka big data. Maintaining and processing big data required creating technical frameworks. Next, we’ll investigate a couple of these tools.

Hadoop

Hadoop is a technology stack utilizing parallel processing on a distributed filesystem. Hadoop is useful to companies when data sets become so large or complex that their current solutions cannot effectively process the information in a reasonable amount of time. As the data science field has matured over the past few years, so has the need for a different approach to processing data.

Apache Spark

Apache Spark is a cluster-computing framework utilizing both iterative algorithms and interactive/exploratory data analysis. The goal of Spark is to keep the benefits of Hadoop’s scalable, distributed, fault-tolerant processing framework, while making it more efficient and easier to use. Using in-memory distributed computing, Spark provides capabilities over and above the batch model of Hadoop MapReduce. As a result, this brings to the big data world new applications of data science that were previously too expensive or slow on massive data sets.

Now let’s explore how SAS integrates with these technologies to maximize capturing, managing, and analyzing big data.

SAS capabilities to leverage Spark

SAS provides Hadoop data processing and data scoring capabilities using SAS/ACCESS Interface to Hadoop and In-Database Technologies to Hadoop, with MapReduce or Spark as the processing framework. This addresses traditional batch data management of huge volumes of extract, transform, load (ETL) data, as well as faster, interactive, in-memory processing for quicker response with Spark.

In SAS Viya, SAS/ACCESS Interface to Hadoop includes SAS Data Connector to Hadoop. All users with SAS/ACCESS Interface to Hadoop can use the serial data transfer provided by the data connector. Likewise, SAS Data Connect Accelerator for Hadoop can load or save data in parallel between Hadoop and SAS using the SAS Embedded Process, as a Hive/MapReduce or Spark job.

Connecting to Spark in a Hadoop Cluster

There are two ways to connect to a Hadoop cluster using SAS/ACCESS Interface to Hadoop, based on the SAS environment: LIBNAME and CASLIB statements.

LIBNAME statement to connect to Spark from MVA SAS

options set=SAS_HADOOP_JAR_PATH="/third_party/Hadoop/jars/lib:/third_party/Hadoop/jars/lib/spark"; 
options set=SAS_HADOOP_CONFIG_PATH="/third_party/Hadoop/conf"; 
 
libname hdplib hadoop server="hadoop.server.com" port=10000 user="hive"
schema='default' properties="hive.execution.engine=SPARK";

Parameters

  • SAS_HADOOP_JAR_PATH: directory path for the Hadoop and Spark JAR files
  • SAS_HADOOP_CONFIG_PATH: directory path for the Hadoop cluster configuration files
  • Libref: the hdplib libref specifies the location where SAS will find the data
  • SAS/ACCESS engine name: HADOOP, the option that connects to the Hadoop engine
  • SERVER: the Hadoop Hive server to connect to
  • PORT: the listening Hive server port. 10000 is the default, so it is not required; it is included here just in case
  • USER and PASSWORD: not always required
  • SCHEMA: the Hive schema to access. It is optional; by default, the connection uses the “default” schema
  • PROPERTIES: Hadoop properties. Choosing SPARK for the property hive.execution.engine enables SAS Viya to use Spark as the execution platform

 
CASLIB statement to connect from CAS

caslib splib sessref=mysession datasource=(srctype="hadoop", dataTransferMode="auto", username="hive", server="hadoop.server.com",
   hadoopjarpath="/opt/sas/viya/config/data/hadoop/lib:/opt/sas/viya/config/data/hadoop/lib/spark",
   hadoopconfigdir="/opt/sas/viya/config/data/hadoop/conf",
   schema="default",
   platform="spark",
   dfdebug="EPALL",
   properties="hive.execution.engine=SPARK");

Parameters

  • CASLIB: placeholder for the specified data access. The splib CAS library specifies the Hadoop data source
  • SESSREF: holds the CAS library in a specific CAS session. mysession is the current active CAS session
  • SRCTYPE: type of data source
  • DATATRANSFERMODE: type of data movement between CAS and Hadoop. Accepts one of three values: serial, parallel, or auto. When AUTO is specified, CAS chooses the type of data transfer based on the available license in the system. If SAS Data Connect Accelerator for Hadoop is licensed, parallel data transfer is used; otherwise serial transfer is used
  • HADOOPJARPATH: Hadoop and Spark JAR file location path on the CAS cluster
  • HADOOPCONFIGDIR: Hadoop configuration file location path on the CAS cluster
  • PLATFORM: type of Hadoop platform used to execute the job or transfer data using the SAS Embedded Process. The default value is "mapred" for Hive MapReduce. When "spark" is used, data transfer and jobs execute as Spark jobs
  • DFDEBUG: used to receive additional information in the SAS log about SAS Embedded Process data transfers
  • PROPERTIES: Hadoop properties. Choosing SPARK for the property hive.execution.engine enables SAS Viya to use Spark as the execution platform

 

Data Access using Spark

SAS Data Connect Accelerator for Hadoop with the Spark platform option uses Hive as the query engine to access Spark data. Data movement happens between Spark and CAS through SAS generated Scala code. This approach is useful when data already exists in Spark and either needs to be used for SAS analytics processing or moved to CAS for massively parallel data and analytics processing.

Loading Data from Hadoop to CAS using Spark

Processing data in CAS offers advanced data preparation, visualization, modeling and model pipelines, and finally model deployment. Model deployment can be performed using available CAS modules or pushed back to Spark if the data is already in Hadoop.

Load data from Hadoop to CAS using Spark

proc casutil 
      incaslib=splib
      outcaslib=casuser;
      load casdata="gas"
      casout="gas"
      replace;
run;

Parameters

  • PROC CASUTIL: procedure used to run CAS actions that process data
  • INCASLIB: input CAS library to read data from
  • OUTCASLIB: output CAS library to write data to
  • CASDATA: table to load into the CAS in-memory server
  • CASOUT: output CAS table name

 

We can look at the status of the data load job using Hadoop's resource management and job scheduling application, YARN. YARN is responsible for allocating system resources to the various applications running in a Hadoop cluster and scheduling tasks to be executed on different cluster nodes.

Loading Data from Hadoop to Viya CAS using Spark

In the figure above, the YARN application executed the data load as a Spark job. This was possible because the CASLIB statement specified the PLATFORM="spark" option. The data movement direction, in this case Hadoop to CAS, is reflected in the Spark job name, “SAS CAS/DC Input,” where “Input” means data loaded into CAS.

Saving Data from CAS to Hadoop using Spark

You can save data back to Hadoop from CAS at many stages of the analytic life cycle. For example, use data in CAS to prepare, blend, visualize, and model. Once the data meets the business use case, data can be saved in parallel to Hadoop using Spark jobs to share with other parts of the organization.

Using the SAVE CAS action to move data to Hadoop using Spark

proc cas;
session mysession; 
      table.save /
      caslib="splib"
      table={caslib="casuser", name="gas"},
      name="gas.sashdat"
      replace=True;
quit;

Parameters

  • PROC CAS: used to execute CAS action sets and actions that process data. Here “table” is the action set and “save” is the action
  • TABLE: location and name of the source table
  • NAME: name of the target table saved to the Hadoop library using Spark

 

We can verify the status of saving data from CAS to Hadoop using the YARN application. Data from CAS is saved as a Hadoop table using Spark as the execution platform. Furthermore, because SAS Data Connect Accelerator for Hadoop transfers data in parallel, individual Spark executors on each of the Spark executor nodes handle the data for their specific Hadoop cluster node.

Saving Data from Viya CAS to Hadoop using Spark

Finally, the save operation executed as a Spark job. As we can see from YARN, the Spark job named “SAS CAS/DC Output” indicates that the data moves from CAS to Hadoop.

Where we are; where we're going

We have so far traveled across the Spark pond to set up SAS libraries for Spark and to load and save data between Hadoop and CAS using Spark. In the next article we’ll look at ways to score data and execute SAS code inside Hadoop using Spark.

Data and Analytics Innovation using SAS & Spark - part 1 was published on SAS Users.

August 22, 2019
 

Editor’s note: This is the first article in a series by Conor Hogan, a Solutions Architect at SAS, on SAS and database and storage options on cloud technologies. This article covers the SAS offerings available to connect to and interact with the various database options available in Amazon Web Services.

As companies move their computing to the cloud, they are also moving their storage to the cloud. Just like compute in the cloud, data storage in the cloud is elastic and responds to demand while you pay only for what you use. As more technologies move to a cloud-based architecture, companies must consider questions like: Where is my data going to be stored? Do I want a hybrid solution? What cloud storage options do I have? What storage solution best meets my business needs? Another question requiring an answer is: Is the software I use cloud-ready? In the case of SAS, the answer is YES! SAS offers various cloud deployment patterns on various cloud providers and supports integration with cloud storage services.

This is part one in a series covering database as a service (DBaaS) and storage offerings from Amazon Web Services (AWS). Microsoft Azure and Google Cloud Platform will be covered in future articles. The goal is to supply a breakdown of these services to help you better understand how these offerings meet business requirements and how they relate to SAS. I will focus primarily on SAS Data Connectors as part of SAS Viya, but all the same functionality is available using a SAS/ACCESS Interface in SAS 9.4. SAS In-Database technologies in SAS Viya are called SAS Data Connect Accelerators and are synonymous with the SAS Embedded Process.

SAS integration with AWS

SAS has extended SAS Data Connectors and SAS In-Database Technologies support to Amazon Web Services database variants. A database running in AWS is much like your on-premises database, except Amazon manages the software and hardware. Amazon’s DBaaS offerings take care of the scalability and high availability of the database with minimal user input. SAS integrates with your cloud database even if SAS is running on-premises or with a different cloud provider.

AWS databases

Amazon offers database service technologies familiar to users. It is important to understand the new terminology and how the different database services best meet the demands of your specific application. Many common databases already in use are being refactored and provided as service offerings to customers in AWS. The advantages for customers are clear: no hardware to manage and no software to install. Databases that scale automatically to meet demand and software that updates and creates backups automatically mean customers can spend more time creating value from their data and less time managing their infrastructure.

For the rest of this article I cover various database management systems, the AWS offering for each database type, and SAS integration. First let's consider the diagram below depicting a decision flow chart to determine integration points between AWS database services and SAS.

Integration points between AWS database services and SAS

Trace your path in the diagram and read on to learn more about connection details.

Relational Database Management System (RDBMS)

In the simplest possible terms, an RDBMS is a collection of managed tables with rows and columns. You can divide relational databases into two functional groups: online transaction processing (OLTP) and online analytical processing (OLAP). These two methods serve two distinct purposes and are optimized depending on how you plan to use the data in the database.

Transactional Databases (OLTP)

Transactional databases are good at processing reads, inserts, updates and deletes. These queries usually have minimal complexity but come in large volumes. Transactional databases are not optimized for business intelligence or reporting. Data processing typically involves gathering input information, processing the data and updating existing data to reflect the collected and processed information. Transactional databases prevent two users from modifying the same data at the same time. Examples include order entry, retail sales, and financial transaction systems. Amazon offers several types of transactional database services. You can organize Amazon Relational Database Service (RDS) into three categories: enterprise licenses, open source, and cloud native.

Enterprise License

Many customers already have workloads built around enterprise databases. Amazon provides a turn-key enterprise solution for customers not looking to break their relationship with enterprise vendors or refactor their existing workflows. AWS offers Oracle and Microsoft SQL Server as turn-key enterprise solutions in RDS. Both offerings include the required software license; however, Oracle also allows you to “Bring Your Own License” (BYOL). SAS has extended SAS Data Connector support for both cloud variants. You can use your existing license for SAS Data Connector to Oracle or SAS Data Connector to Microsoft SQL Server to interact with these RDS databases.
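As an illustrative sketch (the endpoint, service name, credentials, and schema below are placeholders, and SAS/ACCESS Interface to Oracle with the Oracle client is assumed to be configured), a caslib using SAS Data Connector to Oracle against an RDS instance might look like this:

/* Hypothetical sketch: caslib using SAS Data Connector to Oracle against Oracle on RDS */
caslib oralib datasource=(srctype="oracle",
                          path="//my-rds-endpoint.us-east-1.rds.amazonaws.com:1521/ORCL",
                          username="myuser",
                          password="mypassword",
                          schema="MYSCHEMA");

proc casutil;
   list tables incaslib="oralib";   /* browse the tables visible through the connector */
run;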

Remember you can install and manage your own database on a virtual machine if there is not an available database as a service offering. The traditional backup and update responsibilities are left to the customer in this case. For example, both SAS Data Connector to Teradata and SAS Data Connect Accelerator for Teradata are supported for Teradata installed on AWS.

Open Source

Amazon provides service offerings for common open source databases like MySQL, MariaDB, and PostgreSQL. SAS has extended SAS Data Connector support for all of these cloud variants. You can use your existing license for SAS Data Connector to MySQL to connect to either RDS MySQL or RDS MariaDB, and SAS Data Connector to PostgreSQL to interface with RDS PostgreSQL.

Cloud Native

Amazon Aurora is a MySQL and PostgreSQL-compatible relational database built for the cloud, combining the performance and availability of traditional enterprise databases with the simplicity and cost-effectiveness of open source databases. SAS has extended SAS Data Connector support for Amazon Aurora. You can use your existing license for SAS Data Connector to MySQL to connect to Aurora MySQL, or SAS Data Connector to PostgreSQL to interface with Aurora PostgreSQL.

Analytical Databases (OLAP)

Analytical databases are optimized for read performance. These databases work best for complex queries in smaller volumes. When working with an analytical database, you are typically analyzing multidimensional data interactively from multiple perspectives. Redshift is the analytical database service offered by Amazon. SAS has a dedicated product called SAS Data Connector to Amazon Redshift that was purpose built for analytics workloads running in the Amazon cloud.
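A connection sketch follows, with a placeholder cluster endpoint, database, and credentials; check the SAS Data Connector to Amazon Redshift documentation for the complete option list.

/* Hypothetical sketch: caslib using SAS Data Connector to Amazon Redshift */
caslib rslib datasource=(srctype="redshift",
                         server="mycluster.abc123.us-east-1.redshift.amazonaws.com",
                         port=5439,
                         database="dev",
                         schema="public",
                         username="myuser",
                         password="mypassword");

proc casutil;
   load casdata="sales_history" incaslib="rslib"
        outcaslib="casuser" casout="sales_history";
run;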

NoSQL Databases

A non-relational or NoSQL database is any database not conforming to the relational database model. These databases are more easily scalable to a cluster of machines. NoSQL databases are a more natural fit for the cloud because the loose dependencies make the data easier to distribute and scale. The different NoSQL databases are designed to solve a specific business problem. Some of the most common data structures are key-value, column, document, and graph databases. Here is a brief overview of the most common data structures.

Key-Value Database

A key-value database stores data as a collection of key-value pairs. The key acts as a unique identifier for each record. Amazon’s key-value database as a service is DynamoDB. SAS interacts with DynamoDB using industry standard ODBC or JDBC drivers.

Columnar Database

Data in a traditional relational database is sorted by rows. Columnar databases, by contrast, organize data by column, which saves valuable time and network I/O for analytical queries. Redshift is the columnar database service offered by Amazon. SAS has a dedicated product called SAS Data Connector to Amazon Redshift that was purpose built for this database.

Document Database

A document database queries data in documents typically stored in JSON format. DocumentDB is the document database service offering from Amazon. SAS interacts with DocumentDB using industry standard ODBC or JDBC drivers. DocumentDB is MongoDB-compatible, which means existing MongoDB drivers and tools work with DocumentDB. Currently SAS is building out functionality to support SAS Data Connector to MongoDB, and you should expect that to expand further into DocumentDB as well.

Graph Database

Amazon Neptune is the graph database service designed to work with a complex hierarchy of interconnected data. The design of these types of databases optimizes querying relationships in data and reduces the number of table joins. SAS interacts with Amazon Neptune using industry standard ODBC or JDBC drivers.

Hadoop

The traditional deployment of Hadoop is changing dramatically with the cloud. Traditional Hadoop vendors may have a tough time keeping up with the service offerings available in the cloud. Hadoop still offers reliable replicated storage across nodes and powerful parallel processing of large jobs without much data movement. Amazon offers Elastic MapReduce (EMR) as its Hadoop-as-a-service offering. Amazon EMR supports both SAS Data Connector to Hadoop and SAS Data Connect Accelerator for Hadoop.

Finally

It is important to think about the use case for your database and the type of data that you plan to store before you select an AWS database service. Understanding your workloads is critical to getting the right performance and cost. When dealing with cloud databases, always remember that you are charged not only for the storage you use but also for the data you move out of the database. Analysis and reporting on your data may require data transfer. Be aware of these costs and think about how you can lower them by caching frequently accessed data somewhere or keeping it on-premises.

Additional Resources

  1. Support for Databases in SAS® Viya® 3.4
  2. Support for Cloud and Database Variants in SAS® 9.4

Accessing Databases in the Cloud – SAS Data Connectors and Amazon Web Services was published on SAS Users.

July 19, 2019
 

When a new Moon passes between the Earth and the Sun, the Moon can cast a shadow on certain regions of the Earth. This natural phenomenon creates a solar eclipse, meaning the Moon covers, or eclipses, your view of the Sun if you're in that region. No surprise that in [...]

Ring of fire: Visualizing 5,000 years of solar eclipses was published on SAS Voices by Falko Schulz

June 26, 2019
 

In his article How to use CASL to develop and work with user-defined CAS actions, Brian Kinnebrew defines CASL as "a language specification used by the SAS client to interact with and provide easy access to Cloud Analytic Services (CAS). CASL is a statement-based scripting language with many uses and strengths." I can't come up with a better definition, so if you'd like to learn more about the basics of CASL, I encourage you to read Brian's post.

In a SAS Stored Process (or any traditional SAS program) the code has multiple DATA steps and procedures with a good dose of macros.

Over the last couple of years, the focus of my projects has been the use of CAS actions with no traditional procedures in the mix. Most of these projects involved web applications calling multiple CAS actions. My initial approach was to make multiple http calls - one per action. This could get tedious.

Then I met my action hero: sccasl.runCasl.

My favorite SAS CASL action

This action executes a CASL "script" on the CAS server analogous to executing a SAS Stored Process. Running a CASL program with a mix of CAS actions and CASL statements on the CAS server has these benefits:

  1. Reduced the number of http calls to the server
  2. The client-side code is much easier to reason about
  3. The returned values can be a dictionary that is suited for further consumption by the client, simplifying the client code

My personal name for these scripts is "CAS stored process".

Where art thou Macro?

In many applications, user input is passed to the code running on the server. In a SAS Stored Process, macros pass the parameters. CASL has no macros. My initial approaches to passing parameters to CASL programs were:

  1. Generate the final CASL program in JavaScript with the user input values inserted into the code. Sample code is available here.
    • Drawback: Debugging the code in SAS Studio requires a cut-and-paste of the generated code into SAS Studio.
  2. Load the data into a CAS table using table.upload action for programs needing an input table. Sample code is available here.
    • Drawback: This requires an additional http call to the server.

In developing the GraphQL approach to writing applications - GraphQL and SAS Viya applications - a good match - I addressed the two drawbacks listed above by creating two functions and putting them in my utility belt.

    Superfriend functions

  • jsonToDict.js - generates a string containing the CASL dictionary version of a JavaScript object
  • argsToTable - creates a CAS table from a dictionary

The remainder of this article discusses these two functions. To demonstrate the functions' usage, I use code listings from the scoring example, covered in the GraphQL example.

The jsonToDict function

This function produces a string containing a CASL dictionary suitable for inclusion in CASL code.

Function definition

jsonToDict ⇒ string

Returns: a string containing the CASL dictionary

Parameters:

  • obj (object): the JavaScript object of interest
  • name (string): the name to assign to the dictionary

 

Example function code

The code below outlines the jsonToDict function usage.

let obj = {x:1, b:2, c:['a', 'b']};
let r = jsonToDict(obj, '_appEnv_');
// The result is:
// r = `_appEnv_ = {x=1, b=2, c={"a", "b"}}`;

The following lists the input parameters passed to the CASL code for scoring.

let input = {
        JOB    : 'J1',
        CLAGE  : 100, 
        CLNO   : 20, 
        DEBTINC: 20, 
        DELINQ : 2, 
        DEROG  : 0, 
        MORTDUE: 4000, 
        NINQ   : 1,
        YOJ    : 10,
        LOAN: 1000,
        ASSET: 100000
    };

Below is the JavaScript code and the result.

let _args_ = jsonToDict(input, '_args_');
// results in:
// _args_ = `_args_ = {  JOB= "J1" ,CLAGE=100  ,CLNO=20  ,DEBTINC=20  ,DELINQ=2  ,DEROG=0  ,
// MORTDUE=4000  ,NINQ=1  ,YOJ=10  ,LOAN=1000  ,ASSET=100000  };`;

To allow the use of different versions of the model, the name of the scoring model is passed in as a parameter. The JavaScript for the model follows.

let env = {
     astore: {
            caslib: 'Public',
            name  : 'GRADIENT_BOOSTING___BAD_2'
        }
};
let _appEnv_ = jsonToDict(env, '_appEnv_');
// resulting in:
// _appEnv_ = `_appEnv_ = { astore = { caslib="Public", name="GRADIENT_BOOSTING___BAD_2"}};`;

Next, I prepend the strings to the CASL program in the client code with the following code snippets.

let code = _args_ + _appEnv_ + `<the CASL code shown below>`;
loadactionset "astore";
 
/* convert arguments to a cas table */
argsToTable(_args_, 'casuser', 'INPUTDATA' );
 
/* score */
action astore.score r=rc/
    table  = { caslib= 'casuser', name = 'INPUTDATA' } 
    rstore = { caslib= _appEnv_.astore.caslib,  name=_appEnv_.astore.name }
    casout  = { caslib = 'casuser', name = 'OUTPUTDATA', replace= TRUE};
 
/* fetch results */
action table.fetch r = result /
    table = { caslib = 'casuser', name = 'OUTPUTDATA' } ;
 
/* extract the score and send it as a dictionary */
score = result.Fetch[1].P_BAD;
send_response({score = score});

Now the CASL program can access the incoming information using the two dictionaries _args_ and _appEnv_. Note: As a personal choice, I use a convention of _args_ for user input and _appEnv_ for application specific information. I use the restaf application framework to make the http calls as shown below.

let payload = {
        action: 'sccasl.runCasl',
        data  : { code: code}
    }
 let result = await store.runAction(session, payload); 
let score = result.items('results', 'score');

CASL coding - easier than saying Kilp-ill-skim

Notice the absence of string substitutions that look strange. Just simple, straight forward coding. Easier than saying your name backwards, forcing you back to the fifth dimension.

The argsToTable.casl function

The sample code above used the function argsToTable. As you may guess, it converts the _args_ dictionary into a CAS table used in the scoring action. argsToTable is the CASL function that handles this task.

Function definition

argsToTable ⇒ loads a dictionary into a CAS table

Parameters:

  • input (dictionary): the data to load
  • caslib (string): the caslib of the output table
  • name (string): the name of the output table

 

The relevant CASL code from Step 1 is reproduced here:

argsToTable(_args_, 'casuser', 'INPUTDATA' );

The argsToTable function is either stored on a server or prepended to the CASL code sent to the runCasl action. This function removes the need for an additional http call to load the data into the CAS table.

Returning data from CAS - the send_response function

The ultimate sidekick function

Any good super hero has a sidekick. The function send_response in CASL is very versatile - it allows one to return data in the form the application needs and allows more than one result. In many programs I return data in a form easily consumed by the client code.

For example, if you wanted to return just the rows of a table, you can do the following:

function resultsToDict(r);
   casResults = {};
   i = 1;
   do row over r;
      casResults[i] = row;
      i = i + 1;
   end;
   return casResults;
end;
 
/* and use it as follows: */
 
action table.fetch r = result /
   table = { caslib = 'casuser', name = 'mydata' };
/* extract the data and return it as a dictionary */
casResults = resultsToDict(result.Fetch);
send_response({casResults = casResults});

Finally

Using a combination of CASL, a couple of utility functions and the runCasl action, you can develop some very efficient programs with minimal traffic between your client and the server. If you run multiple actions in sequence, you should consider grouping them into a CASL program and executing them on the CAS server using the runCasl action.

Next

In my next article, which I hope to finish in a flash, I will discuss using the runCasl action to create a browser for CAS tables with support for pagination.

All comments are welcome. Please feel free to clone the code, make it better.

Cheers...
Deva

Let runCasl be your BFF and favorite action hero

"CAS Stored Process" with my Favorite Action Hero runCasl was published on SAS Users.