Tech

September 13, 2018
 

With the release of SAS Viya 3.4, you can easily build large-scale machine learning models and seamlessly publish and run models to Hadoop, or other external databases such as Teradata, without the data ever leaving the Hadoop environment. In this process, SAS Viya:

1) Converts the model into MapReduce code.

2) Executes the MapReduce code.

3) Returns a new, scored dataset in Hadoop.

SAS Viya is a new, distributed in-memory product that allows users to easily build predictive models at scale. Using the SAS Model Studio interface, I can build complex models without the need to write large amounts of underlying code.

For this blog post, I'll go through the steps to build my model using a telecommunications dataset to predict customer churn. Under the “Data” tab, I can see all of my variables, assign the proper roles, and view the dataset.

 

 

 

 

 

 

 

 

 

 

 

 

 

With the data prepared, I build a pipeline to perform data preprocessing steps such as imputation and binning and build several predictive models, including Regression, Neural Networks and Gradient Boosting. Pipelines are powerful because they automate the heavy lifting of the model building process, allowing you to solve problems faster. In addition, pipelines are re-usable across different users and datasets, allowing the adoption of best practices across an organization.

 

After building the models, I combine the models into one ensemble model with ease, and compare their performance on the validation sample. I determine that the gradient boosting model is the most accurate based on the misclassification rate. You can pick from a large number of accuracy criteria, including KS Statistic, AUC, MCR or F1.

 

Having identified the best model, I publish it to Hadoop. This lets me perform future scoring at the data source, so the data does not have to leave Hadoop. I could have configured the system to publish the model directly from SAS Model Studio; however, I publish and score the model via SAS code for maximum flexibility. With SAS Studio, I can easily control and change where I write my resulting models and datasets in Hadoop.

In the “Compare Models” tab, I then download the score code, which provides me with the following:

  1. A .sas file containing DS2 code that performs all the data preprocessing steps in the pipeline above, such as binning and imputation. I load this .sas file into a location that can be viewed in SAS Studio.
  2. A .sashdat file in the “Models” Caslib. This is a binary representation of the model, called an ASTORE, that is used to score the model in Hadoop.

 

When I open the dm_epscore.sas file in SAS Studio, the comments at the top tell me which ASTORE file is needed to publish the model.

 

 

 

This scoring file allows the data preparation within the pipeline above to be published to Hadoop as well. In this case, the file bins the variables that feed the Gradient Boosting model.

 

 

 

 

The scoring file then invokes the ASTORE file needed to score the model in Hadoop.

 

 

 

Now, I switch to SAS Studio to publish and score my model in Hadoop.  The full code can be found here.

Below is the syntax to publish the model. I make sure to set the classpath variable to the appropriate jar and config files for my Hadoop cluster. Note that you will need permission to read from and write to the modeldir directory. Publishing the model converts the .sas scoring code and the ASTORE file into MapReduce code for execution in the cluster.

 

 

 

 

 

 

 

 

 

 

This publishing code creates a directory called “telco_churn” in my home directory in HDFS, /user/ankram. In a SAS Viya environment co-located with Hadoop, the “CASUSERHDFS” Caslib points to this location by default, which lets me verify that the “telco_churn” directory was successfully published.

 

 

 

 

The next step is to score the model in Hadoop. The code below scores the “looking_glass_v4” table in Hive and creates a new table called “looking_glass_v4_scored”, without the data ever leaving Hadoop.

 

 

 

 

 

 

 

 

If everything is configured properly, the log should show that the SAS Embedded Process executed correctly.

 

Using a previously set up Caslib called “Hivelib” that points to the default schema in the Hive Server, I can now load the “Looking_Glass_v4_scored” dataset into CAS to view the table.

 

 

 

 

Using the Table Viewer, I can then see the predicted probabilities of churn for each individual.

 

To conclude, many organizations have very large datasets, often terabytes or larger, and find that minimizing data movement is critical to successfully putting models into production. The in-database technologies for Hadoop on SAS Viya allow you and your fellow data scientists to easily prepare data and score large-scale models entirely in Hadoop, with the data never leaving the environment. You can now focus on solving more problems and are no longer at the mercy of large datasets and network latency.

 

 

 

 

Publishing and running models to Hadoop in SAS Viya was published on SAS Users.

September 5, 2018
 

Typically, when filters are applied in SAS Visual Analytics, they affect all the records and aggregations in linked objects. For example, in the typical sales report below, applying filters changes all the measures of the linked objects.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

With this kind of filtering, it becomes difficult to calculate measures that require a different level of aggregation. In the above image, the expectation is that ‘Total Customers’ should not change regardless of the ‘Region’, ‘State’, ‘Category’ and ‘Subcategory’ control selections. ‘Total Customers (Geo)’ should change based only on the ‘Region’ and ‘State’ control selections. ‘Total Customers (Geo and Prod)’ should change based on all the controls mentioned above. In the above example, only the ‘Total Customers (Geo and Prod)’ calculation is correct.

We will learn to create measures with different levels of aggregation by using the ‘Customer Penetration’ measure as an example.

          Customer Penetration = Distinct customers at selected geography and product level/ Distinct customers at selected geography level

Selective filtering may be used for creating similar reports like: Dealer Participation, Sales Contribution, etc. The below section exemplifies the creation of a customer penetration report with selective filtering.

Customer penetration using SAS Visual Analytics 8.2 (selective filtering)

Customer penetration is used to analyze whether marketing and sales strategies are working. Managers often use customer penetration or dealer participation measures along with other measures to gauge the popularity of a product, category or brand.

The report requirement is that the numerator in the ‘Customer Penetration’ formula should be filtered based on the region, state, category and subcategory list control selections, while the denominator should be filtered based only on the region and state list control selections. This is not the same as filtering the whole table through common list controls. In general, if you link a table with any control, all the measures in that table are filtered by the value(s) selected in that control, which is not what we need here. Instead of linking controls and tables, we will use control parameters to achieve our objective.

Assume we have a customer transaction table with following variables:

 

 

 

 

 

 

Before we move on, set up the basic report as shown in the image below:

 

Once you are ready with the report as per the above image, create parameters for ‘Region’, ‘State’, ‘Category’, ‘SubCategory’:

Region Parameter


 

State Parameter

 


Category Parameter

 


SubCategory Parameter

 

Now create the following two calculated items derived from ‘Customer_ID’:

Geo_Customer_ID
Equivalent to ‘Customer_ID’, but populated only for the selected geography levels; all other rows are set to missing.

 

 

 

 

 

 

Geo_and_Prod_Customer_ID
Equivalent to ‘Customer_ID’, but populated only for the selected geography and product levels; all other rows are set to missing.

 

Create the following two aggregated measures:

Total Customers (Geo)
The missing value of ‘Geo_Customer_ID’ is counted as one distinct value, so you need to subtract 1 from the distinct count.

 

Total Customers (Geo and Prod)
The missing value of ‘Geo_and_Prod_Customer_ID’ is counted as one distinct value, so you need to subtract 1 from the distinct count.

 

Now you can create an aggregated measure ‘Customer Penetration’.

Customer Penetration = Total Customers (Geo and Prod) / Total Customers (Geo)
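
The same arithmetic can be checked outside Visual Analytics. The sketch below is only an illustration: the six-row WORK.TRANSACTIONS table, its values, and the four macro variables standing in for the list control parameters are all made up for this example. Note that COUNT(DISTINCT ...) in PROC SQL skips missing values, so the "subtract 1" adjustment described above for Visual Analytics does not appear here.

```sas
/* Hypothetical sample data: one row per customer transaction */
data work.transactions;
   length Region State Category SubCategory Customer_ID $12;
   input Region $ State $ Category $ SubCategory $ Customer_ID $;
   datalines;
East NY Office Paper C001
East NY Office Pens C002
East NY Tech Phones C002
East NJ Office Paper C003
West CA Office Paper C004
West CA Tech Phones C001
;
run;

/* Stand-ins for the four list control parameters (assumed selections) */
%let p_region=East;
%let p_state=NY;
%let p_category=Office;
%let p_subcategory=Paper;

proc sql;
   select total_customers,
          geo_customers,
          geo_prod_customers,
          geo_prod_customers / geo_customers as customer_penetration format=percent8.1
   from (select count(distinct Customer_ID) as total_customers,
                /* Customer_ID kept only for the selected geography */
                count(distinct case when Region="&p_region" and State="&p_state"
                                    then Customer_ID else '' end) as geo_customers,
                /* Customer_ID kept only for the selected geography and product */
                count(distinct case when Region="&p_region" and State="&p_state"
                                     and Category="&p_category"
                                     and SubCategory="&p_subcategory"
                                    then Customer_ID else '' end) as geo_prod_customers
         from work.transactions) as t;
quit;
```

With these made-up rows, two distinct customers fall in East/NY and one of them bought Office/Paper, so the penetration is 1 / 2 = 50%, while the overall distinct customer count stays at 4 whatever the selections are.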

 

The final report will look like this:

 

 

 

 

 

 

 

 

 

 

 

 

 

Comparative images with default and selective filtering implementation:

 

If you compare the above images, you will see the difference in the highlighted measures: in the first image, the aggregation level is based on selective filtering, while in the second image, the aggregation level is uniform.

Note: ‘Total Customers’ is the count of distinct ‘Customer_ID’ values, i.e., the total customer count is independent of the geography and product hierarchy selections.

Conclusion

This process allows you to use control parameters in ‘If Then Else…’ statements to create a variable (calculated item) having character values. You can utilize this feature in several other applications – this is just one way you can use parameters to fulfil a business requirement.

Selective filtering in SAS Visual Analytics 8.2 was published on SAS Users.

August 30, 2018
 

A combination of SAS Grid Manager and SAS Viya can change the game for IT leaders looking to take on peak computing demands without sacrificing reliability or driving higher costs.  Maybe that’s why we fielded so many questions about SAS Grid Manager and SAS Viya in our recent webinar  about how the two can work together to process massive volumes of data – fast.

Participants asked us so many great questions that we wanted to share the answers here, assuming that you may have the same questions.  This is the first of two blog posts focusing on some of the very best questions we received.  Stay tuned for more soon – and if you don’t see your own burning questions posed here, just post your question in the comments and we’ll respond.

1. Do SAS Grid Manager and SAS Viya need to be collocated in the same data center?

And if they’re in different data centers, can SAS Grid Manager and SAS Viya communicate with one another? Functionally, this shouldn’t be a concern: if there’s network connectivity between the two data centers, they can communicate.  Just be mindful of a few things:

  • The size of data being processed in each environment.
  • How much data is going back and forth between the two.
  • The impact all that data movement can have on response times and overall performance.

In addition to performance, there's a greater sensitivity around data handling in physically separated deployments. The emergence of stricter data protection regulations increases the complexity of compliance when moving data between locations with different legal jurisdictions. It will be important to consider the additional performance implications of encryption of the transferred data as well. Ultimately, having compute as close as possible to the data it needs results in less complexity and better performance.

In this case it is important to remember the old saying “just because you can doesn’t mean that you should.”  When SAS Grid Manager and SAS Viya are collocated, they can share the same data. To be clear, sharing the same data (data sets, for example) has implications. For instance, the data cannot be open by both a process running on the grid and the SAS Viya analytics server at the same time. If your business processes can accommodate this requirement, then sharing the same physical copies of data in storage may save your organization money as well as ease compliance efforts.  I also cannot sufficiently stress the need to complete a proof of concept with production data volumes and job complexity to compare the performance of hosting SAS Grid Manager and SAS Viya in the same data center with the performance of hosting them in geographically separated data centers.

2. Should SAS Viya and SAS Grid Manager 9.4M5 run on the same OS for integration purposes?

From an integration perspective, they can definitely run on different operating systems. In fact, as of SAS 9.4M5, you can access a CAS (Cloud Analytic Services) server from Solaris, AIX, or 64-bit Windows. The CAS server itself will run on Linux, and soon we’ll roll out SAS Viya for Windows Server.

Short version: Crossing operating systems does not present a functional problem.  Those of you who have used SAS in heterogeneous environments know that there are performance implications when processing data that is not native to the running session.  You should carefully consider the performance implications of deploying in a heterogeneous topology before committing to a mixed environment.

3. How do SAS Viya and SAS Grid Manager compare in terms of complexity – particularly in the context of platform administration?

They’re actually very comparable.  Some level of detail varies but in many cases the underlying concepts and effort required are similar – especially when you reach the level of multi-node administration, keeping multiple hosts patched, and those sorts of issues.

4. SAS Grid Manager requires high-performance storage. Do we need to have that same level of storage (such as IBM General Parallel File System) for SAS Viya? 

No. SAS Viya relies on Cloud Analytic Services (CAS), so it doesn’t have the same storage requirements as SAS Grid Manager.  It’s more like what you’d find in a Hadoop environment: the CAS reference architecture is mainly a collection of nodes with local storage, which allows SAS Viya to memory-map data to disk so that jobs needing more than the total RAM can use the disk cache to continue running. SAS Viya can ingest data serially or in parallel, so customers who are able to use cost-efficient distributed file systems can move data into CAS in parallel.

Note, the shared file system that is part of an existing SAS Grid Manager environment could be further leveraged as a means to share data between the SAS Grid and SAS Viya environments.

SAS’ recommended IO throughput for SAS Grid Manager deployments is based upon years of experience with customers who have been unsatisfied with the performance of their chosen storage.  The resulting best practice is one that minimizes performance complaints and allows customers to process very large data in the timeliest manner. If SAS Viya is deployed with a multi-node analytics server (MPP mode) then a shared file system is required. The SAS Viya Cloud Analytic Server (CAS) has been designed with Network File System (NFS) in mind.

Our customers get the best value from environments built with a blend of storage solutions, including both shared file systems for job/user/application concurrency as well as less expensive distributed storage for workloads that may not require concurrency, like large machine learning and AI training problems. The latter is where SAS Viya shines.

These were all great questions that we thought deserved more detail than we could offer in a webinar – and there are more!  Soon we’ll post a second set of questions that you can use to inform your work with SAS Grid Manager and SAS Viya.  In the meantime, feel free to post any further questions in the comment section of this post.  We’ll answer them quickly.

4 FAQs about SAS Grid Manager and SAS Viya was published on SAS Users.

August 24, 2018
 

I know a lot of you have been programming in SAS for a long time, which is awesome! However, when you do something for a long time, sometimes you get set in your ways and you miss out on new ways of doing things.

Although the COUNT and CAT functions have been around for a while now, I see a lot of customer code that is counting and concatenating text strings the "old-fashioned" way. In this article, I would like to introduce you to the COUNT, COUNTW, CATS and CATX functions. These functions make certain tasks much simpler, like counting words in a string and concatenating text together.

Counting words or text occurrences

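First let's take a look at the COUNT and COUNTW functions.
The COUNT function counts the number of times that a substring occurs within a character string. The example below counts the occurrences of 'inc' in a list of company names: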

Data a;
  Contributors='The Big Company INC, The Little Company, ACME Incorporated,    Big Data Co, Donut Inc.';
  Num=count(contributors,'inc','i');  /* the 'i' modifier means to ignore case*/
  Put num=;
Run;

When we examine the SAS log, we can see that NUM has a value of 3.

Num=3
NOTE: The data set WORK.A has 1 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.02 seconds
      cpu time            0.01 seconds

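The COUNTW function counts the number of words in a character string. Before COUNTW, stepping through the delimited values in a string typically required a loop like this one: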

/* DON'T USE - use COUNTW instead */
data a(drop=done i);
  x='a#b#c#d#e';
  do until(done);
    i+1;
    y=scan(x,i,'#');
    if y='' then done=1;
    else output;
  end;
  run;

I realize this code isn't terrible, but I try to avoid DO UNTIL/WHILE loops if I can. There is always the possibility of going into an infinite loop.
The COUNTW function eliminates the need for a DO UNTIL/WHILE loop.
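
As a sketch of that point, the loop above can be rewritten with a fixed upper bound supplied by COUNTW. It uses the same sample string and '#' delimiter as the DO UNTIL version:

```sas
data a(drop=i);
  x='a#b#c#d#e';
  /* COUNTW returns the number of '#'-delimited words, so the loop bound is known up front */
  do i=1 to countw(x,'#');
    y=scan(x,i,'#');
    output;
  end;
run;
```

Because the number of iterations is fixed before the loop starts, there is no DONE flag to maintain and no risk of the loop running forever.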

Here is an example of logic that I use all the time. In this example, I have a macro variable that contains a list of values that I want to loop through. I can use the COUNTW function to easily loop through each file listed in the resolved value of &FILE_NAMES. The code then uses the file name on the DATA statement and the INFILE statement.

%let file_names=01JAN2018.csv 01FEB2018.csv 01MAR2018.csv 01APR2018.csv;
%macro test(files);
  %local i file;
  /* Loop once per blank-delimited file name passed in FILES */
  %do i=1 %to %sysfunc(countw(&files,%str( )));
    %let file=%scan(&files,&i,%str( ));
    data _%scan(&file,1,.);
      infile "c:\my files\&file";
      input region $ manager $ sales;
    run;
  %end;
%mend;
%test(&file_names)

The log is too large to list here, but you can see one of the generated DATA steps in the MPRINT output of this snapshot of the log.

MPRINT(TEST):   data _01JAN2018;
MPRINT(TEST):   infile "c:\my files\01JAN2018.csv";
MPRINT(TEST):   input region $ manager $ sales;
MPRINT(TEST):   run;

This data step will be generated for each file listed.

Counting strings within another text string should be easy to do. The COUNT functions definitely make this a reality!

Concatenating strings in SAS

Now that we know how to COUNT text in SAS, let me show you how to CAT in SAS with the CATS and CATX functions.

Back in the old days, I had hair(!) and we concatenated text strings using double pipes syntax.

  X=var1||var2||var3;

This syntax is not too bad, but what if VARn has trailing blanks? Prior to SAS Version 9 you had to remove the trailing blanks from each value. Also, if the text was right justified, you had to left justify the text. This complicates the syntax:

X=trim(left(var1))||trim(left(var2))||trim(left(var3));

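You can now accomplish the same thing using CATS. The CATS function strips the leading and trailing blanks from each value and then concatenates the results:

data a;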
  length var1 var2 var3 $12;
  var1='abc';
  var2='123';
  var3='xyz';
  x=cats(var1,var2,var3);
  put x=;
run;

VAR1-VAR3 have a length of 12, which means each value contains trailing blanks. By using the CATS function, all trailing blanks are removed and the text is concatenated without any spaces between the text. Here is the result of the above PUT statement.

x=abc123xyz

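Another common need when concatenating text is to create a delimited string. This can now be done using the CATX function. The CATX function strips the leading and trailing blanks from each value and inserts a delimiter between the values: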

data a;
  length var1 var2 var3 $12;
  var1='abc';
  var2='123';
  var3='xyz';
  x=catx(',',var1,var2,var3);
  put x=;
run;

This syntax creates a comma separated list with all leading and trailing blanks removed. Here is the result of the PUT statement.

x=abc,123,xyz

Just as the COUNT functions make counting text in SAS easier, the CAT functions make concatenating strings so much easier. For more explanation and examples of these CAT* functions, see this paper by Louise Hadden, Purrfectly Fabulous Feline Functions (because they are CAT functions, get it?).

Before I let you go, let me point out that in addition to the COUNT, COUNTW, CATS and CATX functions, there are also the COUNTC, CAT, CATQ and CATT functions that provide even more functionality. These functions are not used as often, so I haven't discussed them here.

How to COUNT CATs in SAS was published on SAS Users.

August 24, 2018
 

Dynamic programming is a powerful technique for implementing algorithms, and it is often used to solve complex computational problems. Some applications are world-changing, such as aligning DNA sequences; others are more "everyday," such as spelling correction. If you search for "dynamic programming," you will find lots of material, including sample programs written in other programming languages such as Java, C, and Python, but no SAS sample programs. SAS is a powerful language, and of course SAS can do it! This article demonstrates how to implement a dynamic programming function in SAS, using an edit distance calculation (the kind of problem that the SPEDIS function addresses) as the example.

What is edit distance?

Edit distance is a metric used to measure dissimilarity between two strings, by matching one string to the other through insertions, deletions, or substitutions.

What is minimum edit distance?

Minimum edit distance measures the dissimilarity between two strings through the least number of edit operations.
Take the two strings "BAD" and "BED" as an example. These two words can be matched in more than one way. One solution has an edit distance of 1, resulting from one substitution of the letter "A" with "E".

Another solution is to delete the letter "A" from "BAD" and then insert the letter "E" between "B" and "D", which results in an edit distance of 2.

There are several variants of edit distance, depending on the cost of each edit operation. For example, given the string pair "BAD" and "BED", if each operation has a cost of 1, then the edit distance is 1; if we set the substitution cost to 2 (Levenshtein edit distance), then the edit distance is 2.
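
To make the two cost settings concrete, here is a minimal sketch that fills the distance table bottom-up in a DATA step for "BAD" and "BED". It is an illustration written for this comparison, not the FCMP program presented later in the article; the variable names and the 0:20 array bounds are arbitrary choices.

```sas
data _null_;
   length s t $20;
   s='BAD';
   t='BED';
   /* d[i,j] = cost of turning the first i letters of s into the first j letters of t */
   array d[0:20, 0:20] _temporary_;
   m=length(s);
   n=length(t);
   do subcost=1, 2;                 /* substitution cost: 1, then 2 */
      do i=0 to m; d[i,0]=i; end;   /* i deletions reach the empty string */
      do j=0 to n; d[0,j]=j; end;   /* j insertions build t from nothing  */
      do i=1 to m;
         do j=1 to n;
            /* matching letters cost nothing; a mismatch costs SUBCOST */
            delta=ifn(substr(s,i,1)=substr(t,j,1), 0, subcost);
            d[i,j]=min(d[i-1,j]+1, d[i,j-1]+1, d[i-1,j-1]+delta);
         end;
      end;
      dist=d[m,n];
      put subcost= dist=;
   end;
run;
```

The log shows dist=1 when the substitution cost is 1 and dist=2 when it is 2, matching the two values quoted above.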

What is dynamic programming?

The standard method used to solve minimum edit distance uses a dynamic programming algorithm.

Dynamic programming is a method used to solve complex problems by breaking them into simpler sub-problems and solving these recursively. Partial solutions are saved in a big table, so they can be quickly accessed for successive calculations while avoiding repetitive work. Through this process of building on each preceding result, we eventually solve the original, challenging problem efficiently. Many difficult issues can be resolved using this method.

Here's the algorithm that solves Levenshtein edit distance through dynamic programming:
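
Written as a recurrence, with notation chosen here for illustration (D for the distance table, x and y for the two strings of lengths m and n), the calculation performed by the program in the appendix is the following, where the answer is D(m, n):

```latex
\begin{aligned}
D(i,0) &= i, \qquad D(0,j) = j,\\[4pt]
D(i,j) &= \min
\begin{cases}
D(i-1,\,j) + 1 & \text{(deletion)}\\
D(i,\,j-1) + 1 & \text{(insertion)}\\
D(i-1,\,j-1) + \delta(x_i, y_j) & \text{(substitution or match)}
\end{cases}\\[4pt]
\delta(x_i, y_j) &=
\begin{cases}
0 & \text{if } x_i = y_j\\
2 & \text{otherwise}
\end{cases}
\end{aligned}
```

Each entry depends only on its three neighbors to the left, below, and diagonally below-left, which is why storing the partial results in a table avoids recomputing them.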

The following image shows the annotated SAS program that implements the algorithm. The complete code (which you can copy and test for yourself) is at the end of this article.
[Image: SPEDIS algorithm]

To demonstrate the edit distance function usage and validate the intermediate edit distance matrix table, I used the string pair "INTENTION" and "EXECUTION" that I copied from Stanford's class material as an example. (This same resource also shows how the technique applies to DNA sequence alignment.)

options cmplib=work.funcs; 
data test;  
   infile cards missover;
   input word1 : $20. word2 : $20.;
   d=editDistance(word1,word2); 
   put d=;
cards;
INTENTION EXECUTION
;
run;

The Levenshtein edit distance between "INTENTION" and "EXECUTION" is 8.

The edit distance table is as follows.

N  8  9 10 11 12 11 10  9  8
O  7  8  9 10 11 10  9  8  9
I  6  7  8  9 10  9  8  9 10
T  5  6  7  8  9  8  9 10 11
N  4  5  6  7  8  9 10 11 10
E  3  4  5  6  7  8  9 10  9
T  4  5  6  7  8  7  8  9  8
N  3  4  5  6  7  8  7  8  7
I  2  3  4  5  6  7  6  7  8
   E  X  E  C  U  T  I  O  N

 

More dynamic programming applications

Dynamic programming is a powerful technique, and it can be used to solve many complex computation problems. Anna Di and I are presenting a paper at the PharmaSUG 2018 China conference to demonstrate how to align DNA sequences with SAS FCMP and SAS Viya. If you are interested in this topic, please look for our paper after the conference proceedings are published.

Appendix: Complete SAS program for editDistance function

proc fcmp outlib=work.funcs.spedis;
function editDistance(query $, keyword $);
   array distance[1,1]/nosymbols;    
   m = length(query);
   n = length(keyword);   
   call dynamic_array(distance, m+1, n+1); 
   do i=1 to m+1;
      do j=1 to n+1;
         distance[i, j]=-1;  
      end;
   end;
 
   i_max=m;
   j_max=n;
 
   dist = edDistRecursive(query, keyword, distance, m, n);
 
   do i=i_max to 1 by -1;
      do j=1 to j_max;
         put distance[i, j] best3. @;  
      end;
      put;
   end;
 
   return (dist);
endsub; 
 
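function edDistRecursive(query $, keyword $, distance[*,*], m, n);
   outargs distance;   /* the memo table is updated in place */

   /* Base cases: one string is empty, so the cost is the other string's length */
   if m = 0 then
      return (n);
   if n = 0 then
      return (m);

   /* Reuse a sub-problem that has already been solved (-1 means "not yet") */
   if distance[m,n] >= 0 then
      return (distance[m,n]);

   /* Substitution costs 0 for a match and 2 for a mismatch */
   if (substr(query,m,1) = substr(keyword,n,1)) then
      delta = 0;
   else
      delta = 2;

   /* Cheapest of: substitute or match, delete, insert */
   ans = min(edDistRecursive(query, keyword, distance, m - 1, n - 1) + delta, 
             edDistRecursive(query, keyword, distance, m - 1, n) + 1, 
             edDistRecursive(query, keyword, distance, m, n - 1) + 1);
   distance[m,n] = ans;
   return (distance[m,n]);   
endsub;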
run;
quit; 
 
options cmplib=work.funcs; 
data test;  
   infile cards missover;
   input word1 : $20. word2 : $20.;
   d=editDistance(word1,word2); 
   put d=;
cards;
INTENTION EXECUTION
;
run;

Dynamic programming with SAS FCMP was published on SAS Users.

August 20, 2018
 

As SAS Global Forum 2019 Conference Chair, I am excited to share our plans for an extraordinary SAS experience in Dallas, Texas, April 28-May 1. SAS has always been my analytic tool of choice and has provided me an amazing journey in my professional career.  Attending many SAS global conferences throughout the years, I've found this is the best venue to learn, contribute and network with other SAS colleagues.

The content will not disappoint – submit your abstract now
What presentation topics will you hear at SAS Global Forum? In 2019, expanded topics will include deep dives into text analytics, business visualization, AI/Machine Learning, SAS administration, open source integration and career development discussions. As always, SAS basic programming techniques using the current SAS software environment will be the underpinning of many sessions.

Do you want to be a part of that great content but have never presented at a conference? Now could be the time to learn new skills by participating in the SAS Global Forum Presenter Mentoring Program. Available to first-time presenters, the program pairs you with a seasoned expert who will guide you in preparing and delivering a professional presentation. The mentors can also help before you even submit your abstract in the call for content, which is now open. Get help crafting the title and abstract for your submission. The open call is available through October 22.

How to get to Dallas
Did you know there are scholarships and awards to help SAS users attend SAS Global Forum? Our award program has been enhanced for New SAS Professionals. This award provides an opportunity for new SAS users to enhance their basic SAS skills as well as learn about the latest technology in their line of business. Hear what industry experts are doing to solve business issues by sharing real-world evidence cases.

In addition, we're offering an enhanced International Professional Award focused on global engagement and participation. This award is available for those outside of the continental 48 states to share their expertise and industry solutions from around the world.  At the conference, you will have a chance to learn from and network with international professionals who work on analytic projects similar to your own.

Don’t miss out on this valuable experience! Submit your abstract for consideration to be a presenter! I look forward to seeing you in Dallas and hearing about your work.

Call for content now open for SAS Global Forum 2019 was published on SAS Users.

August 17, 2018
 

Data density estimation is often used in statistical analysis as well as in data mining and machine learning. Visualizing a density estimate shows characteristics of the data such as its distribution, skewness and modality. The most widely used visualizations for data density are box plots, histograms, kernel density estimates, and some other plots. SAS has several procedures that can create such plots. Here, I'll visualize kernel density estimates superimposed on a histogram using SAS Visual Analytics.

A histogram shows the data distribution through a series of interval bins, and it is a very useful way to present the data distribution. With a histogram, we get a rough view of the density of the values. However, the bin width (or number of bins) has a significant impact on the shape of a histogram and thus gives viewers different impressions. For example, the two histograms below show the same data, the left one with 6 bins and the right one with 4 bins. Different bin widths show different distributions for the same data. In addition, a histogram is not smooth enough to compare visually with mathematical density models. Thus, many people use kernel density estimates, which vary more smoothly across the distribution.
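
A quick way to see this effect is to plot one variable twice with different bin counts. The sketch below uses MSRP from SASHELP.CARS purely as an illustration (the article does not say which data its two example histograms use) and the NBINS= option of the HISTOGRAM statement in PROC SGPLOT. The procedure may adjust the requested bin count slightly to produce readable tick values.

```sas
/* Same data, two different bin counts */
title 'MSRP with 6 bins requested';
proc sgplot data=sashelp.cars;
   histogram MSRP / nbins=6;
run;

title 'MSRP with 4 bins requested';
proc sgplot data=sashelp.cars;
   histogram MSRP / nbins=4;
run;
title;
```

Comparing the two plots side by side shows how much the apparent shape of the distribution depends on the binning.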

Kernel density estimation (KDE) is a widely used non-parametric approach to estimating the probability density of a random variable. Non-parametric means that the estimate adjusts to the observations in the data, and it is more flexible than parametric estimation. To plot a KDE, we need to choose the kernel function and its bandwidth. The kernel function is used to compute the kernel density estimates. The bandwidth, which is essentially the width of the sliding window used to generate the density, controls the smoothness of the KDE plot. SAS offers several ways to generate kernel density estimates. Here I use PROC UNIVARIATE to create KDE output as an example (for simplicity, I set c = SJPI to have SAS select the bandwidth by using the Sheather-Jones plug-in method), then make the corresponding visualization in SAS Visual Analytics.
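
In formula form, the standard kernel density estimate for observations x_1, ..., x_n with kernel function K and bandwidth h is shown below. The symbols are generic textbook notation rather than PROC UNIVARIATE option names, and the Gaussian kernel is given only as one common choice of K.

```latex
\hat{f}_h(x) = \frac{1}{n h} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right),
\qquad
K(u) = \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2} \quad \text{(Gaussian kernel)}
```

A larger h spreads each observation's contribution over a wider window and gives a smoother curve; a smaller h follows the individual observations more closely.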

Visualize the kernel density estimates using SAS code

It is straightforward to compute kernel density estimates using PROC UNIVARIATE. Take the variable MSRP in the SASHELP.CARS dataset as an example. The minimum and maximum values of the MSRP column are 10280 and 192465, respectively. I plot the histogram with 15 bins in this example. Below is the sample code segment I used to construct kernel density estimates of the MSRP column:

title 'Kernel density estimates of MSRP';
proc univariate data = sashelp.cars noprint;	
   histogram MSRP / kernel (c = SJPI) endpoints = 10280 to 192465 by 12145 outkernel = KDE  odstitle = title; 
run;

Run the above code in SAS Studio, and we get the following graph.

Visualize the kernel density estimates using SAS Visual Analytics

  1. In SAS Visual Analytics, load the SASHELP.CARS dataset and the KDE dataset (from the previous PROC UNIVARIATE step) to the CAS server.
  2. Drag and drop a ‘Precision Container’ onto the canvas, and put a histogram and a numeric series plot in the container.
  3. Assign the corresponding data to the histogram plot: assign CARS.MSRP as the histogram Measure, and ‘Frequency Percent’ as the histogram Frequency. Then set the histogram options as follows:
    Object -> Title: No title;

    Graph Frame: Grid lines: disabled

    Histogram -> Bin range: Measure values; check the ‘Set a fixed bin count’ and set ‘Bin count’ to 15.

    X Axis options:

       Fixed minimum: 10280

       Fixed maximum: 192465

       Axis label: disabled

       Axis Line: enabled

       Tick value: enabled

    Y Axis options:

       Fixed minimum: 0

       Fixed maximum: 0.5

       Axis label: disabled

       Axis Line: disabled

       Tick value: disabled

  4. Assign the corresponding KDE data to the numeric series plot. Define a calculated item, Percent, as (‘Percent of Observations Per Data Unit’n / 100) with the format ‘PERCENT12.2’, and assign it to the ‘Y axis’; assign ‘Data Value’ to the ‘X axis’. Now set the options of the numeric series plot as follows:
    Object -> Title: No title;

    Style -> Line/Marker: (change the first color to purple)

    Graph Frame -> Grid lines: disabled

    Series -> Line thickness: 2

    X Axis options:

       Axis label: disabled

       Axis Line: disabled

       Tick value: disabled

    Y Axis options:

       Fixed minimum: 0

       Fixed maximum: 0.5

       Axis label: enabled

       Axis Line: enabled

       Tick value: enabled

    Legend:

       Visibility: Off

  5. Now we can overlay the two charts. As the screenshot below shows, SAS Visual Analytics 8.3 provides a smart guide for the precision container, which displays grid lines to help you align the objects in it. If you hold the Ctrl key while dragging the numeric series plot over the histogram, the smart guide displays finer grid lines to help with basic alignment. Getting a precise overlay is a little tricky, though, so you may need to fine-tune the Left/Top/Width/Height values under Layout in the VA Options panel. The goal is to make the axis intersections of the two charts coincide.

After that, we can add a text object above the charts as a title, and we are done: the kernel density estimate is superimposed on the histogram, as shown in the screenshot below, much like the output from PROC UNIVARIATE. (If you'd like to use the UNIVAR statement in PROC KDE for density estimation, you can visualize its output in SAS Visual Analytics in a similar way.)
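For the PROC KDE route mentioned in the parenthesis, a minimal sketch might look like the following. It is not taken from the original post: the output data set name work.kde_univar is my own choice, and I use PROC CONTENTS rather than assuming the names of the output variables.

```sas
proc kde data=sashelp.cars;
   /* METHOD=SJPI requests the Sheather-Jones plug-in bandwidth; */
   /* OUT= saves the grid of density estimates to a data set      */
   univar MSRP / method=sjpi out=work.kde_univar;
run;

/* List the variables in the output before mapping them to a plot */
proc contents data=work.kde_univar;
run;
```

The resulting table can then be loaded to CAS and assigned to a numeric series plot in the same way as the OUTKERNEL= data set.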

To go further, I made a KDE with a scatter plot, where the little circles also give an impression of the data density, and another KDE with a needle plot, where the barcode-like lines represent the data density. Both are created in much the same way as the histogram example above.

So far, I’ve shown you how I visualize KDE using SAS Visual Analytics. There are other ways to visualize kernel density estimates in SAS Visual Analytics; for example, you can create a custom graph in Graph Builder and import it into SAS Visual Analytics. In any case, a KDE is a good visualization for understanding more about your data. Why not give it a try?

Visualizing kernel density estimates in SAS Visual Analytics was published on SAS Users.

August 15, 2018
 

SAS Viya 3.4 has some new functionality that provides real help for those who want to transition from SAS Visual Analytics on 9.4 to SAS Viya. In prior releases of SAS Viya, you could promote reports and explorations (and a few other supporting objects). In SAS Viya 3.4, promotion support is added for many additional SAS 9.4 resources, making it easier to make the leap to SAS Viya. In this blog, I will review this new functionality.

In SAS Viya 3.4, the following objects participate in promotion from SAS 9.4.

  • Configuration
    • Identities
    • Authorization
    • Data definitions
  • Content
    • Folders
    • Reports
    • Explorations
    • Stored processes
    • Supporting resources (such as themes, images, graph templates)

The details of support for each resource are unique and are discussed below.

Identities

Users and groups are promoted from SAS 9.4 to SAS Viya so that the authorization settings associated with content can carry over to the target environment. Metadata is exported to support the mapping of SAS 9.4 identity metadata (Users and Groups) to SAS Viya identities (Users, Groups and Custom groups).

During promotion of identity metadata:

  • User connections are mapped from the metadata DefaultAuth:logonid to the SAS Viya identity id
  • Metadata-only groups from SAS 9.4 are converted to SAS Viya Custom groups (except SAS General Servers and SAS System Services)
  • If custom groups of the same name (or sometimes the same purpose but a different name) exist in the target, the group is preserved and any mapped members from the source system are added to the group.

Authorization

Identities are “promoted” to support re-implementation of authorization. You do not have to export authorization explicitly, because it is included with libraries, tables, folders and reports when they are exported. Promotion of authorization is optional. If you would rather re-implement authorization in SAS Viya than carry it over, you can switch this functionality off at import time.

SAS Viya has two authorization systems: the general authorization system for folders and content, and the CAS authorization system for data. These authorization systems are different from the metadata authorization model in SAS 9.4. So what happens when you promote content that includes authorization?

General Authorization (folders and content)

Promotion will attempt to convert SAS 9.4 authorization to rules in the General authorization system.  During the process:

  • Explicit Access Control Entries are converted to SAS Viya Rules
  • Access Control Entries with denials are discarded
  • Access Control Templates are not promoted

In addition, if an object (folder/report):

  • does not exist in the target environment, relevant authorization is set for the object and the access control entries from the source are implemented as rules on the object.
  • exists in the target environment, then access control entries from the source are merged with any pre-existing authorizations in the target environment.

CAS Authorization

The CAS authorization system covers CASlibs and data.  Promotion will attempt to convert SAS 9.4 authorization on libraries and tables to access controls in the CAS authorization system. During the process:

  • Access Control Entries are not promoted unless they are applied directly to a library or table.
  • Access Control Entries are converted to CAS access controls.
  • Row-level permissions are preserved.
  • If an object exists in the target environment no authorization settings are imported.
  • Access Control Templates are not promoted.

For details of how individual permissions for both data and content are mapped from SAS 9.4 to SAS Viya, see the documentation, which has great coverage of the steps to follow.

The Process

To finish off, I'll share a few observations on the process of exporting from SAS 9.4 and importing into SAS Viya. As with SAS 9.4 promotion, you need to import in a specific order. This allows the software to make the relevant connections to dependent resources. For example, if the caslib already exists in the target, then imported tables can be mapped to it. Typically, the order is: identities > library definitions > tables > reports and folders. To support this process, make sure that, during export, you create a separate package for each resource type. Here are some considerations for the export process.

You should export:

  • Identities (users and groups) from the security folders in SAS 9.4 metadata to a separate package.
  • Only groups that you need in the target environment (you can filter out irrelevant SAS 9 groups at export time).
  • LASR and Base Libraries and tables directly from the library definition in the folder tree (this prevents extraneous folders being created in the target environment).
  • Libraries in a separate package from tables so that they may be imported first and be available for mapping when the tables are imported.
  • Content and reports from the base of the folder tree so that all directly applied access control entries will be included in the package.

Prior to importing, make sure that users and groups are configured correctly in LDAP. Keep in mind that physical data is not promoted, so ensure that the required data and formats are accessible to the SAS Viya environment.

The new functionality for promotion is a great start in helping with the transition from SAS 9.4 to SAS Viya. Look for more functionality in future releases.

New functionality for transitioning from SAS Visual Analytics on 9.4 to SAS Viya was published on SAS Users.

August 13, 2018
 

Keeping data in the cloud makes it easily accessible and can help businesses run more smoothly. SAS Viya runs its calculations on SAS Cloud Analytic Services (CAS). David Shannon of Amadeus Software spoke at SAS Global Forum 2018 and presented his paper, Come On, Baby, Light my SAS Viya: Programming for CAS. (In addition to being an avid SAS user and partner, David must be an avid Doors fan.) This article summarizes David's overview of how to run SAS programs in SAS Viya and how to use CAS sessions and libraries.

If you're using SAS Viya, you're going to need to know the basics of CAS to be able to perform calculations and use SAS Viya to its potential. SAS 9 programs are compatible with SAS Viya, and will run as-is through the CAS engine.

Using CAS sessions and libraries

Use a CAS statement to kick off a session, then use CAS libraries (caslibs) to store data and resources. To start the session, simply code "cas;". Each CAS session is given its own unique identifier (UUID) that you can use to reconnect to the session.

A few key statements can help you master CAS operations. Consider these examples, based on a CAS session that David named "speedyanalytics":

  • What CAS sessions do I have running?
    cas _all_ list;
  • Get the version and license specifics from the CAS server hosting my session:
    cas speedyanalytics listabout;
  • I want to sign out of SAS Studio for now, so I will disconnect from my CAS session, but return to it later…
    cas speedyanalytics disconnect;
  • ...later, in the same or a different SAS Studio session, I want to reconnect to the CAS session I started earlier, using the UUID I previously grabbed from the macro variable or the SAS log:
    cas uuid="&speedyanalytics_uuid";
  • At the end of my program(s), shut down all my CAS sessions to release resources on the server:
    cas _all_ terminate;
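The reconnect example relies on a macro variable that holds the session UUID. One way to populate it, shown here as a sketch rather than as David's exact code, is the UUIDMAC= option when the session is started. The macro variable name matches the one used in the list above; the 1800-second timeout is an arbitrary value I chose so that the session survives for a while after a disconnect.

```sas
/* Start a named session; UUIDMAC= stores the session UUID in a macro variable. */
/* SESSOPTS=(TIMEOUT=...) sets how many seconds a disconnected session is kept. */
cas speedyanalytics uuidmac=speedyanalytics_uuid sessopts=(timeout=1800);

/* Write the UUID to the log so it can be copied for a later reconnect */
%put NOTE: speedyanalytics UUID = &speedyanalytics_uuid;
```

A disconnected session is cleaned up once its timeout elapses, so choose a value that covers the gap you expect between disconnecting and reconnecting.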

Using CAS libraries

CAS libraries (caslibs) are how you access data that is stored in memory, along with the related metadata.

From the library, you can load data into CAS tables in a few different ways:

  1. A DATA step can take a sample data set, calculate a new measure and store the output in memory
  2. Proc COPY can bring existing SAS data into a caslib
  3. Proc CASUTIL loads tables into caslibs
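The three approaches can be sketched as follows. This is an illustration under my own assumptions, not the code from David's paper: the session name mysess, the libref mycas, the CASUSER caslib, the source tables from SASHELP and the BMI calculation are all hypothetical choices.

```sas
/* Assumption: a CAS server is reachable and CASUSER is your personal caslib */
cas mysess;
libname mycas cas caslib=casuser;   /* CAS engine libref for the caslib */

/* 1. DATA step: read a sample data set, calculate a new measure, */
/*    and write the result to an in-memory CAS table              */
data mycas.classsi;
   set sashelp.class;
   /* SASHELP.CLASS stores weight in pounds and height in inches */
   bmi = (weight * 0.45359237) / ((height * 0.0254) ** 2);
run;

/* 2. PROC COPY: bring an existing SAS data set into the caslib */
proc copy in=sashelp out=mycas;
   select cars;
run;

/* 3. PROC CASUTIL: load a SAS data set into the caslib */
proc casutil;
   load data=sashelp.heart outcaslib="casuser" casout="heart";
quit;
```

All three tables land in session scope, so they disappear when the session ends unless they are saved or promoted.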

Proc CASUTIL also lets you save a table (named "classsi" in David's examples) for future use through the SAVE statement:

proc casutil;
 /* Write the in-memory table to the caslib's data source */
 save casdata="classsi" casout="classsi";
run;

And reload like this in a future session, using the LOAD statement:

proc casutil;
 /* Read the saved file from the caslib's data source back into memory */
 load casdata="classsi" casout="classsi";
run;

When accessing your CAS libraries, remember that different levels of scope can apply. "Session" scope means the data is visible only to the current session, whereas "Global" scope makes the data reachable from all CAS sessions.
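A small sketch of moving a table from session scope to global scope follows. It assumes that a CAS session is already active and that CASUSER is available; the table name class is hypothetical.

```sas
proc casutil;
   /* Loaded without PROMOTE, the table is in session scope: */
   /* only the current CAS session can see it                */
   load data=sashelp.class outcaslib="casuser" casout="class";

   /* PROMOTE moves the table to global scope so that other sessions */
   /* with access to the caslib can use it                           */
   promote casdata="class" incaslib="casuser" outcaslib="casuser";

   /* List the tables in the caslib to confirm the result */
   list tables incaslib="casuser";
quit;
```

The promote step fails if a global table of the same name already exists in the target caslib, so drop the existing table first or pick another name.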

Programming in CAS

Showing how to put CAS into action, David shared this diagram of a typical load/save/share flow:

Existing SAS 9 programs and CAS code can both be run in SAS Viya. Calculations and in-memory data handling occur in CAS, SAS Cloud Analytic Services. Before beginning, it's important to have a general understanding of CAS so that you can access CAS libraries and your data. For more about CAS architecture, read this paper from CAS developer Jerry Pendergrass.

The performance case for SAS Viya

To close out his paper, David outlined a small experiment he ran to demonstrate the performance advantages of SAS Viya v3.3 over a standard, stand-alone SAS v9.4 environment. The test was basic, but it performed reads, writes, and analytics on a 5GB table. The tests revealed about a 50 percent increase in performance between CAS and SAS 9 (see the paper for a detailed table of comparison metrics). SAS Viya is engineered for distributed computing (which works especially well in cloud deployments), so more extensive tests could reveal even larger performance gains in many use cases.

A quick introduction to CAS in SAS Viya was published on SAS Users.

August 9, 2018
 

Recently I’ve been listening to the BBC Radio Series 50 Things That Made the Modern Economy, which was first broadcast in 2016. One of the episodes considers the impact of a simple box (the shipping container) and concludes its invention was a major contributor to the post-war boom in global trade. It’s worth a listen, if you can.

Notwithstanding the tenuous link, containerization is having perhaps an equally significant impact on cloud computing, and I want to share a recent experience that highlights the convenience of containers. I’m not aiming to summarize all of the SAS initiatives in the cloud (including SAS Viya and Cloud Foundry) here; rather, I want to share a few observations about a specific offering for SAS 9.4.

Recently I attended a demonstration by SAS’ Doug Liming on SAS Analytics for Containers. While this product was launched in 2016, until now I confess I’d not appreciated its simplicity or potential. I’d like to use this blog post to share what I saw & learned because this session served as a bit of an epiphany for me.

As a reminder, SAS Analytics for Containers consists of:

    • Foundation SAS (Base, STAT & Graph) ready-packaged to be deployed in a Docker container.
    • SAS Studio.
    • Optional SAS/Access connectors & Accelerators.

In the space of 20 minutes, Doug took us through the ...

The power and potential of simplicity: SAS 9.4 and Containers was published on SAS Users.