Global Technology Practice

January 27, 2018
 

Are you interested in using SAS Visual Analytics 8.2 to visualize a state by regions, but all you have is a county shapefile?  As long as you can cross-walk counties to regions, this is easier to do than you might think.

Here are the steps involved:

Step 1

Obtain a county shapefile and extract all components to a folder. For example, I used the US Counties shapefile found in this SAS Visual Analytics community post.

Note: The shapefile is a geospatial data format developed by ESRI. Shapefiles are composed of multiple files. When you unzip the shapefile found on the community site, make sure to extract all of its components, not just the .shp file. You can get more information about shapefiles from this Wikipedia article: https://en.wikipedia.org/wiki/Shapefile.

Step 2

Run PROC MAPIMPORT to convert the shapefile into a SAS map dataset.

libname geo 'C:\Geography'; /*location of the extracted shapefile*/
 
proc mapimport datafile="C:\Geography\UScounties.shp"
out=geo.shapefile_counties;
run;

Step 3

Add a Region variable to your SAS map dataset. If all you need is one state, you can subset the map dataset to keep just the state you need. For example, I only needed Texas, so I used the State_FIPS variable to subset the map dataset:

proc sql;
create table temp as select
*, 
/*cross-walk counties to regions*/
case
when name='Anderson' then '4'
when name='Andrews' then '9'
when name='Angelina' then '5'
when name='Aransas' then '11'
<……>
when name='Zapata' then '11'
when name='Zavala' then '8'
end as region 
from geo.shapefile_counties
/*subset to Texas*/
where state_fips='48'; 
quit;

Step 4

Use PROC GREMOVE to dissolve the boundaries between counties that belong to the same region. It is important to sort the county dataset by region before you run PROC GREMOVE.

proc sort data=temp;
by region;
run;
 
proc gremove
data=temp
out=geo.regions_shapefile
nodecycle;
by region;
id name; /*name is county name*/
run;

Step 5

To validate that your boundaries dissolved correctly, run PROC GMAP to view the regions. If the regions do not look right when you run this step, it may signal an issue with the underlying data. For example, when I ran this with a county shapefile obtained from the Census Bureau, I found that some of the counties were mislabeled, which, of course, caused the regions not to dissolve correctly.

proc gmap map=geo.regions_shapefile data=geo.regions_shapefile all;
   id region;
   choro region / nolegend levels=1;
run;

Here’s the result I got, which is exactly what I expected:

Custom Regional Maps in SAS Visual Analytics

Step 6

Add a sequence number variable to the regions dataset. SAS Visual Analytics 8.2 needs it to properly define a custom polygon inside a report:

data geo.regions_shapefile;
set geo.regions_shapefile;
seqno=_n_;
run;

Step 7

Load the new regions dataset into SAS Visual Analytics.
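In SAS Visual Analytics 8.2 this step is typically done through the data import UI. As an alternative, here is a minimal sketch of loading and promoting the regions dataset to CAS with PROC CASUTIL; the session name and the target caslib Public are assumptions, not part of the original post.

cas mysession;
proc casutil outcaslib="public";
   /* load the SAS dataset created in Step 6 and promote it so reports can use it */
   load data=geo.regions_shapefile casout="regions_shapefile" promote;
quit;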

Step 8

In the dataset with the region variable that you want to visualize, create a new geography variable and define a new custom polygon provider.

Geography Variable:

Polygon Provider:

Step 9

Now, you can create a map of your custom regions:

How to create custom regional maps in SAS Visual Analytics 8.2 was published on SAS Users.

January 26, 2018
 

If you have worked with the different types of score code generated by the high-performance modeling nodes in SAS® Enterprise Miner™ 14.1, you have probably come across the Analytic Store (or ASTORE) file type for scoring. The ASTORE file type works very well for scoring complex machine learning models like random forests, gradient boosting, support vector machines and others. In this article, we will focus on ASTORE files generated by SAS® Viya® Visual Data Mining and Machine Learning (VDMML) procedures. An introduction to analytic stores on SAS Viya can be found here.

In this post, we will:

  1. Generate an ASTORE file for a gradient boosting model in SAS Visual Data Mining and Machine Learning.
  2. Override the scoring decision using PROC ASTORE.

Generate an ASTORE file for a gradient boosting model

Our example dataset is a distributed in-memory CAS table that contains information about applicants who were granted credit for a certain home equity loan. The categorical binary-valued target variable ‘BAD’ identifies if a client either defaulted or repaid their loan. The remainder of the variables indicating the candidate’s credit history, debt-to-income ratio, occupation, etc., are used as predictors for the model. In the code below, we are training a gradient boosting model on a randomly sampled 70% of the data and validating against 30% of the data. The statement SAVESTATE creates an analytic store file (ASTORE) for the model and saves it as a binary file named “astore_gb.”

proc gradboost data=PUBLIC.HMEQ;
partition fraction(validate=0.3);
target BAD / level=nominal;
input LOAN MORTDUE DEBTINC VALUE YOJ DEROG DELINQ CLAGE NINQ CLNO /
level=interval;
input REASON JOB / level=nominal;
score out=public.hmeq_scored copyvars=(_all_);
savestate rstore=public.astore_gb;
id _all_;
run;

Shown below are a few observations from the scored dataset hmeq_scored  where YOJ (years at present job) is greater than 10 years.

PROC ASTORE

Override the scoring decision using PROC ASTORE

In this segment, we will use PROC ASTORE to override the scoring decision from the gradient boosting model. To that end, we will first make use of the DESCRIBE statement in PROC ASTORE to produce basic DS2 scoring code using the EPCODE option. We will then edit the score code in DS2 language syntax to override the scoring decision produced from the gradient boosting model.

proc astore;
    describe rstore=public.astore_gb
        epcode="/viyafiles/jukhar/gb_epcode.sas"; run;

A snapshot of the output from the above code statements is shown below. The analytic store is assigned a unique string identifier. We also get information about the analytic engine that produced the store (gradient boosting, in this case) and the time when the store was created. In addition, though not shown in the snapshot below, we get a list of the input and output variables used.

Let’s take a look at the DS2 score code (“gb_epcode.sas”) produced by the EPCODE option in the DESCRIBE statement within PROC ASTORE.

data sasep.out;
	 dcl package score sc();
	 dcl double "LOAN";
	 dcl double "MORTDUE";
	 dcl double "DEBTINC";
	 dcl double "VALUE";
	 dcl double "YOJ";
	 dcl double "DEROG";
	 dcl double "DELINQ";
	 dcl double "CLAGE";
	 dcl double "NINQ";
	 dcl double "CLNO";
	 dcl nchar(7) "REASON";
	 dcl nchar(7) "JOB";
	 dcl double "BAD";
	 dcl double "P_BAD1" having label n'Predicted: BAD=1';
	 dcl double "P_BAD0" having label n'Predicted: BAD=0';
	 dcl nchar(32) "I_BAD" having label n'Into: BAD';
	 dcl nchar(4) "_WARN_" having label n'Warnings';
	 Keep 
		 "P_BAD1" 
		 "P_BAD0" 
		 "I_BAD" 
		 "_WARN_" 
		 "BAD" 
		 "LOAN" 
		 "MORTDUE" 
		 "VALUE" 
		 "REASON" 
		 "JOB" 
		 "YOJ" 
		 "DEROG" 
		 "DELINQ" 
		 "CLAGE" 
		 "NINQ" 
		 "CLNO" 
		 "DEBTINC" 
		;
	 varlist allvars[_all_];
	 method init();
		 sc.setvars(allvars);
		 sc.setKey(n'F8E7B0B4B71C8F39D679ECDCC70F6C3533C21BD5');
	 end;
	 method preScoreRecord();
	 end;
	 method postScoreRecord();
	 end;
	 method term();
	 end;
	 method run();
		 set sasep.in;
		 preScoreRecord();
		 sc.scoreRecord();
		 postScoreRecord();
	 end;
 enddata;

The sc.setKey call in the init() method block contains a string identifier for the analytic store; this is the same ASTORE identifier that was previously output as part of PROC ASTORE. In order to override the scoring decision created by the original gradient boosting model, we will edit the gb_epcode.sas file (shown above) by inserting new statements in the postScoreRecord method block; the edited file must follow DS2 language syntax. For more information, see the SAS DS2 Language Reference.

method postScoreRecord();
      if YOJ>10 then do; I_BAD_NEW='0'; end; else do; I_BAD_NEW=I_BAD; end;
    end;

Because we are saving the outcome into a new variable called “I_BAD_NEW,” we will need to declare this variable upfront along with the rest of the variables in the score file.
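For example, the declaration block of gb_epcode.sas could be extended with a line like the following (a sketch only; the length and label are illustrative), and "I_BAD_NEW" can also be added to the KEEP list so that it appears in the output table:

dcl nchar(32) "I_BAD_NEW" having label n'Into: BAD (override)';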

In order for this override to take effect, we will need to run the SCORE statement in PROC ASTORE and provide both the original ASTORE file (astore_gb), as well as the edited DS2 score code (gb_epcode.sas).

proc astore;
    score data=public.hmeq epcode="/viyafiles/jukhar/gb_epcode.sas"
          rstore=public.astore_gb
          out=public.hmeq_new; run;

A comparison of “I_BAD” and “I_BAD_NEW” in the output of the above code for select variables shows that the override rule for scoring has indeed taken place.

In this article we explored how to override the scoring decision produced by a machine learning model in SAS Viya. You will find more information about scoring in the PROC ASTORE documentation.

Using PROC ASTORE to override scoring decisions in SAS® Viya® was published on SAS Users.

January 24, 2018
 

This article and accompanying technical white paper are written to help SAS 9 users process existing SAS 9 code multi-threaded in SAS Viya 3.3.
Read the full paper, Getting Your SAS 9 Code to Run Multi-Threaded in SAS Viya 3.3.

The Future is Multi-threaded Processing Using SAS® Viya®

When I first began researching how to run SAS 9 code as multi-threaded in SAS Viya, I decided to compile all the relevant information and detail the internal processing rules that guide how the SAS Viya Cloud Analytic Services (CAS) server handles code and data. What I found is that there is a simple set of guidelines which, if followed, helps ensure that most of your existing SAS 9 code will run multi-threaded. I have stashed a lot of great information into a single white paper, available today, called "Getting Your SAS 9 Code to Run Multi-threaded in SAS Viya."

Starting with the basic distinctions between single and parallel processing is key to understanding why and how some of the parallel processing changes have been implemented. You see, SAS Viya technology constructs code so that everything runs in pooled memory using multiple processors. Redefining SAS for this parallel processing paradigm has led to huge gains in decreasing program run-times, as well as concomitant increases in accuracy for a variety of machine learning techniques. Using SAS Viya products helps revolutionize how we think about undertaking large-scale work because now we can complete so many more tasks in a fraction of the time it took before.

The new SAS Viya products bring a ton of value compared to other choices in the analytics marketplace. Unfortunately, most open-source libraries and packages, especially those developed for use in Python and R, are limited to single-threading. SAS Viya offers a way forward: you can keep coding in these languages while using an alternative set of SAS objects that run as parallel, multi-threaded, distributed processes. The real difference is in the shared-memory architecture, which is not the same as the parallel, distributed processing you hear claimed by most Hadoop and cloud vendors. Even though parallel, distributed processing is faster than single-threading, it proverbially hits a performance wall that is far below what pooled and persisted data provides when using multi-threaded techniques and code.

For these reasons, I believe that SAS Viya is the future of data/decision science, with shared memory running against hundreds if not thousands of processors, and returning results back almost instantaneously. And it’s not for just a handful of statistical techniques. I’m talking about running every task in the analytics lifecycle as a multi-threaded process, meaning end-to-end processing through data staging, model development and deployment, all potentially accomplished through a point-and-click interface if you choose. Give SAS Viya a try and you will see what I am talking about. Oh, and don’t forget to read my technical white paper that provides a checklist of all the things that you may need to consider when transitioning your SAS 9 code to run multi-threaded in SAS Viya!

Questions or Comments?

If you have any questions about the details found in this paper, feel free to leave them in the comments field below.

Getting your SAS 9 code to run multi-threaded in SAS Viya 3.3 was published on SAS Users.

January 18, 2018
 

One of the most exciting features from the newest release of Visual Data Mining and Machine Learning on SAS Viya is the ability to perform Market Basket Analysis on large amounts of transactional data. Market Basket Analysis allows companies to analyze large transactional files to identify significant relationships between items. While most commonly used by retailers, this technique can be used by any company that has transactional data.

For this example, we will be looking at customer supermarket purchases over the past month. Customer is the Transaction ID; Time is the time of purchase; and Product is the item purchased. The data must be transactional in nature and not aggregated, with one row for each product purchased by each customer.

Market Basket Analysis in SAS Viya

With our data ready, we can now perform the analysis using the MBANALYSIS Procedure. As illustrated below in SAS Studio, by specifying pctsupport=1, we will only look at items, or groups of items, that appear in at least 1% of the transactions. For very large datasets this saves time by only looking at combinations of items that appear frequently. This allows extraction of the most common and most useful relationships.
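The original post shows this step as a SAS Studio screenshot. As a rough sketch only, the call might look like the following; the table names are placeholders and the statement names are assumptions based on typical market basket syntax, so check the MBANALYSIS documentation for the exact form.

proc mbanalysis data=mycas.transactions pctsupport=1; /* input table name is a placeholder */
   customer Customer;            /* transaction ID (statement name is an assumption) */
   target Product;               /* item purchased (statement name is an assumption) */
   output out=mycas.mba_rules;   /* association rules with support, confidence and lift */
run;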

The MBANALYSIS procedure outputs a list of significant relationships, called Association Rules, by calculating the LIFT metric. A lift greater than one generally indicates that a Rule is significant. By default, each relationship has two items, although this can be changed to include multiple items.
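As a reminder of how lift is typically defined (this formula is not from the original post), for a rule A => B:

lift(A => B) = support(A and B) / ( support(A) * support(B) ) = confidence(A => B) / support(B)

so a lift above 1 means the two items appear together more often than would be expected if they were purchased independently.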

Below is a screenshot of the ten most important rules. The first item in the rule is the “Left Hand Side” and the second item after the arrow is the “Right Hand Side.” For the first rule, we can see that coke and ice cream appear together in 220 transactions and have a lift of 2.37, meaning purchasing Coke makes the purchase of ice cream about twice as likely.

Top 10 Association Rules

While Association Rules above give powerful insights into large transactional datasets, the challenge is exploring these rules visually. One way to do this is by linking the rules together via a Network Diagram. This allows users to see the relationships between multiple rules, and identify the most important items in the network. The following SQL code prepares the data for the Network Diagram.
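The SQL itself appears only as a screenshot in the original post. Below is a minimal sketch of what such a query might look like; the table name and the rule columns (ITEM1 for the left-hand side, ITEM2 for the right-hand side, LIFT) are assumptions based on the description in the next paragraph.

proc sql;
   create table work.network as
   /* one row per rule: left-hand side as the source, right-hand side as the target */
   select item1 as t1_item, item2, lift
      from mycas.mba_rules
   union
   /* one row per right-hand-side item with no target and no lift,
      so the item is still plotted as a node without a linkage */
   select distinct item2 as t1_item, '' as item2, . as lift
      from mycas.mba_rules;
quit;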

Network diagrams plot a set of "Source" values (T1_ITEM) and connect them to a "Target" value (ITEM2). If the source value represents the left-hand side of a rule, the corresponding right-hand side of the rule is listed as the target, and we use the "Lift" value to link the source and target variables. If the source value is itself a right-hand-side item, the target and the lift are left missing; this still plots the product as a node, but no linkage is made.

Now, my data is ready to be visualized as a Network Diagram. Using the following code, I am able to promote my Association Rules, making this dataset available via SAS® Visual Analytics.
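The promotion code also appears only as a screenshot in the original post. A minimal sketch using PROC CASUTIL, assuming the network table from the previous sketch and a target caslib named Public, might look like this:

proc casutil outcaslib="public";
   /* load the prepared network table into CAS and promote it so that
      SAS Visual Analytics reports can see it */
   load data=work.network casout="mba_network" promote;
quit;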

Now, I am able to quickly and easily generate my Network Diagram without having to create any code.

Hovering over a node allows me to see specific information about that particular item. Here, we can see that Heineken was purchased in 59.9% of all transactions, which is 600 transactions.

Hovering over the linkage, we can see specific information about the rule. Below, we can see that purchasing artichoke makes the purchase of Heineken about 38% more likely. Many other rules link to Heineken, showing its importance in the network. Business unit experts can use this diagram as a starting point for analyzing selling strategies and making the proper adjustments for the business.

Conclusion

The Market Basket Analysis procedure in Visual Data Mining and Machine Learning on SAS Viya can help retailers quickly scan large transactional files and identify key relationships. These relationships can then be visualized in a Network Diagram to quickly and easily find the important relationships in the network, not just a set of rules. As transactional data, whether in-store, online, or in any other form, gets bigger, this Market Basket functionality is a must-have weapon in the analytical toolkit of any business.

Visualizing the results of a Market Basket Analysis in SAS Viya was published on SAS Users.

January 6, 2018
 

Deep learning is not synonymous with artificial intelligence (AI) or even machine learning. Artificial Intelligence is a broad field which aims to "automate cognitive processes." Machine learning is a subfield of AI that aims to automatically develop programs (called models) purely from exposure to training data.

Deep Learning and AI

Deep learning is one of many branches of machine learning, where the models are long chains of geometric functions, applied one after the other to form stacks of layers. It is one among many approaches to machine learning but not on equal footing with the others.

What makes deep learning exceptional

Why is deep learning unequaled among machine learning techniques? Well, deep learning has achieved tremendous success in a wide range of tasks that have historically been extremely difficult for computers, especially in the areas of machine perception. This includes extracting useful information from images, videos, sound, and others.

Given sufficient training data (in particular, training data appropriately labelled by humans), it’s possible to extract from perceptual data almost anything that a human could extract. Large corporations and businesses are deriving value from deep learning by enabling human-level speech recognition, smart assistants, human-level image classification, vastly improved machine translation, and more. Google Now, Amazon Alexa, ad targeting used by Google, Baidu and Bing are all powered by deep learning. Think of superhuman Go playing and near-human-level autonomous driving.

In the summer of 2016, an experimental short movie, Sunspring, was directed using a script written by a long short-term memory (LSTM) algorithm, a type of deep learning algorithm.

How to build deep learning models

Given all this success recorded using deep learning, it's important to stress that building deep learning models is more of an art than a science. To build a deep learning model, or any machine learning model for that matter, one needs to consider the following steps:

  • Define the problem: What data does the organisation have? What are we trying to predict? Do we need to collect more data? How can we manually label the data? Make sure to work with a domain expert, because you can't interpret what you don't know!
  • Choose metrics that reliably measure success against the goals.
  • Prepare the validation process that will be used to evaluate the model.
  • Explore and pre-process the data: this is where most of the time will be spent, on normalization, manipulation, joining of multiple data sources and so on.
  • Develop an initial model that does better than a baseline model. This gives some indication of whether machine learning is viable for the problem.
  • Refine the model architecture by tuning hyperparameters and adding regularization. Make changes based on the validation data.
  • Avoid overfitting.
  • Once happy with the model, deploy it into the production environment. This may be difficult for many organisations, given that deep learning score code is large. This is where SAS can help: SAS has developed a scoring mechanism called the analytic store ("astore"), which allows a deep learning model to be pushed into production with just a click.

Is the deep learning hype justified?

We're still in the middle of the deep learning revolution, trying to understand the limitations of this class of algorithms. Due to its unprecedented successes, there has been a lot of hype in the field of deep learning and AI. It's important for managers, professionals, researchers and industrial decision makers to be able to separate the hype created by the media from reality.

Despite the progress on machine perception, we are still far from human-level AI. Our models can only perform local generalization, adapting to new situations that must be similar to past data, whereas human cognition is capable of extreme generalization, quickly adapting to radically novel situations and planning for long-term future situations. To make this concrete, imagine you've developed a deep network controlling a human body and you want it to learn to safely navigate a city without getting hit by cars. The net would have to die many thousands of times in various situations until it could infer that cars are dangerous and develop appropriate avoidance behaviors. Dropped into a new city, the net would have to relearn most of what it knows. Humans, on the other hand, are able to learn safe behaviors without having to die even once, again thanks to our power of abstract modeling of hypothetical situations.

Lastly, remember that deep learning is a long chain of geometric functions. To learn its parameters via gradient descent, one key technical requirement is that the chain must be differentiable and continuous, which is a significant constraint.
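To make that last point concrete (this illustration is not from the original article), a deep network computes a composition of parameterized functions, and training needs the gradient of the loss with respect to every parameter:

y_hat = f_L( ... f_2( f_1(x; w_1); w_2 ) ... ; w_L), with dLoss/dw_i obtained by the chain rule.

If any f_i is not differentiable, the chain rule breaks and gradient descent cannot propagate errors back through that layer.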

Looking beyond the AI and deep learning hype was published on SAS Users.

December 21, 2017
 

With SAS Data Management, you can set up SAS Data Remediation to manage and correct data issues. SAS Data Remediation allows user- or role-based access to data exceptions.

When a data issue is discovered, it can be sent automatically or manually to a remediation queue where it can be corrected by designated users. The issue can be fixed from within SAS Data Remediation without the need to go to the affected source system. For more efficiency, the remediation process can also be linked to a purpose-designed workflow.

Setting up a remediation process that allows you to correct data issues from within SAS Data Remediation involves a few steps:

  • Set up Data Management jobs to retrieve and correct data in remediation.
  • Set up a Workflow to control the remediation process.
  • Register the remediation service.

Set up Data Management jobs to retrieve and correct data in remediation

To correct data issues from within Data Remediation, we need two real-time data management jobs: one to retrieve data and one to send data. The retrieve job reads the record in question to populate its data in the remediation UI, and the send job writes the corrected data back to the data source (or to a staging area first).

Retrieve and send jobs

If the following remediation fields are available in the retrieve or send job’s External Data Provider node, data will be passed to the fields. The field values can be used to identify and work with the correct record:

REM_KEY (internal field that stores the issue record ID)
REM_USERNAME (the current remediation user)
REM_ITEM_NAME
REM_ISSUE_NAME
REM_APPLICATION
REM_SUBJECT_AREA
REM_PACKAGE_NAME
REM_STATUS
REM_ASSIGNEE

Retrieve Action

The "retrieve" action occurs when the issue is opened in SAS Data Remediation. Data Remediation will only pass REM_ values to the data management job if the fields are present in the External Data Provider node. Although the REM_ values are the only way the data management job can communicate with SAS Data Remediation, they are not all required; you only need to include the fields you need in the External Data Provider node.

The job's output fields will be displayed in the Remediation UI as edit fields for correcting the issue record. It's best to use a Field Layout node as the last job node to pass out the wanted fields with the desired labels.

Note: The retrieve job should only return one record.

A simple retrieve job, for example, would take the issue record ID passed in through REM_KEY and use it to select the record from the source system.

Send Action

The “send” action occurs when pressing the “Commit Changes” button in the Data Remediation UI. All REM_ values in addition to the output fields of the retrieve job (the issue record edit fields) are passed to the send job. The job will receive values for those fields present in the External Data Provider node.

The send job can now work with the remediation record and save it to a staging area or submit it to the data source directly.

Note: Only one row will be sent as an input to the send job. Any data values returned by the send job will be ignored by Data Remediation.

Move the jobs to the Data Management Server

When both jobs are written and tested, you need to move them to the Data Management Server, into a Real-Time Data Services subdirectory, so that Data Remediation can call them.

When Data Remediation calls the jobs, it uses the credentials of the user logged on to Data Remediation. Therefore, you need to make sure that the jobs on the Data Management Server have been granted the right permissions.

Set up a Workflow to control the remediation process

You don't need to involve a workflow in SAS Data Remediation, but using one can improve efficiency.

You can design your own workflow using SAS Workflow Studio, or you can use one of the prepared workflows that come with Data Remediation. You need to make sure that the desired workflow is loaded onto the Workflow Server in order to link it to the Data Remediation service.

Using SAS Workflow will help you to better control Data Remediation issues.

Register the remediation service

We can now register our remediation service in SAS Data Remediation. To do this, go to the Data Remediation Administrator and select "Add New Client Application."

Under Properties we supply an ID, which can be the name of the remediation service as long as it is unique, and a Display name, which is the name showing in the Remediation UI.

Next, we set up the edit UI for the issue record. Under Issue User Interface, we go to "Use default remediation UI … using Data Management Server":

The Server address is the fully qualified address for Data Management Server including the port it is listening on. For example: http://dmserver.company.com:21036.

The Real-time service to retrieve item attributes and Real-time service to send item attributes fields need to point to the retrieve and send jobs, respectively, on the Data Management Server, including the job suffix .ddf as well as any directories under Real-Time Data Services where the jobs are stored.

Under the tab Subject Area, we can register different subject categories for this remediation service.  When calling the remediation service we can categorize different remediation issues by setting different subject areas.

Under the tab Issues Types, we can register issue categories. This enables us to categorize the different remediation issues.

At Task Templates/Select Templates you can set the workflow to be used for a particular issue type.

By saving the remediation service, you make it available for use. You can now assign data issues to the remediation service to efficiently correct the data and improve your data quality from within SAS Data Remediation.

Manage remediation issues using SAS Data Management was published on SAS Users.

December 13, 2017
 

PROC FREQ is one of the most popular procedures in the SAS language. It is mostly used to describe the frequency distribution of a variable or a combination of variables in contingency tables. However, PROC FREQ has much more functionality than that; for an overview of all that it can do, see the introduction in the SAS documentation. SAS Viya does not have PROC FREQ, but that doesn't mean you can't take advantage of this procedure when working with BIG DATA. SAS 9.4m5 integration with SAS Viya allows you to summarize the data appropriately in CAS and then pass the resulting summaries to PROC FREQ to get any of the statistics that it is designed to produce, from your favorite SAS 9.4 coding client. In this article, you will see how easy it is to work with PROC FREQ in this manner.

These steps are necessary to accomplish this:

  1. Define the CAS environment you will be using from a SAS 9.4m5 interface.
  2. Identify variables and/or combination of variables that define the dimensionality of contingency tables.
  3. Load the table to memory that will need to be summarized.
  4. Summarize the data with a CAS-enabled procedure; use PROC FEDSQL for high-cardinality cases.
  5. Write the appropriate PROC FREQ syntax using the WEIGHT statement to input cell count data.

Step 1: Define the CAS environment

Before you start writing any code, make sure that there is an _authinfo file in the appropriate user directory on your system (for example, c:\users\<userid>) with the following information:

host <cas server name> port <number> user "<cas user id>" password "<cas user password>"
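For example, a single line like the following (all values are placeholders):

host mycas.example.com port 5570 user "casuser1" password "MyPassword1"

The port 5570 matches the CAS controller port used in the example program later in this post.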

This information can be found by running the following statement in the SAS Viya environment from SAS Studio:

cas;  
caslib _all_ assign;

 

Then, in your SAS 9.4m5 interface, run the following program to define the CAS environment that you will be working on:

options cashost="<cas server name>" casport=<port number>;
cas mycas user=<cas user id>;
libname mycas cas;
/** Set the in-memory shared library if you will be using any tables already promoted in CAS **/
libname public cas caslib=public;

 

PROC FREQ FOR BIG DATA

Figure 1

Figure 1 shows the log and libraries after connecting to the CAS environment that will be used to deal with Big Data summarizations.

Step 2: Identify variables

The variables here are those that will be used in the TABLE statement in PROC FREQ. Also, any numeric variable that is not part of the TABLE statement can be used to determine the input cell count.

Step 3: Load the table to memory

There are two options for doing this. The first is to load the table into the PUBLIC library in the SAS Viya environment directly. The second is to load it from your SAS 9.4m5 environment into CAS. Loading data into CAS can be as easy as writing a DATA step (see the sketch below) or using other, more efficient methods depending on where the data resides and its size; that is outside the scope of this article.
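As a minimal sketch of the DATA step route (table names are placeholders), writing to the CAS libref defined in Step 1 is enough to load a SAS 9 table into CAS:

data mycas.mytable;   /* mycas is the CAS libref defined in Step 1 */
   set work.mytable;  /* any SAS 9 source table */
run;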

Step 4: Summarize the data with a CAS-enabled procedure

Summarizing data for many cross tabulations can become computationally expensive, especially with Big Data. SAS Viya offers several ways to accomplish this (for example, PROC MEANS, PROC MDSUMMARY, PROC FEDSQL). There are ways to optimize performance for high-cardinality summarization when working with CAS. PROC FEDSQL is my personal favorite since it has shown good performance.

Step 5: Write the PROC FREQ

When writing the PROC FREQ syntax, make sure to use the WEIGHT statement to instruct the procedure to use the appropriate cell counts in the contingency tables. All features and functionality of PROC FREQ are available here, so use them to the max! The DATA option can point to the CAS library where your summarized table is, so there will be some overhead for data transfer to the SAS 9 work server. But most of the time the summarized table is small enough that the transfer does not take long. An alternative to letting PROC FREQ do the data transfer is to run a DATA step that brings the data from CAS into a SAS base library in SAS 9, and then run PROC FREQ with that table as the input.
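A minimal sketch of that alternative, reusing the table and variable names from the example program shown below (the libref mycas is assumed to resolve to the caslib where the summary table was created):

/* copy the summarized CAS table into the SAS 9 WORK library */
data work.sql_sum;
   set mycas.sql_sum;
run;

/* run PROC FREQ locally against the copied summary */
proc freq data=work.sql_sum;
   weight freq;
   table date*target;
run;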

Example

Figure 2 below shows a sample program on how to take advantage of CAS from SAS 9.4m5 to produce different analyses using PROC FREQ.

PROC FREQ for big data

Figure 2

 

Figure 3 shows the log of the summarization in CAS for a very large table in 1:05.97 minutes (over 571 million records with a dimensionality of 163,964 for all possible combinations of the values of feature, date and target).  The PROC FREQ shows three different ways to use the TABLE statement to produce the desired output from the summarized table which took only 1.22 seconds in the SAS 9.4m5 environment.

PROC FREQ for big data

Figure 3

Program

 

/** Connect to Viya Environment **/
options cashost="xxxxxxxxxxx.xxx.xxx.com" casport=5570;
cas mycas user=calara;
libname mycas cas datalimit=all;
 
/** Set the in memory shared library **/
libname public cas caslib=public datalimit=all;
 
/** Define macro variables **/
/* CAS Session where the FEDSQL will be performed */
%let casref=mycas;
 
/* Name of the output table of the frequency counts */
%let outsql=sql_sum;
 
/* Variables for cross classification cell counts */
%let dimvars=date, feature, target;
 
/* Variable for which frequencies are needed */
%let cntvar=visits;
 
/* Source table */
%let intble=public.clicks_summary;
 
proc FEDSQL sessref=&casref.;
	create table &outsql. as select &dimvars., count(&cntvar.) as freq 
		from &intble. group by &dimvars.;
	quit;
run;
 
proc freq data=&casref..&outsql.;
	weight freq;
	table date;
	table date*target;
	table feature*date / expected cellchi2 norow nocol chisq noprint;
	output out=ChiSqData n nmiss pchi lrchi;
run;

PROC FREQ for Big Data was published on SAS Users.

December 5, 2017
 

With SAS Data Management, you can set up SAS Data Remediation to manage and correct data issues. SAS Data Remediation allows user- or role-based access to data exceptions.

Last time I talked about how to register and use a Data Remediation service. This time we will look at how to use SAS Workflow together with SAS Data Remediation. SAS Workflow comes as part of SAS Data Remediation, and you can use it to control the flow of a remediation issue to ensure that the right people are involved at the various steps.

SAS Workflow is very powerful and offers a lot of features to process data. You can, for example, design workflows to:

  • Process tasks (workflow steps) in parallel.
  • Follow different routes through a workflow depending on certain conditions.
  • Set a timer to automatically route to a different user if an assigned user isn’t available for a certain time.
  • Call Web Services (REST/SOAP) to perform some work automatically.
  • Send email notifications if a certain workflow step is reached.
    …and more.

Invoke an external URL via a workflow

In this article, we are going to look at a workflow feature which is specific to SAS Data Remediation.

With the latest version of Data Remediation there is a new feature that allows you to put a URL behind a workflow button and invoke it by clicking on the button in the Remediation UI. In the following we are going to take a closer look at this new feature.

When you have registered a Data Remediation service, you can design a workflow using SAS Workflow Studio and link it to the Remediation service. Workflow Studio comes with some sample workflows. It's a good idea to take one of these sample workflows as a starting point and add additional steps to it as desired.

Workflow Task

In a workflow, you have tasks. A task is a single step of a workflow and is associated with data objects, statuses, policies and participants.

  • Data Objects are fields that store values.
  • A Status transitions from one task to the next.
  • Policies are rules that will be executed at a certain task event, for example calling a Web Service at the beginning of a task or sending an email at the end of a task.
  • Participants are users or groups who can execute a task; i.e., if you are not a user assigned to a task, you can't open an issue in Data Remediation to work on it.

When you add a new task to a workflow you must connect it to another task using a status. In the Data Remediation UI, a status will show up as a Workflow Button to transition from one task to another.

Assigning a URL to Status

You can also use a status on a task to call a URL instead of transitioning to the next task. To do this, you add a status to the task but don't use it to connect to another task.

In the example, the task Open has four statuses assigned, but only Associate, Cancel and Reject connect to other tasks. The status Review is not connected, so it can be used to call a URL.

Right-click on the status Review and choose Edit to open a dialog box with an Attributes… button. Here, you need to add a key attribute with the name URL. The value of URL points to the fully qualified URL to be called:

http://MyServer/showAddtionalInformation/?recordid={Field1}

The URL can take parameters, in curly brackets (i.e. {Field1}), pointing to the task’s data objects. When calling the URL, the parameters will be substituted with the appropriate data object value. This way the URL call can be dynamic.

Dynamic URL calls via parameters

When you create a Data Remediation issue, you can pass up to three values from the issue record to Data Remediation, for example the issue record key. When the workflow is called by Data Remediation, these values are copied from Data Remediation to the workflow's top-level fields: Field1, Field2 and Field3.

To use these values in the workflow, to pass them to a URL, you need to create an appropriate data object (Field1) at the task where you want to call a URL. You also need to add a policy to the task to copy the data object value from the top-level data object to the task data object.

This will make the value available at the task and you can use it as a parameter in the URL call as shown above.

Link workflow to Data Remediation Service

When you have finished the workflow, you can save it and upload it to the workflow server.

Once it is uploaded, you can link the workflow to the Remediation service. In Data Remediation Administration, open the remediation service and go to Issue Types. Workflows are linked to issue types; they are not compulsory, but you can link one or more workflows to an issue type, depending on your business requirements.

At Issue Types, under Templates, select your workflow and link it to an issue type. You can make workflows mandatory for certain issue types by ticking the check box: “Always require a task template…” In this case Data Remediation expects one of the nominated workflows to be used when creating a remediation issue for the appropriate issue type.

Creating a remediation issue

You can now create a remediation issue for the appropriate issue type and assign a workflow to it by submitting the workflow name via the field "workflowName" in the JSON structure when creating the remediation issue. See "Improving data quality through SAS Data Remediation" for more about creating Data Remediation issues.

Call a URL via workflow button

When the remediation issue is created, you can open it in SAS Data Remediation, where you will see the Review button. When you click the Review button, the URL that you assigned to the URL key attribute for the status Review is called.

By using workflows with SAS Data Remediation, you can better control the process of addressing data issues. Being able to put a URL behind a workflow button and invoke it will enhance your capabilities for solving the issues and improving data quality.

Process and control Data Remediation issues using SAS Workflow was published on SAS Users.

November 29, 2017
 

The CAS procedure (PROC CAS) enables us to interact with SAS Cloud Analytic Services (CAS) from the SAS client based on the CASL (the scripting language of CAS) specification. CASL supports a variety of data types including INT, DOUBLE, STRING, TABLE, LIST, BLOB, and others. The result of a CAS action could be any of the data types mentioned above. In this article, we will use the CAS procedure to explore techniques to interact with the results produced from CAS actions.

PROC CAS enables us to run CAS actions in SAS Viya; CAS actions are at the heart of how the CAS server receives requests, performs the relevant tasks and returns results back to the user. CAS actions representing common functionality constitute CAS action sets.  Consider the CAS action set TABLE. Within the TABLE action set, we will currently find 24 different actions relevant to various tasks that can be performed on a table. For example, the COLUMNINFO action within the TABLE action set will inform the user of the contents of a CAS in-memory table, including column names, lengths, data types, and so on. Let’s take a look at the following code to understand how this works:

proc cas;
	table.columnInfo  / table='HMEQ';
	simple.summary    / table={name='HMEQ' groupby='BAD'};
run;

In the code above, the table called ‘HMEQ’ is a distributed in-memory CAS table that contains data about applicants who were granted credit for a certain home equity loan. The categorical binary-valued variable ‘BAD’ identifies a client who has either defaulted or repaid their home equity loan. Since PROC CAS is an interactive procedure, we can invoke multiple statements within a single run block. In the code above we have executed the COLUMNINFO action from the TABLE actionset and the SUMMARY action from the SIMPLE action set within the same run block. Notice that we are able to obtain a summary statistic of all the interval variables in the dataset grouped by the binary variable ‘BAD’ without having to first sort the data by that variable. Below is a snapshot of the resulting output.

PROC CAS

Fig1: Output from table.columninfo

Fig 2: Output from executing the summary actionset with a BY group option

Let’s dig into this a little deeper by considering the following statements:

proc cas;
simple.summary result=S /table={name='hmeq'};
describe S;
run;

In the code snippet above, the result of the summary action is returned to the user in a variable ‘S’ which is a dictionary. How do we know? To find that, we have invoked the DESCRIBE statement to help us understand exactly what the form of this result, S, is. The DESCRIBE statement is used to display the data type of a variable. Let’s take a look at the log file:

The log file informs the user that ‘S’ is a dictionary with one entry of type table named “Summary.” In the above example the column names and the attributes are shown on the log file. If we want to print the entire summary table or a subset, we would use the code below:

proc cas;
simple.summary result=S /table={name='hmeq'};
print s["Summary"];
print s["summary", 3:5];
run;

The first PRINT statement will fetch the entire summary table; the second PRINT statement will fetch rows 3 through 5. We can also use WHERE expression processing to create a new table with rows that match the WHERE expression. The output of the second PRINT statement above is shown in the figure below:

The result of an action could also be more complex in nature; it could be a dictionary containing dictionaries, arrays, and lists, or the result could be a list containing lists and arrays and tables etc. Therefore, it is important to understand these concepts through some simple cases. Let’s consider another example where the result is slightly more complex.

proc cas;
simple.summary result=s /table={name='hmeq', groupby='bad'};
describe s;
print s["ByGroup1.Summary", 3:5]; run;

In the example above, we are executing a summary using a BY group on a binary variable. The log shows that the result in this case is a dictionary with three entries, all of which are tables. Below is a snapshot of the log file as well as the output of PRINT statement looking at the summary for the first BY group for row 3 through 5.

If we are interested in saving the output of the summary action as a SAS data set (sas7bdat file), we will execute the SAVERESULT statement. The code below saves the summary statistics of the first BY group in the work library as a SAS dataset.

proc cas;
simple.summary result=s /table={name='hmeq', groupby='bad'};
describe s;
print s["ByGroup1.Summary", 3:5]; 
saveresult s["ByGroup1.Summary"] dataout=work.data;
run;

A helpful function we can often use is findtable. This function will search the given value for the first table it finds.

proc cas;
simple.summary result=s /table={name='hmeq'};
val = findtable(s);
saveresult val dataout=work.summary;
run;

In the example above, I used findtable to locate the first table produced within the dictionary, S, and save it under ‘Val,’ which is then directly invoked with SAVERESULT statement to save the output as a SAS dataset.

Finally, we also have the option to save the output of a CAS action in another CAS Table. The table summary1 in the code below is an in-memory CAS table that contains the output of the summary action. The code and output are shown below:

proc cas;
simple.summary /table={name='hmeq'} casout={name='mylib.summary1'}; 
fetch /table={name='mylib.summary1'};
run;

In this post, we saw several ways of interacting with the results of a CAS action using the CAS procedure. Depending on what our end goal is, we can use any of these options to either view the results or save the data for further processing.

Interacting with the results of PROC CAS was published on SAS Users.