
December 02, 2016
 

Nearly every organization has to deal with big data, and that often means dealing with big data problems. For some organizations, especially government agencies, addressing these problems provides more than a competitive advantage: it helps them ensure public confidence in their work or meet standards mandated by law. In this blog I want to share how SAS worked with a government revenue collection agency to successfully manage their big data issues and seamlessly integrate with Hadoop and other technologies.

Hadoop Security

We all know Hadoop pretty well, and if you haven’t heard of Hadoop yet, it is about time you invest some resources to learn more about this emerging de facto standard for storage and compute. The core of Apache Hadoop consists of a storage part, known as HDFS (Hadoop Distributed File System), and a processing part (called MapReduce). Hadoop splits large files into blocks and distributes them across the nodes of a cluster.

Hadoop was initially developed to solve web-scale problems like webpage search and indexing at Yahoo. However, the potential of the platform to handle big data and analytics caught the attention of a number of industries. Since the initial use of Hadoop was to count webpages and implement algorithms like PageRank, security was never considered a major requirement, until the platform started being used by enterprises across the world.

Security incidents and massive fines have become commonplace, and financial institutions, in particular, are doing everything they can to avoid such incidents. Security should never be an afterthought; it should be considered in the initial design of the system. The five core pillars of Enterprise Security are as follows:

sas-integration-with-hadoop05

Our customer had four of the core pillars, from Administration to Auditing, covered using tools provided by their Hadoop vendor. While there are options in the open-source community that provide data protection, in this case the organization decided to use a data security company to protect data at rest on top of Cloudera Navigator Encryption. They refer to it as “Double Encryption.”

The challenge

SAS has multiple products around the Hadoop ecosystem to provide the best support for customers. The traditional way of working with Hadoop is through SAS/ACCESS, which typically pulls data out of Hadoop using Hive. However, for larger installations, where data movement is a concern, SAS provides the Embedded Process technology, which allows you to push SAS code into a Hadoop cluster and run it alongside the data blocks. This is a super-efficient way to access large data sets inside Hadoop, because it pushes the compute to the data.

Our customer's data security vendor’s product supports access via Hive UDFs, which means you can tokenize and detokenize data when working with SAS/ACCESS Interface to Hadoop using PROC SQL and other options, essentially out of the box. In addition, new SAS language functions for tokenizing and detokenizing data inside Base SAS can be created by wrapping the security company’s API with PROC FCMP and PROC PROTO.
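To make that first option concrete, here is a minimal sketch of what such a query could look like with SAS/ACCESS Interface to Hadoop and explicit SQL pass-through. The UDF name, connection options and table names below are hypothetical placeholders, not the vendor’s actual API:

proc sql;
   /* Hypothetical connection values */
   connect to hadoop (server="hdpcluster" port=10000 schema=secure);
   create table work.clear as
   select * from connection to hadoop
      ( select rec_id,
               detokenize(cust_name) as cust_name   /* hypothetical vendor UDF */
        from   customers );
   disconnect from hadoop;
quit;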

However, the SAS Embedded Process has no default support for our customer's security vendor, so the SAS products that rely on the SAS EP (SAS Code Accelerator, SAS Scoring Accelerator and the LASR-based products) cannot work with data tokenized by the vendor. This was a major challenge for our customer, who wanted to use SAS products like SAS Visual Analytics and SAS Visual Statistics on large volumes of data stored in Hadoop.

The challenge, then, was to make the SAS Embedded Process work with the data security vendor’s software to perform detokenization before passing the data to SAS procedures.

The possible solutions

We considered various solutions before agreeing on one that satisfied all current requirements and could be extended to meet the future needs of our customer. Let’s discuss the top two solutions and the final implementation.

Solution 1: SERDE approach

Our first approach was to create a custom Hive SERDE that wraps the data security company’s APIs. With 9.4M3, the SAS Embedded Process (EP) can read and write via SERDE APIs, with some possible constraints and limitations, including DS2’s SET/MERGE capabilities and the passing of identity credentials from SAS to the company’s APIs.

sas-integration-with-hadoop02

The approach had various drawbacks, but the biggest was working with the many file formats. This approach was discarded because it would have meant significant rework with every new data format released by the Hadoop community. While it is true that an organization would generally standardize on a few formats for its use cases, it is nonetheless a limiting factor.

Solution 2: Use HDMD with Custom Input Formats

The second approach was to use HDMD with custom input formats. SAS HDMD supports custom input formats, allowing you to plug in your own reader. A high-level architectural diagram looks something like Figure 2. This approach works with a variety of file formats; we have tested it with Parquet, Avro and ORC with good results. The objective is to load a data set onto Hadoop (or use an existing data set) and generate an HDMD file for it. We plug our custom reader into the HDMD file, and as part of the custom reader we make a number of calls to the data security company’s API. The API calls the vendor's protect and unprotect procedures to protect or unprotect the data, depending on the requirements, and passes the results back to the client.

sas-integration-with-hadoop03

What is an input format, and what is a custom input format?

Data inside Hadoop is typically stored on HDFS (Hadoop Distributed File System). The data needs to be read from the file system before being processed. This is achieved using an input format, which has the following responsibilities.

  • Compute input splits
    • Input splits represent the part of the data that will be processed by each Map phase. A unique input split is passed to each process. At the start of a MapReduce job, the input format splits the data into multiple parts based on logical record boundaries and HDFS block size. To get the input splits, the following method is called:
      • List<InputSplit> getSplits(JobContext ctx)
  • Provide the logic to read the input split
    • Each mapper gets a unique input split to process. The input format provides the logic to read the split, which is an implementation of the RecordReader interface. The record reader reads the split and emits <key,value> pairs as input for each map function. The record reader is created using the following method:
      • RecordReader<K,V> createRecordReader(InputSplit is, TaskAttemptContext ctx)

All the common formats will provide a way to split the data and read records. However, if you want to read a custom data set for which data parsing isn’t available out of the box with Hadoop, you are better off writing a custom input format.

How to write a Custom Input Format?

Writing a custom input format requires Java skills (Java being the language in which Hadoop is written). You can either implement the abstract methods of the InputFormat class or extend one of the pre-existing input formats. In our case, we extended FileInputFormat and overrode a few critical methods:

  • getSplits()
  • getRecordReader()

The getSplits() method creates the splits from the input data, while getRecordReader() returns an instance of a Java object that knows how to read custom records, which in our case meant calling the security vendor’s API.

You can use one of the predefined RecordReader classes or implement your own (most likely, if you are writing a custom input format). In our case, we implemented the RecordReader interface and its next() method, which is called whenever a new record is needed. This is the method where your core business logic is implemented. In our case, we wrote the integration logic by looking at the data, identifying the logged-in user (available as part of the JobConf object), and then calling the vendor’s APIs to decrypt the data. (Note that getRecordReader() and next() belong to the older org.apache.hadoop.mapred API; the newer mapreduce API uses createRecordReader() and nextKeyValue() instead.) Sample code can be requested by contacting me directly.

Integrating a custom input format with SAS

Integrating a custom input format is fairly easy with SAS. SAS allows us to plug in custom input formats, which are called before the data is processed via the SAS Embedded Process using HDMD files.

When you generate an HDMD file using PROC HDMD, you can specify your custom input format as part of the generated XML file. Please refer to the PROC HDMD documentation for details.
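As a rough sketch, the registration could look like the following; the libref options, paths and the input format class name are hypothetical placeholders, and the exact options your release supports are listed in the PROC HDMD documentation:

libname hdp hadoop server="hdpcluster" user=myuser
        hdfs_metadir='/user/myuser/meta' hdfs_tempdir='/tmp/sasep';

proc hdmd name=hdp.cust_data
          format=custom
          input_format='com.example.SecureInputFormat'  /* hypothetical custom reader */
          data_file='cust_data.dat';
   column rec_id    int;
   column cust_name varchar(100);
run;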

The generated HDMD file would look something like this.

sas-integration-with-hadoop04

When loading the data from HDFS, SAS will ensure that the specified input format is called prior to any data processing taking place.

The ultimate solution

The solution was demonstrated using data from the tax authorities and included tokenization of data via Hive UDFs, detokenization of data according to the policies set on the data security appliance, and analytics using SAS Visual Analytics. Only users with permissions on the specific policy were able to view the data in the clear, while users without permissions saw only tokenized values. This additional security helped the enterprise protect users’ information from inadvertent access and resulted in widespread use of big data technologies within the enterprise.

Summary

As you can see from the example above, SAS is open for business, and is already providing deep integration with Hadoop and other technologies using custom APIs. The sky is the limit for people willing to explore the capabilities of SAS.

tags: Global Technology Practice, Hadoop, SAS/ACCESS Interface to Hadoop

SAS integration with Hadoop - one success story was published on SAS Users.

December 01, 2016
 

Data virtualization is an agile way to provide virtual views of data from multiple sources without moving the data. Think of data virtualization as another arrow in your quiver in terms of how you approach combining data from different sources to augment your existing Extract, Transform and Load (ETL) batch processes. SAS® Federation Server is a unique data virtualization offering that provides not only blending of data, but also on-demand data masking, encryption and cleansing of the data. It provides a central, virtual environment for administering and securing access to your Personally Identifiable Information (PII) and other data.

Data privacy is a major concern for organizations, and one of the features of SAS Federation Server is that it allows you to effectively and efficiently control access to your data, so you can limit who is able to view sensitive data such as credit card numbers, personal identification numbers, names, etc. In this three-part blog series, I will explore the topic of controlling data access using SAS Federation Server. The series will cover the following topics:

Part 1: Securing sensitive data using SAS Federation Server at the data source level
Part 2: Securing sensitive data using SAS Federation Server at the row and column level
Part 3: Securing sensitive data using SAS Federation Server data masking

SAS Metadata Server is used to perform authentication for users and groups in SAS Federation Server, and SAS Federation Server Manager is used to help control access to the data. In this blog, I want to explore controlling data access to specific sources of data using SAS Federation Server. Obviously, you can secure data at its source by using secured metadata-bound libraries in SAS Metadata Server or by using a database’s or file’s own security mechanisms. However, SAS Federation Server can also be used to control access to these data sources by authenticating the users and groups defined in SAS Management Console and setting authorizations within SAS Federation Server Manager.

In order to show how SAS Federation Server can be used to control access to data, I will explore an example where Finance Users in our fictitious company SHOULD have access to the Salary data in a SAS data set, but our Business Users should NOT. Instead, our Business Users should have access to all other BASE tables with the exception of SALARY. In my scenario, Kate is a Finance User, and David and Sally are Business Users. These users have already been set up as such in SAS Metadata Server.

The SAS Federation Server Administrator has setup the BASE catalog and schema information in Federation Server Manager. The SALARY table is located in the Employee_Info schema within the Global catalog.

securing-sensitive-data-using-sas-federation-server01

The SAS Federation Server Administrator has also explicitly granted the CONNECT and SELECT permissions to both the Business Users and Finance Users group for the BASE Data Service.

securing-sensitive-data-using-sas-federation-server02

securing-sensitive-data-using-sas-federation-server03

This gives both groups permission to connect to and select information from the items within this Data Service. The information is inherited by all children of the Data Service – Data Source Names, Catalogs, Schemas, Tables and Views.  For example, note that the Business Users group has inherited the Grant setting for the CONNECT permission on the BASE Data Source Name (DSN) and the SELECT permission on the EMPLOYEES table.  Permission inheritance is denoted by a diamond symbol.

securing-sensitive-data-using-sas-federation-server04

securing-sensitive-data-using-sas-federation-server05

For the SALARY table, the SAS Federation Server Administrator has explicitly denied the SELECT permission for the Business Users group, whereas the Finance Users group has inherited the Grant setting for the SELECT permission on the SALARY table.

securing-sensitive-data-using-sas-federation-server06

securing-sensitive-data-using-sas-federation-server07

Kate, who is a member of the Finance Users group, has permission to select records from the SALARY table.

securing-sensitive-data-using-sas-federation-server08

Note: The user does not need to know the physical location of where the SAS data resides. They simply refer to the Federation Server Data Source Name, which in this case is BASE.
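To illustrate what Kate’s query might look like from a SAS session, here is a minimal sketch using explicit pass-through to the federation layer via SAS/ACCESS Interface to SAS Federation Server; the server name, port and credentials are made-up placeholders:

proc sql;
   /* Hypothetical connection values; DSN refers to the BASE data source name */
   connect to fedsvr (server="fedsrv.example.com" port=24141
                      user=kate password="XXXXXX" dsn=BASE);
   select * from connection to fedsvr
      ( select * from "Global"."Employee_Info"."SALARY" );  /* catalog.schema.table */
   disconnect from fedsvr;
quit;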

By denying the Business Users group the SELECT permission on the SALARY table, David, who is a member of the Business Users group, does NOT have access to select records from this table. He is denied access.

securing-sensitive-data-using-sas-federation-server09

However, David still has access to the EMPLOYEES table since the Business Users group inherited the SELECT permission for that table.

securing-sensitive-data-using-sas-federation-server10

If I want to prevent David from accessing any of the tables or views in the Employee_Info schema, but still allow other Business Users to access them, then as the SAS Federation Server Administrator I can explicitly deny the user, David, the SELECT permission for the Employee_Info schema as shown below.

securing-sensitive-data-using-sas-federation-server11

Now, David inherits the Deny setting for the SELECT permission for all tables and views within that schema and he will no longer be able to access the EMPLOYEES table.

securing-sensitive-data-using-sas-federation-server12

However, Sally, another member of the Business Users group, is still able to access the EMPLOYEES table.

securing-sensitive-data-using-sas-federation-server13

In this blog entry, I covered the first part of this series on controlling data access with SAS Federation Server 4.2:

Part 1: Securing sensitive data using SAS Federation Server at the data source level
Part 2: Securing sensitive data using SAS Federation Server at the row and column level
Part 3: Securing sensitive data using SAS Federation Server data masking

I’ll be posting Part 2 of this series soon. Keep an eye out for it.

For more information on SAS Federation Server:

tags: SAS Federation Server, SAS Professional Services, Securing data

Securing sensitive data using SAS Federation Server at the data source level was published on SAS Users.

November 29, 2016
 

Presenting at SAS Global Forum 2017 est plus près de la maison, está más cerca de casa, está mais perto de casa, dichter bij huis, is closer to home, eh!

In analytics and statistics, we often talk about sample sizes. The size of the data sets that you analyze is a measure of the amount of information contained within those data. When observations are very similar or correlated due to study design, the information added by having multiple (correlated) observations may be negligible. This is a common problem with clustered data; the information contained in clustered data is closer to the number of clusters than to the number of observations. As a result, study designers seek to measure many clusters.
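To put a number on that intuition, a standard result from survey statistics (not specific to the conference) is the design effect: with clusters of size m and intra-cluster correlation rho, n observations carry an effective sample size of

    n_eff = n / (1 + (m - 1) * rho)

so as rho approaches 1, n_eff collapses toward the number of clusters rather than the number of observations.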

When it comes to global presenters, SAS Global Forum is seeking more clusters.

Global representation at SAS Global Forum enriches the conference experience for all attendees, providing each of us with more innovation and information to advance the goals of our organizations.

However, we know that attending our conference from the far corners of the globe is expensive … but not as expensive as it used to be! We’ve got good news for SAS users who reside outside the contiguous 48 states of the United States (residents of Alaska, Hawaii, and U.S. territories, read this carefully!).

To ease the financial burden of travelling from afar to the conference, two new policies have been adopted by the SAS Global Users Group – largely in response to your concerns about cost.

Doubled discount for accepted contributed sessions

Each year, SAS Global Forum attracts about 700 proposed sessions from the user community. The review process is competitive as we can only accept 400 session talks. To attract even more submissions from around the globe, we’ve raised the registration discount from 25% to 50% for accepted proposals from the international user community. If you reside outside the 48 contiguous States, and your abstract is approved, you will automatically receive the 50% discount when you register.

As of the writing of this blog, SAS Global Forum 2017 will include four sessions from Africa, nine from Australia, 18 from Asia, 12 from South America and the Caribbean, 37 from Canada, 21 from Europe, and 23 from the United Kingdom. With this new policy, we expect far more in 2018 and beyond!

International Professional Awards

Forty SAS users will be selected from submitted applications to have their registration fee waived. SAS users who reside outside the contiguous 48 states can apply by completing the application found on the conference website. In this application, you will be asked to describe your SAS experience, barriers to attendance, and your commitment to attend. Submitters of contributed content are eligible.

We are certain these changes will help make SAS Global Forum the most diverse, international conference yet! I look forward to meeting many SAS users from near and far in Orlando. See you there! Or should I say Wir sehen uns dort! Ci vediamo lì! Nähdään siellä!

tags: International Professional Awards, papers & presentations, SAS Global Forum

SAS Global Forum 2017 is closer to home, or should I say… was published on SAS Users.

November 25, 2016
 

Using the DATASETS procedure, we can easily modify SAS variable attributes such as name, format, informat and label:

proc datasets library=libref;
  modify table_name;
    format var_name date9.;
    informat var_name mmddyy10.;
    label var_name = 'New label';
    rename var_name = var_new_name;
run;
quit;

We cannot, however, modify fixed variable attributes such as variable type and length.

Notice that we apply variable attribute modifications to one table at a time; that table name is specified in the MODIFY statement of PROC DATASETS. There might be multiple MODIFY statements within a single PROC DATASETS step to apply modifications to multiple data tables.

But what if we need to change variable attributes in all tables of a data library where that variable is present? Say, there is a data model change that needs to be applied consistently to all the data tables in that model.

Business case

Let’s consider a business case for a SAS programmer working for a pharmaceutical company or a contract research organization (CRO) managing data for a clinical trials project. Imagine that you need to apply some variable attributes modifications to all SDTM (Study Data Tabulation Model) and/or ADaM (Analysis Data Model) datasets to bring them in compliance with the CDISC (Clinical Data Interchange Standards Consortium) requirements. Obviously, “manually” going through dozens or hundreds of data tables in different domains to identify whether or not they contain your variable of interest is not an efficient solution.

Better data management practice

This variable attributes update process can be fully automated using the Data-driven SAS macro loops approach as described in my earlier blog post. First, we need to programmatically loop through all the tables in a specified library to find the tables that contain the variable to be modified. Then, for each iteration of that data loop we use SAS macro looping to generate multiple modify statements in proc datasets with subsequent sub-statements (format, informat, label, rename).

SASHELP library

SAS provides a multitude of datasets and views in the SASHELP library. While many of them can be used in coding examples, demos, and testing, there are some that provide invaluable information about your SAS environment. Take a look at the views in the SASHELP library whose names start with the letter “V”: Vcolumn, Vlibname, Vmacro, Voption, Vtable, Vtitle, and others.

These SAS views contain a wealth of information about your SAS environment, and reflect any new changes to your SAS environment dynamically. For example, if you run the following code to create a macro variable:

%let aaa = a;

and then open the Vmacro view, you will see that macro variable listed right there at the very top.

For our purpose of looping through all the tables in a library containing a particular variable, we are going to use the Vcolumn view, as it has all the information we need – libname (Library Name), memname (Member Name), name (Column Name):

Modifying variable attributes in a SAS library

Code implementation

Having that SASHELP.VCOLUMN table view available we can easily implement our code for modifying variable attributes in all the tables of a data library. Here is the SAS code example:

libname sdtm 'c:';

data sdtm.a;
  x=1; label x='Old x-label';
run;

data sdtm.b;
  y=1; label y='Old y-label'; format y mmddyy8.;
run;

data sdtm.c;
  y=2; label y='Old y-label'; format y date9.;
run;

%macro change_var_attr (
  lib=,           /* libref,               required parameter */
  var=,           /* variable name,        required parameter */
  new_label='_',  /* new label, in quotes, optional parameter */
  new_format=_,   /* new format,           optional parameter */
  new_informat=_, /* new informat,         optional parameter */
  new_name=_      /* new variable name,    optional parameter */
  );

  /* get all tables in a library &lib containing variable &var */
  data target_tables (keep=memname);
    set sashelp.vcolumn (keep=libname memname name);
    where (upcase(libname) eq upcase("&lib")) and (upcase(name) eq upcase("&var"));
  run;

  proc datasets library=&lib;
    %let dsid = %sysfunc(open(target_tables));
    %syscall set(dsid); /* no leading ampersand for dsid with %syscall */ 
    %do %while(%sysfunc(fetch(&dsid)) eq 0);
      modify &memname;
      %if &new_label ne '_'  %then %str(label    &var = &new_label;);
      %if &new_format ne _   %then %str(format   &var &new_format;);
      %if &new_informat ne _ %then %str(informat &var &new_informat;);
      %if &new_name ne _     %then %str(rename   &var = &new_name;);
    %end;
    %let rc = %sysfunc(close(&dsid));
  run;
  quit;

%mend change_var_attr;

%change_var_attr(lib=sdtm, var=y, new_label='New label, commas allowed',
                 new_format=yymmddd10., new_name=z)

The centerpiece of the code is the change_var_attr macro. The macro has two required parameters, lib and var, and four optional parameters, new_label, new_format, new_informat, and new_name.

To get all the tables in a library &lib containing the variable &var, we query and subset the SAS view SASHELP.VCOLUMN and create a target_tables SAS dataset.

Then in PROC DATASETS, we macro-loop through the target_tables and generate as many modify statements as there are observations in the target_tables dataset, one for each table to be modified. Each modify statement is followed by one or several optional statements – label, format, informat, and rename.

The %syscall set(dsid); statement invokes the SET call routine, which automatically sets the value of the macro variable memname whenever an observation containing the memname variable is fetched (read) from the target_tables dataset.

There may be other programming techniques to solve the problem of modifying variable attributes across all data sets of a data library. You are welcome to suggest your solution in the comments section below.

Bonus

For those who were patient enough to read to the end… If you need to apply modifications to a variable in tables across multiple data libraries, you can still do it with a single invocation of the change_var_attr macro. In order to do that, you would need to define a new data library logically concatenating all the relevant libraries. For example, the following libname statement assigns the libref combo as a logical concatenation of the sdtm and adam libraries:

libname combo (sdtm adam);
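A single call against the combined libref then updates the variable across both libraries at once, for example:

%change_var_attr(lib=combo, var=x, new_label='New x-label across libraries')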

 
tags: modifying variables, SAS Programmers

Modifying variable attributes in all datasets of a SAS library was published on SAS Users.

November 18, 2016
 

ProblemSolvers

With fall comes cooler weather and, of course, football. Lots of football. Oftentimes there will be two NFL games on that my husband wants to watch at the same time. Instead of flipping back and forth between two television stations, he can watch both games simultaneously, thanks to the picture-in-picture feature that we have on our television. This same concept works for SAS® ODS Graphics.

Have you ever been viewing two graphs across pages, flipping back and forth between the two and wishing you could see them together? Now you can. The Graph Template Language (GTL) and PROC SGRENDER enable you to produce a graph inside of a graph, similar to the picture-in-picture feature on your television.

The Game Plan

In this example, we are going to create a graph in the upper right corner of the axis area of a larger graph. When we define the GTL, we always start with the same GTL wrapper, as is shown below. In the wrapper below, INSET is the name of the GTL definition:

proc template;
define statgraph inset;
begingraph;
 
/* insert the code that produces the graphics output */
 
endgraph;
end;
run;

For demonstration purposes, we are going to use the SAS data set Sashelp.Heart and we are going to plot the variable CHOLESTEROL. The ENTRYTITLE statement defines the title for the graph. This statement is valid within the BEGINGRAPH block or after the last ENDLAYOUT statement. The plotting statements are contained within a LAYOUT block. In our example, we have enclosed the HISTOGRAM and DENSITYPLOT plotting statements inside a LAYOUT OVERLAY block. A standard axis is displayed with the BINAXIS=FALSE option in the HISTOGRAM statement. In the PROC SGRENDER statement, we point to the template definition, INSET, using the TEMPLATE option.

proc template;
define statgraph inset;
begingraph;
entrytitle 'Framingham Heart Study';
   layout overlay;
     histogram cholesterol /  binaxis=false datatransparency=0.5;
     densityplot cholesterol /  datatransparency=0.5;
   endlayout;
endgraph;
end;
run;
 
proc sgrender data=sashelp.heart template=inset;
run;

The results are shown in Figure 1.

SAS ODS Graphics

Figure 1

Special Play

Once we have produced the graph in Figure 1, we can see that we have room to display a second graph in it, in the upper right corner of the axis area. We can insert the graph inside the axis by placing the plotting statements inside of a LAYOUT OVERLAY block within a LAYOUT GRIDDED block.  Here are the details.

In the following LAYOUT GRIDDED statement, which is located after the DENSITYPLOT statement, we define the size of the graph using the options WIDTH=300px and HEIGHT=200px. And the options HALIGN=RIGHT and VALIGN=TOP place the graph in the top right corner.

layout gridded / width=300px height=200px halign=right valign=top;

The inset graph contains two regression lines: one for Cholesterol by Diastolic, drawn as a dashed blue line, and a solid green line representing Cholesterol by Systolic. The NAME option is added to each of the REGRESSIONPLOT statements in order to produce a legend. The DISCRETELEGEND statement specifies that the legend be drawn for both SYSTOLIC and DIASTOLIC by referencing their NAME values, which are case sensitive. I specify the Y-axis label with the LABEL option within YAXISOPTS in the LAYOUT OVERLAY statement. I also specify THRESHOLDMAX=1 within the LINEAROPTS option of both YAXISOPTS and XAXISOPTS to ensure that the last tick mark value includes the highest value in the data.

I added a red fill pattern of L3 to the larger graph to make it stand out more.  In addition to specifying the fill pattern with the PATTERN option within the FILLPATTERNATTRS option, you must also specify FILLPATTERN within the DISPLAY option. Valid values for the PATTERN option within the FILLPATTERNATTRS option are L1-L5, R1-R5, and X1-X5.

Note:  The FILLPATTERN option and DISPLAY option FILLPATTERN are available beginning in SAS® 9.4 TS1M1. If you are running an older version of SAS, you need to remove this syntax from the program shown below.

proc template;
define statgraph inset;
begingraph;
entrytitle 'Framingham Heart Study';
  layout overlay;
     histogram cholesterol /  binaxis=false datatransparency=0.5 
                   display=(fillpattern outline fill)       fillattrs=(color=lightred) 
                   fillpatternattrs=(pattern=l3 color=red);
     densityplot cholesterol /  datatransparency=0.5 lineattrs=(color=darkred);
  layout gridded / width=300px height=200px halign=right valign=top;
  layout overlay / yaxisopts=(label='Blood Pressure' linearopts=(thresholdmax=1))
                   xaxisopts=(linearopts=(thresholdmax=1));
     regressionplot x=cholesterol y=diastolic / lineattrs=(color=blue pattern=2) name='Diastolic';
     regressionplot x=cholesterol y=systolic / lineattrs=(color=green) name='Systolic';
     discretelegend 'Diastolic' 'Systolic' / across=2; 
  endlayout;
  endlayout;
  endlayout;
endgraph;
end;
run;
 
proc sgrender data=sashelp.heart template=inset; 
run;

The results are displayed in Figure 2.

sas-ods-graphics02

Figure 2

Now you have the tools you need to display a graph within a graph. For other tips on creating graphs with ODS Graphics and the SG procedures, check out Sanjay Matange’s blog series, Graphically Speaking: http://blogs.sas.com/content/graphicallyspeaking/.

tags: Graph Template Language (GTL), SAS ODS, SAS Problem Solvers

Picture-in-Picture - It’s Not Just for Television Anymore was published on SAS Users.

November 15, 2016
 

If your SAS Visual Analytics report requirements include linking out to separate reports without the need to pass values, you may want to consider using images to enhance the appearance of your base report. Here are three style examples using images that you can use depending on your report design requirements and report user preference:

1.     Visually appealing
2.     Generic
3.     Screenshot of actual report.

There is no substitute for looking at examples, so here are some screenshots:

1.     Visually appealing

Use images in SAS Visual Analytics

2.     Generic

use-images-in-sas-visual-analytics02

3.     Screenshot of actual report

use-images-in-sas-visual-analytics03

Image selection

Using an image in your report has never been easier. You can navigate your local machine for the image and, if you want, you can also save the image in metadata. This allows other users with access to that metadata location the ability to use the same image. This is great when you want to impose consistency throughout your reports.

use-images-in-sas-visual-analytics04

use-images-in-sas-visual-analytics05

Setting link using image

To define a report link from your image, click your image and then open the Interactions tab. Next, use the New drop-down menu and select the type of link you wish to define. For a Report Link or Section Link, use the Browse button to navigate the metadata and select your target. If you are linking to an External URL, enter the fully qualified URL; you can also define a link to a stored process.

use-images-in-sas-visual-analytics06

Here is a breakdown of the report objects used in the main dashboard report. I also included the screenshots of my Daily, Weekly, and Monthly report examples.

Dashboard example breakdown

use-images-in-sas-visual-analytics07

Daily report example

use-images-in-sas-visual-analytics08

Weekly report example

use-images-in-sas-visual-analytics09

Monthly report example

use-images-in-sas-visual-analytics10

tags: SAS Professional Services, SAS Visual Analytics

Use images in SAS Visual Analytics to enhance your report link was published on SAS Users.

November 10, 2016
 

We all have challenges in getting an accurate and consistent view of our customers across multiple applications or sources of customer information. Suggestion-based matching is a technique found in SAS Data Quality that improves matching results for data containing arbitrary typos and misspellings. The suggestion-based concept and its benefits were described in a previous blog post. In this post, I will expand on the topic and show how to build a data job that uses suggestion-based matching in DataFlux Data Management Studio, the key component of SAS Data Quality and other SAS Data Management offerings. This article takes a simple example job to illustrate the steps needed to configure suggestion-based matching for person names.

In DataFlux Data Management Studio, I first configure a Job Specific Data node to define the columns and example records that I’d like to feed into the matching process. In this example, I use a two-column data table made up of a Rec_ID and a Name column, with sample records as shown below.

suggestion-based-matching-in-sas-data-quality

To build the suggestion-based matching feature, I have to insert and configure at least a Create Match Codes node, a Clustering Node and a Cluster Aggregation node in the data job.

suggestion-based-matching-in-sas-data-quality02

This example uses names with randomly injected typographical errors, like missing characters, additional characters and character transpositions. Please note, these types of data errors would not be matched correctly using the standard matching technique in SAS Data Quality; therefore I am leveraging the suggestion-based matching feature. The picture below shows the person names and highlights the injected errors for Ethan Baker.

suggestion-based-matching-in-sas-data-quality03

For suggestion-based matching in SAS Data Quality I need to use a match definition that supports suggestions. The current Quality Knowledge Base (QKB) for Customer Information ships with a ready-to-use match definition called “Name (with suggestions).” For other data types, I can easily create new definitions or expand existing definitions to support the suggestions feature as well.
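(As a side note, plain match codes, without the suggestions feature, can also be generated directly in Base SAS with the SAS Data Quality Server dqMatch function. The sketch below uses the standard 'Name' definition and returns a single code per value, unlike the Studio node discussed in this post; the QKB setup-file path and input table are placeholders.)

/* Load the ENUSA locale from the QKB; the setup-file path is a placeholder */
%dqload(dqlocale=(ENUSA), dqsetuploc='C:\qkb\dqsetup.txt')

data work.matchcodes;
   set work.names;
   /* One match code per name at sensitivity 85 */
   mcode = dqMatch(name, 'Name', 85, 'ENUSA');
run;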

To continue building my suggestion-based matching job I next need to configure the Create Match Codes node as shown in the picture below.

suggestion-based-matching-in-sas-data-quality045

In the property screen of the Create Match Codes node I select the Locale that matches my input data and pick the “Name (with Suggestions)” definition.

Next, I check Allow generation of multiple match codes per definition for each sensitivity in the property window. This enables the data job node to generate suggestions and also create an additional Match Score field as output.

suggestion-based-matching-in-sas-data-quality05

Looking at the output of the Create Match Codes node, we can see that we generate multiple match codes (suggestions) and match scores for a single input (Ethan Baker). Because I selected Allow generation of multiple match codes per definition for each sensitivity, the node generates a match code representing the input name, plus additional match codes (suggestions) with character deletions, insertions, replacements and transpositions applied to the input name. The score value, in the Name_Match Code Score column, is an indicator of how close the generated suggestion is to the original input data. The more change operations used to create the suggestion, the lower the score value. Therefore, the lower the score, the less likely it is that the suggestion is the true name.

Next in configuring my suggestion-based matching job is the Clustering node.

suggestion-based-matching-in-sas-data-quality06

There is no specific configuration needed in the Clustering node when using it with suggestion-based matching. In my example, I only set the Name_Match Code field as the match condition, and I pass all input fields as output (in Additional Outputs).

suggestion-based-matching-in-sas-data-quality07

Previewing the output of the Clustering node, I can already see that the misspelled names of “Ethan Baker” are clustered correctly. But because I generated multiple suggestions for each input record, I end up with multiple clusters holding the same input records. Ethan Baker, Ethn Baker and Epthan Baker and their suggestions are assigned to clusters 0 through 7 and would also appear in single-row clusters further down the output list. It is OK at this step of the data job to have two or more clusters containing the same set of input records when using suggestion-based matching. The next node in the data job will resolve the issue and use the match score to determine the single best cluster.

In the properties of the Cluster Aggregation node I set the Cluster ID, Primary Key and Score fields (which were outputs of the previous Clustering node). In order to determine the single best cluster, I select Cluster as the scoring method and Highest Mean as the scoring algorithm. The Cluster Aggregation node will compute the mean score in each cluster. By checking Remove subclusters, I make sure only the cluster with the highest mean is output.

suggestion-based-matching-in-sas-data-quality08

With the Cluster Aggregation node configured the output looks like this:

suggestion-based-matching-in-sas-data-quality09

The final output of the Cluster Aggregation node is reduced to the eight input records. With the described setup I successfully matched names that contain typographical errors such as additional or missing characters.

As you can see above, the accuracy of your matching rules, and ultimately, your understanding of your customers, can be augmented through use of suggestion-based matching.

For more information, please refer to the product documentation:

tags: data management, DataFlux Data Management Studio, SAS Data Quality

Using suggestion-based matching in SAS Data Quality was published on SAS Users.

November 02, 2016
 

With DataFlux Data Management 2.7, the major component of SAS Data Quality and other SAS Data Management solutions, every job has a REST API automatically created once it is moved to the Data Management Server. This is a great feature that enables us to easily call Data Management jobs from programming languages like Python. We can then involve the Quality Knowledge Base (QKB), a pre-built set of data quality rules, and do other data quality work that is impossible or challenging with Python alone.

calling-sas-data-quality-jobs-from-python

In order to make a RESTful call from Python, we first need to get the REST API information for our Data Management job. The best way to get this information is to go to the Data Management Server in your browser, where you’ll find respective links for:

  • Batch Jobs
  • Real-Time Data Jobs
  • Real-Time Process Jobs.

From here you can drill through to your job REST API.

Alternatively, you can use a “shortcut” to get the information by calling the job’s REST API metadata URL directly. The URL looks like this:

http://<DM Server>:<port>/<job type>/rest/jobFlowDefns/<job id>/metadata

calling-sas-data-quality-jobs-from-python02

The <job id> is simply the job name (with subdirectory and extension), Base64 encoded. This is a common method to avoid issues with illegal URL characters like: # % & * { } : < > ? / + or space. You can go to this website to Base64 encode your job name.
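If you prefer not to use a website, the encoding is a one-liner in Python; this reproduces the job id used in the example below:

import base64

job = "Demo/ParseAddress.ddf"
print(base64.b64encode(job.encode("utf-8")).decode("ascii"))
# prints: RGVtby9QYXJzZUFkZHJlc3MuZGRm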

If you have many jobs on the Data Management Server it might be quicker to use the “shortcut” instead of drilling through from the top.

Here is an example of getting the REST API information for the Data Management job “ParseAddress.ddf”, which is in the Demo subdirectory of Real-Time Data Services on the Data Management Server:

calling-sas-data-quality-jobs-from-python03

We Base64 encode the job name “Demo/ParseAddress.ddf” using the website mentioned above…

calling-sas-data-quality-jobs-from-python04

…and call the URL for the job’s REST API metadata:

http://DMServer:21036/SASDataMgmtRTDataJob/rest/jobFlowDefns/RGVtby9QYXJzZUFkZHJlc3MuZGRm/metadata

calling-sas-data-quality-jobs-from-python05

From here we collect the following information:

The REST API URL and Content-Type information…
calling-sas-data-quality-jobs-from-python06

…the JSON structure for input data

calling-sas-data-quality-jobs-from-python07

…which we need in this format when calling the Data Management job from Python:

{"inputs" : {"dataTable" : {"data" : [[ "sample string" ],[ "another string" ]], "metadata" : [{"maxChars" : 255, "name" : "Address", "type" : "string"}]}}}

…and the JSON structure for the data returned by the Data Management job.

calling-sas-data-quality-jobs-from-python08
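(The returned structure, shown above as a screenshot, mirrors the input structure; roughly, and with illustrative field names, it looks like this:)

{"outputs" : {"dataTable" : {"data" : [[ "parsed address tokens" ]], "metadata" : [{"maxChars" : 255, "name" : "Address_Out", "type" : "string"}]}}}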

When you have this information, the Python code to call the Data Management job would look like this:

calling-sas-data-quality-jobs-from-python09
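Since the code above is shown as a screenshot, here is a minimal sketch of the kind of call it makes, assuming the requests module and the URL and JSON structure gathered earlier; the sample addresses are made up, and the exact endpoint and headers come from your job’s REST metadata:

import requests

url = ("http://DMServer:21036/SASDataMgmtRTDataJob/rest/jobFlowDefns/"
       "RGVtby9QYXJzZUFkZHJlc3MuZGRm")
headers = {"Content-Type": "application/json", "Accept": "application/json"}

# Input follows the JSON structure from the job's REST metadata;
# the addresses are made-up sample values.
payload = {"inputs": {"dataTable": {
    "data": [["216 Elm Street, Cary NC"], ["100 Main Street, Raleigh NC"]],
    "metadata": [{"maxChars": 255, "name": "Address", "type": "string"}]}}}

data_raw = requests.post(url, headers=headers, json=payload)
data_raw.raise_for_status()                     # stop on HTTP errors

data_out = data_raw.json()                      # requests' built-in JSON decoder
for row in data_out["outputs"]["dataTable"]["data"]:
    print(row)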

The output data from the Data Management job will be in data_raw. We call the built-in JSON decoder from the requests module to move the output into a dictionary (data_out), from which we can access the data. The structure of the dictionary follows the REST metadata. We can access the relevant output data via data_out['outputs']['dataTable']['data']

calling-sas-data-quality-jobs-from-python10

calling-sas-data-quality-jobs-from-python11

The Python program will produce an output like this…

calling-sas-data-quality-jobs-from-python12

You can find more information about the DataFlux Data Management REST API here.

Calling Data Management jobs from Python is straightforward and a convenient way to augment your Python code with the more robust set of data quality rules and capabilities found in the SAS Data Quality solution.

Learn more about SAS Data Quality.

tags: data management, DataFlux Data Management Studio, open source, REST API

Calling SAS Data Quality jobs from Python was published on SAS Users.

October 25, 2016
 

Requirements that are the most easily described can often be the most difficult to implement. I’m referring to requests like:

  • Display a gauge with the most recently collected metric.
  • Plot an 18-month rolling window of profit.
  • Display last month’s products percent of total metrics for visual comparison.

Okay, so these are pretty specific requests, which I built a report to answer, but none the less, requirements like these do exist.

Use Rank in SAS Visual Analytics

So, how do you implement these requests? Use rank! You might be wondering how this is possible since the rank feature requires a numeric value and these requirements are based on dates. Solution: use the TreatAs function. Let’s break it down step by step.

But first, here is a breakdown of the report objects used in this report. Notice that this report contains a section prompt via a button bar which prompts the user to select a Product Line. This section prompt filters all of the other objects by that Product Line value.

use-rank-in-sas-visual-analytics02

Step 1: Use TreatAs to create a metric from your date category

I am assuming that your data source has a date category. This will work with a date, date by month, or date by year formatted data item. As long as the data item is recognized as a date, this technique will work.

This example will use the Date by Month data item. We will use the TreatAs function to create a metric; in other words, a numeric representation of the date. That’s the great thing about dates in SAS: they simply represent the number of days before or after January 1, 1960. So the more recent the date, the larger the number, which we can then use rank to order.

From the Data tab, use the drop-down menu and select New Calculated Item….

use-rank-in-sas-visual-analytics03

Give your new calculated data item a name.

The result type will be numeric.

Under Operators, use the search window to find the TreatAs function; then drag that onto the visual pane. For the drop-down option, select _Number_.
Finally, drag the date data item onto the visual pane. In this example, we are using Date by Month.
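For reference, if you view the finished calculated item as text, the expression should look something like the line below; this rendering is an assumption on my part, and the quoted data item name must match yours:

TreatAs(_Number_, 'Date by Month'n)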

use-rank-in-sas-visual-analytics04

Step 2: Change the aggregation on your new measure to be non-additive

Next, we need to make sure this new metric that represents the Date by Month date is non-additive. We will not get the proper result if this new metric takes the sum or average when displayed on a visualization. To do this, navigate to the Data tab and click on the name of the new metric you created. In my example, I created a new metric named DateByMonthNum.

Toward the bottom of the Data tab are the data properties. Under the Aggregation property use the drop-down menu and select one of the non-additive metrics such as: Minimum, Median, or Maximum.

use-rank-in-sas-visual-analytics05

Step 3: Verify that your new measure returns the correct results

Now we can verify that when we rank our new measure, we get the expected results. To do this, I used a list table and added both the date data item Date by Month and the new metric data item DateByMonthNum. Here we can see that when I sort the metric data item in descending order, I get the expected results: each Date by Month value gives me a different DateByMonthNum value. I can also see that a more recent Date by Month value pairs with a larger DateByMonthNum value.

use-rank-in-sas-visual-analytics06

To be sure that you properly assigned a non-additive aggregation type, you can use the Show detail data property from the Properties tab. At the detail level you should see the same value pairs for the date and metric data items. Once you de-select Show detail data you should see the exact same value pairs. If you do, then you have correctly assigned your non-additive aggregation type.

use-rank-in-sas-visual-analytics07

Step 4: Use Rank to meet report requirements

Now that we have our metric properly created, we can use the Rank feature to display the last month’s metrics or a rolling window.

Last Month’s Metrics
In this visualization I used the Gauge Object.

use-rank-in-sas-visual-analytics08

On the Roles tab, I assigned Profit to the Measure role and Product to the Group role. I then created a five interval Display Rule between 0% and 50% at 10% intervals where anything over 50% is grouped together under the darkest green rule.

use-rank-in-sas-visual-analytics09

Now we must filter this visualization to display only the last month’s profit metrics; we do this by using the Rank feature. From the Ranks tab, you must first select the category data item you wish to subset by the rank. In our example, we want to display the last month’s metrics, so we will want to add a rank for the Date by Month data item. Once selected, click the button Add Rank.

use-rank-in-sas-visual-analytics10

Next we will need to select the metric we want to rank by. Next to the By drop-down, select our newly created metric DateByMonthNum. Then we will want to select the type of rank and how many values to return. In this example, we will return the Top Count, i.e. the greatest value, and for the Count we want to return 1.

use-rank-in-sas-visual-analytics11

To help with the titling of the report, I added the exact same rank to a List Table object to display the data’s last month and to help report users know which month they are looking at.

use-rank-in-sas-visual-analytics12

use-rank-in-sas-visual-analytics13

Rolling 18 Month Window
The next visualization I created was a Line Chart Object plotting a rolling window of 18 month profit.

use-rank-in-sas-visual-analytics14

On the Roles tab, I selected Date by Month as the Category and Profit as the Measure.
On the Ranks tab, I selected the same values as I did for the list table and gauge objects, except I selected a Count of 18 to return the top 18 values of Date by Month ranked on our newly created metric DateByMonthNum. The rank will return the top 18 highest values for DatebyMonthNum which pair to the most recent 18 values for Date by Month giving us a rolling 18 month window.

use-rank-in-sas-visual-analytics15

Other Applications

In this example I used Rank at the month level but you could use this technique at the day level, quarter level, essentially for any supported date interval.

Assuming you have the proper data collected, you could also use Rank for the standard use of ranking the top X performing products, sales representatives or investment funds. You could also use rank to identify your bottom performing manufacturing equipment, car mileage, or school ratings.

Other Report Screenshots

use-rank-in-sas-visual-analytics16

use-rank-in-sas-visual-analytics18

tags: SAS Professional Services, SAS Programmers, SAS Visual Analytics

Use Rank in SAS Visual Analytics to display the last date, month or rolling window was published on SAS Users.

October 21, 2016
 

ProblemSolvers

Have you ever needed to run code based on the client application that you are using? Or have you needed to know the version of SAS® software that you are running and the operating system that you are running it on? This blog post describes a few automatic macro variables that can help with gathering this information.

Application Name

You can use the &_CLIENTAPP macro variable to obtain the name of the client application. Here are some details:

  • Referencing &_CLIENTAPP in SAS® Studio returns a value of SAS Studio
  • Referencing &_CLIENTAPP in SAS® Enterprise Guide® returns a value of 'SAS Enterprise Guide'
    Note: The quotation marks around SAS Enterprise Guide are part of the value.

Program Name

You can use the &SYSPROCESSNAME macro variable to obtain the name of the current SAS process. Here are some details:

  • Referencing &SYSPROCESSNAME interactively within the DMS window returns a value of DMS Process
  • Referencing &SYSPROCESSNAME in the SAS windowing environment of your second SAS session returns a value of DMS Process (2)
  • Referencing &SYSPROCESSNAME in SAS Enterprise Guide or SAS Studio returns a value of Object Server
  • Referencing &SYSPROCESSNAME in batch returns the word Program followed by the name of the program being run (for example: Program 'c:\test.sas')
    Note: For information about other techniques for retrieving the program name, see SAS Note 24301: “How to retrieve the program name that is currently running in batch mode or interactively.”

Example

The following code illustrates how you can use both of these macro variables to check which client application you are using and display a message in the SAS log based on that result:

%macro check;
 
  %if %symexist(_clientapp) %then %do;
   %if &_clientapp = SAS Studio %then %do;
    %put Running SAS Studio;
   %end;
   %else %if &_clientapp = 'SAS Enterprise Guide' %then %do;
    %put Running SAS Enterprise Guide; 
   %end;
  %end;
 
  %else %if %index(&sysprocessname,DMS) %then %do;
    %put Running in Display Manager;
  %end;
  %else %if %index(&sysprocessname,Program) %then %do;
     %let prog=%qscan(%superq(sysprocessname),2,%str( ));
     %put Running in batch and the program running is &prog;
  %end;
 
  %mend check;
 %check

SAS Session Run Mode or Server Type

Another helpful SAS read-only automatic macro variable is &SYSPROCESSMODE. You can use &SYSPROCESSMODE to obtain the current SAS session run mode or server type name. Here is a list of possible values:

• SAS Batch Mode

• SAS/CONNECT Session 

• SAS DMS Session

• SAS IntrNet Server

• SAS Line Mode

• SAS Metadata Server

• SAS OLAP Server

• SAS Pooled Workspace Server

• SAS Share Server

• SAS Stored Process Server

• SAS Table Server

• SAS Workspace Server

Operating System and Version of SAS

Having the information detailed above is helpful, but you might also need to know the operating system and exact version of SAS that you are running. The following macro variables help with obtaining this information.

You can use &SYSSCP and &SYSSCPL to obtain an abbreviation of the name of your operating system.  Here are some examples:

macrovariables
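For instance, here is a quick way to see your own values; the sample outputs in the comments are typical, but your exact values depend on the host:

%put NOTE: running on &sysscp (&sysscpl);

/* On 64-bit Linux this prints something like:  NOTE: running on LIN X64 (Linux) */
/* On 64-bit Windows 10 Pro:                    NOTE: running on WIN (X64_10PRO) */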

For a complete list of values, see the “SYSSCP and SYSSCPL Automatic Macro Variables” section of SAS® 9.4 Macro Language: Reference, Fourth Edition.

SAS Release

&SYSVLONG4 is the most informative of the macro variables that provide SAS release information. You can use it to obtain the release number and maintenance level of SAS as well as a four-digit year. Here is an example:

%put &sysvlong4;

This code would print something similar to the following in the log:

9.04.01M3D06292015

Here is what this output means:

SAS release: 9.04.01

Maintenance level: M3

Ship Event date: D06292015

I hope that some of the tools described above are useful to you for obtaining information about your SAS environment. If you have any questions, please contact SAS Technical Support, and we will be happy to assist you. Thank you for using SAS!

tags: macro, Problem Solvers, SAS Macro, SAS Programmers

Macro variables that provide information about your SAS® environment was published on SAS Users.