
December 10, 2016
 

During a recent customer visit, I was asked how to include a calculated variable within SAS Asset Performance Analytics’ (APA) Root Cause Analysis workflow. This is a simple request. Is there a simple approach to do this?

To remind you, in the APA workflow, an ETL Administrator makes a Data Mart available in the solution for the APA users. They can select variables and explore, analyze and create a model based on the columns present in the Data Mart.

But if you want to analyze a calculated column, like a difference between two variables, do you need to change the Data Mart? Yes, if you want the APA Data Selection to include this calculated column. But this takes time, and do you really need to change the Data Mart? No!

A simpler and faster approach to adding a calculated column is modifying the APA Root Cause Analysis workflow. And, this is simple!

SAS Asset Performance Analytics is easily and highly configurable. You can easily customize analytical workflows by modifying its underlying stored process. Let me show you how to customize an existing analysis and add a calculation step to enhance APA’s Root Cause Analysis with a calculated column.

Benefits

The main purpose of this customized analysis is to avoid requiring SAS Enterprise Guide. APA users are rarely SAS experts, so asking them to switch between tools depending on which functionality is available in the APA GUI isn’t recommended. The more you can do within the APA interface via wizard-guided workflows, the easier it will be.

The second benefit is keeping the Data Mart limited to only crucial variables. Instead of asking an ETL Administrator to add non-validated and/or infrequently used calculated columns to the Data Mart, allow the APA user to test and create meaningful tags to enhance workflows as needed. Once the APA user identifies new and meaningful calculated variables, they can easily be added to the Data Mart and made available to APA Explorations and APA Stability Monitoring. Limiting the Data Mart to only critical variables ensures the data size stays optimized and is adjusted only as needed.

Root Cause Analysis Use

Using this new Root Cause Analysis is easy: instead of selecting “Root Cause Analysis,” select “Root Cause Analysis Calculated Columns” when required.

sas-asset-performance-analytics-root-cause-analysis01

Figure 1: Stored Process Folder allows us to choose the appropriate Analysis

Clicking on “Ok” triggers the same steps as the original RCA.

sas-asset-performance-analytics-root-cause-analysis02

Figure 2: steps triggered by the Root Cause Analysis

After specifying a data selection, the “Filter Data” step contains six new user prompts:

Figure 3: New “Filter Data” interface

These prompts allow the user to define which calculation is needed. Six calculation choices are currently available, and they can be further customized as needed. The calculation types currently available are:

  1. Difference between two variables if you want to study a gap between two tags
  2. Absolute difference of two variables if large gaps are suspicious regardless of the order
  3. Ratio of two variables if you want to control the proportionality
  4. Multiplication of two variables
  5. Summation of two variables
  6. None

By default, the Calculation option is set to “None,” to perform a classical Root Cause Analysis.

Figure 4: Dropdown list with calculation details

After choosing your variables, you can apply the variable coefficients you want to test. By default, APA performs an absolute difference, Variable 1 – Variable 2, with both coefficients set to 1. If the variables don’t have the same order of magnitude, you can apply coefficients to bring the variables to the same level. The newly created calculated variable then fluctuates around 0, which is easy to interpret.

In the example below, the goal is to compute a simple absolute difference between Variable 2 (V.Four_Press_Clap) and Variable 1 (V.Four_Press_C). By default, the newly created column is named DIFF_TAGS. You can change it to something more descriptive for your purposes. Don’t forget that the new column name must follow standard SAS column naming conventions.

Before processing the step, be careful to check that the data is not missing during the data period you’ve selected. If it is missing for the full time period, the DIFF_TAGS variable will not be created and the output will be the same as if you’d selected “None” for the calculation prompt.

Figure 5: DIFF_TAGS variable creation corresponding to the result of Four_Press_Clap – Four_Press_C

Click “Run and Save Step.” As a result, the newly created calculated column is added to your input RCA data and available in the “Expand and Build Statistics” step. Now you can apply minimum and maximum value bands to the calculated column similar to the original RCA workflow.

Figure 6: minimum and maximum values to consider as atypical

As a result, the calculated column is used like a standard input tag during the full Root Cause Analysis workflow.

Things to keep in mind

If you have a specific need, SAS Asset Performance Analytics is easily customizable. In France, for example, we offer a modular APA interface that can be extended without limit.

Creating columns within the APA user interface analysis workflow has two main benefits:

  1. The calculation is intuitive for a non-SAS user
  2. You don’t need to change the data mart

Only two steps are required to implement Root Cause Analysis including Calculated Columns within APA:

  1. Copy and paste the pam_rca_filterData_NEW.sas file into the SASHome/SASFoundation/9.4/pamsrmv/sasstp directory.
  2. Using the SAS Management Console, import the RCA_CC.spk file at the following location: /Products/SAS Asset Performance Analytics/Analytical Workbench in the Folders tab.

If you have any questions about this process, feel free to add a comment below.

tags: Global Technology Practice, SAS Asset Performance Analytics

Enhancing SAS Asset Performance Analytics’ Root Cause Analysis with Calculated Columns was published on SAS Users.

December 2, 2016
 

Nearly every organization has to deal with big data, and that often means dealing with big data problems. For some organizations, especially government agencies, addressing these problems provides more than a competitive advantage, it helps them ensure public confidence in their work or meet standards mandated by law. In this blog I wanted to share with you how SAS worked with a government revenue collection agency to successfully manage their big data issues and seamlessly integrate with Hadoop and other technologies.

Hadoop Security

We all know Hadoop pretty well, and if you haven’t heard of Hadoop yet, it is about time you invest some resources to learn more about this emerging de facto standard for storage and compute. The core of Apache Hadoop consists of a storage part, HDFS (Hadoop Distributed File System), and a processing part, MapReduce. Hadoop splits large files into large blocks and distributes them across the nodes of a cluster.

Hadoop was initially developed to solve web-scale problems like webpage search and indexing at Yahoo. However, the potential of the platform to handle big data and analytics caught the attention of a number of industries. Since the initial use of Hadoop was to count webpages and implement algorithms like PageRank, security was never considered a major requirement until Hadoop started being used by enterprises across the world.

Security incidents and massive fines have become commonplace, and financial institutions, in particular, are doing everything to avoid such incidents. Security should never be an afterthought; it should be considered in the initial design of the system. The five core pillars of Enterprise Security are as follows:

sas-integration-with-hadoop05

Our customer had four of the core pillars, from Administration to Auditing, covered using tools provided by their Hadoop vendor. While there are options in the open-source community that provide data protection, in this case the organization decided to use a data security company to protect data at rest on top of Cloudera Navigator Encryption. They refer to it as “Double Encryption.”

The challenge

SAS has multiple products around the Hadoop ecosystem to provide the best support for customers. The traditional way of working with Hadoop involves SAS/ACCESS, which typically pulls the data from Hadoop using Hive. However, for larger installations, where data movement is a concern, SAS provides Embedded Process technology, which allows you to push SAS code inside a Hadoop cluster and run it alongside the data blocks. This is a super-efficient way to access large data sets inside Hadoop, because it pushes the compute to the data.

Our customer's data security vendor’s product supports access via Hive UDFs, which means you can tokenize and detokenize data when working with SAS/ACCESS Interface to Hadoop using PROC SQL and other options, essentially out of the box. In addition, new SAS language functions for tokenizing and detokenizing data inside Base SAS can be added using the security company’s API together with PROC FCMP and PROC PROTO.

However, SAS Embedded Process has no default support for our customer's security vendor, so SAS products that rely on the SAS EP (SAS Code Accelerator, SAS Scoring Accelerator and LASR-based products) cannot work with data tokenized by the vendor. This was a major challenge for our customer, who wanted to use SAS products like SAS Visual Analytics and SAS Visual Statistics on large volumes of data stored in Hadoop.

The challenge, then, was to make SAS Embedded Process work with the data security vendor’s software to perform detokenization before passing the data to SAS procedures.

The possible solutions

We considered various solutions before agreeing on a solution that satisfies all current requirements and could be extended to meet the future needs of our customer. Let’s discuss the top two solutions and the final implementation.

Solution 1: SERDE approach

Our first approach was to create a custom Hive SerDe that wraps the data security company’s APIs. With 9.4M3, the SAS Embedded Process (EP) can read and write via SerDe APIs, with some possible constraints and limitations, including DS2’s SET/MERGE capabilities and the passing of identity credentials from SAS to the company’s APIs.

sas-integration-with-hadoop02

The approach had various drawbacks, but the biggest was handling the many file formats in use. This approach was discarded because it would have meant significant rework with every new data format released by the Hadoop community. While an organization will generally standardize on a few formats for its use cases, it is nonetheless a limiting factor.

Solution 2: Use HDMD with Custom Input Formats

The second approach was to use HDMD with custom input formats. SAS HDMD supports custom input formats, which allow you to plug in your own reader. A high-level architectural diagram looks something like Figure 2. This approach works with a variety of file formats, and we have tested it with Parquet, Avro and ORC with good results. The objective is to load a data set onto Hadoop (or use an existing one) and generate an HDMD file for it. We plug our custom reader into the HDMD file, and as part of the custom reader we make a number of calls to the data security company’s API. The API calls the security vendor’s specific protect and unprotect procedures to protect and/or unprotect the data depending on the requirements and passes the results back to the client.

sas-integration-with-hadoop03

What is an Input/Custom input format BTW?

Data inside Hadoop is typically stored on HDFS (Hadoop Distributed File System). The data needs to be read from the file system before being processed. This is achieved using an InputFormat, which has the following responsibilities:

  • Compute input splits
    • Input splits represent the parts of the data that will be processed by each map phase. A unique input split is passed to each process. At the start of a MapReduce job, the input format splits the data into multiple parts based on logical record boundaries and HDFS block size. To get the input splits, the following method is called:
      • List<InputSplit> getSplits(JobContext ctx)
  • Provide the logic to read an input split
    • Each mapper gets a unique input split to process. The input format provides the logic to read the split, in the form of an implementation of the RecordReader interface. The record reader reads the split and emits <key,value> pairs as input for each map function. The record reader is created using the following method:
      • RecordReader<K,V> createRecordReader(InputSplit is, TaskAttemptContext ctx)

All the common formats will provide a way to split the data and read records. However, if you want to read a custom data set for which data parsing isn’t available out of the box with Hadoop, you are better off writing a custom input format.

How to write a Custom Input Format?

Writing a custom input format requires Java skills (the language Hadoop itself is written in). You have the option of implementing the abstract methods of the InputFormat class, or extending one of the pre-existing input formats. In our case, we extended FileInputFormat and overrode a few critical methods:

  • getSplits()
  • getRecordReader()

getSplits() creates the splits from the input data, while getRecordReader() returns an instance of a Java object that can read custom records, which in our case meant calling the security vendor’s API.

You can use one of the predefined RecordReader classes or implement your own (the latter is most likely if you are writing a custom input format). In our case, we implemented the RecordReader interface and its next() method, which is called whenever a new record is found. This is where your core business logic goes. In our case, we wrote the integration logic by looking at the data, identifying the logged-in user (available as part of the JobConf object), and then calling the vendor’s APIs to decrypt the data. Sample code can be requested by contacting me directly.

Integrating a custom input format with SAS

Integrating a custom input format is fairly easy with SAS. SAS allows us to plug in custom formats, which are called before the data is processed via SAS Embedded Process using HDMD files.

When you generate an HDMD file using PROC HDMD, you can specify your custom input format as a part of the generated XML file. Please refer to PROC HDMD documentation.

The generated HDMD file would look something like this.

sas-integration-with-hadoop04

When loading the data from HDFS, SAS will ensure that the specified input format is called prior to any data processing taking place.

The ultimate solution

The solution was demonstrated using data from the tax authorities and included tokenization of data via Hive UDFs, detokenization of data according to the policies set on the data security appliance, and analytics using SAS Visual Analytics. Only users with permissions on the specific policy were able to view the data in the clear; users without permissions saw only the tokenized data. This additional security helped the enterprise protect users’ information from inadvertent access and resulted in widespread use of big data technologies within the enterprise.

Summary

As you can see from the example above, SAS is open for business, and is already providing deep integration with Hadoop and other technologies using custom APIs. The sky is the limit for people willing to explore the capabilities of SAS.

tags: Global Technology Practice, Hadoop, SAS/ACCESS Interface to Hadoop

SAS integration with Hadoop - one success story was published on SAS Users.

December 1, 2016
 

Data virtualization is an agile way to provide virtual views of data from multiple sources without moving the data. Think of data virtualization as another arrow in your quiver in terms of how you approach combining data from different sources to augment your existing Extract, Transform and Load (ETL) batch processes. SAS® Federation Server is a unique data virtualization offering that provides not only blending of data, but also on-demand data masking, encryption and cleansing of the data. It provides a central, virtual environment for administering and securing access to your Personally Identifiable Information (PII) and other data.

Data privacy is a major concern for organizations, and one of the features of SAS Federation Server is that it allows you to effectively and efficiently control access to your data, so you can limit who is able to view sensitive data such as credit card numbers, personal identification numbers, names, etc. In this three-part blog series, I will explore the topic of controlling data access using SAS Federation Server. The series will cover the following topics:

Part 1: Securing sensitive data using SAS Federation Server at the data source level
Part 2: Securing sensitive data using SAS Federation Server at the row and column level
Part 3: Securing sensitive data using SAS Federation Server data masking

SAS Metadata Server is used to perform authentication for users and groups in SAS Federation Server and SAS Federation Server Manager is used to help control access to the data. In this blog, I want to explore controlling data access to specific sources of data using SAS Federation Server.  Obviously, you can secure data at its source by using secured metadata-bound libraries in SAS Metadata Server or by using a database’s or file’s own security mechanisms. However, SAS Federation Server can be used to control access to these data sources by authenticating with the users and groups in SAS Management Console and setting authorizations within SAS Federation Server Manager.

In order to show how SAS Federation Server can be used to control access to data, I will explore an example where Finance Users in our fictitious company SHOULD have access to the Salary data in a SAS data set, but our Business Users should NOT. Instead, our Business Users should have access to all other BASE tables with the exception of SALARY. In my scenario, Kate is a Finance User and David and Sally are Business Users. These users have already been set up as such in SAS Metadata Server.

The SAS Federation Server Administrator has setup the BASE catalog and schema information in Federation Server Manager. The SALARY table is located in the Employee_Info schema within the Global catalog.

securing-sensitive-data-using-sas-federation-server01

The SAS Federation Server Administrator has also explicitly granted the CONNECT and SELECT permissions to both the Business Users and Finance Users group for the BASE Data Service.

securing-sensitive-data-using-sas-federation-server02

securing-sensitive-data-using-sas-federation-server03

This gives both groups permission to connect to and select information from the items within this Data Service. The information is inherited by all children items of the Data Service – Data Source Names, Catalogs, Schemas, Tables and Views.  For example, note that the Business Users group has inherited the Grant setting for the CONNECT permission on the BASE Data Source Name (DSN) and the SELECT permission on the EMPLOYEES table.  Permission inheritance is denoted by the diamond symbol.

securing-sensitive-data-using-sas-federation-server04

securing-sensitive-data-using-sas-federation-server05

For the SALARY table, the SAS Federation Server Administrator has explicitly denied the SELECT permission for the Business Users group, whereas the Finance Users group has inherited the Grant setting for the SELECT permission on the SALARY table.

securing-sensitive-data-using-sas-federation-server06

securing-sensitive-data-using-sas-federation-server07

Kate, who is a member of the Finance Users group, has permission to select records from the SALARY table.

securing-sensitive-data-using-sas-federation-server08

Note: The user does not need to know the physical location of where the SAS data resides. They simply refer to the Federation Server Data Source Name which in this case is BASE.

By denying the Business Users group the SELECT permission on the SALARY table, David, who is a member of the Business Users group, does NOT have access to select records from this table. He is denied access.

securing-sensitive-data-using-sas-federation-server09

However, David still has access to the EMPLOYEES table since the Business Users group inherited the SELECT permission for that table.

securing-sensitive-data-using-sas-federation-server10

If I want to prevent David from accessing any of the tables or views in the Employee_Info schema, but still allow other Business Users to access them, then as the SAS Federation Server Administrator I can explicitly deny the user, David, the SELECT permission for the Employee_Info schema as shown below.

securing-sensitive-data-using-sas-federation-server11

Now, David inherits the Deny setting for the SELECT permission for all tables and views within that schema and he will no longer be able to access the EMPLOYEES table.

securing-sensitive-data-using-sas-federation-server12

However, Sally, another member of the Business Users group, is still able to access the EMPLOYEES table.

securing-sensitive-data-using-sas-federation-server13

In this blog entry, I covered the first part of this series on controlling data access with SAS Federation Server 4.2:

Part 1: Securing sensitive data using SAS Federation Server at the data source level
Part 2: Securing sensitive data using SAS Federation Server at the row and column level
Part 3: Securing sensitive data using SAS Federation Server data masking

I’ll be posting Part 2 of this series soon. Keep an eye out for it.

For more information on SAS Federation Server:

tags: SAS Federation Server, SAS Professional Services, Securing data

Securing sensitive data using SAS Federation Server at the data source level was published on SAS Users.

November 29, 2016
 

Presenting at SAS Global Forum 2017 est plus près de la maison, está más cerca de casa, está mais perto de casa, dichter bij huis, is closer to home, eh!

In analytics and statistics, we often talk about sample sizes. The size of the data sets that you analyze is a measure of the amount of information contained within those data. When observations are very similar or correlated due to study design, the information added by having multiple (correlated) observations may be negligible. This is a common problem with clustered data; the information contained in clustered data is closer to the number of clusters than to the number of observations. As a result, study designers seek to measure many clusters.

When it comes to global presenters, SAS Global Forum is seeking more clusters.

Global representation at SAS Global Forum enriches the conference experience for all attendees, providing each of us with more innovation and information to advance the goals of our organizations.

However, we know that attending our conference from the far corners of the globe is expensive … but not as expensive as it used to be! We’ve got good news for SAS users who reside outside the contiguous 48 states of the United States (residents of Alaska, Hawaii, and U.S. territories, read this carefully!).

To ease the financial burden of travelling from afar to the conference, two new policies have been adopted by the SAS Global Users Group – largely in response to your concerns about cost.

Doubled discount for accepted contributed sessions

Each year, SAS Global Forum attracts about 700 proposed sessions from the user community. The review process is competitive as we can only accept 400 session talks. To attract even more submissions from around the globe, we’ve raised the registration discount from 25% to 50% for accepted proposals from the international user community. If you reside outside the 48 contiguous States, and your abstract is approved, you will automatically receive the 50% discount when you register.

As of the writing of this blog, SAS Global Forum 2017 will include four sessions from Africa, nine from Australia, 18 from Asia, 12 from South America and the Caribbean, 37 from Canada, 21 from Europe, and 23 from the United Kingdom. With this new policy, we expect far more in 2018 and beyond!

International Professional Awards

Forty SAS Users will be selected from submitted applications to have their registration fee waived. SAS Users who reside outside the contiguous 48 States can apply by completing the application found on the conference website. In this application, you will be asked to describe your SAS experience, barriers to attendance, and about your commitment to attend. Submitters of contributed content are eligible.

We are certain these changes will help make SAS Global Forum the most diverse, international conference yet! I look forward to meeting many SAS users from near and far in Orlando. See you there! Or should I say Wir sehen uns dort! Ci vediamo lì! Nähdään siellä!

tags: International Professional Awards, papers & presentations, SAS Global Forum

SAS Global Forum 2017 is closer to home, or should I say… was published on SAS Users.

November 25, 2016
 

Using the DATASETS procedure, we can easily modify SAS variable attributes such as name, format, informat and label:

proc datasets library=libref;
  modify table_name;
    format var_name = date9.;
    informat var_name = mmddyy10.;
    label var_name = 'New label';
    rename var_name = var_new_name;
run;
quit;

We cannot, however, modify fixed variable attributes such as variable type and length.

Notice that we apply variable attributes modifications to one table at a time; that table name is specified in the modify statement of the proc datasets. There might be multiple modify statements within a single proc datasets to apply modifications to multiple data tables.

But what if we need to change variable attributes in all tables of a data library where that variable is present? Say, there is a data model change that needs to be applied consistently to all the data tables in that model.

Business case

Let’s consider a business case for a SAS programmer working for a pharmaceutical company or a contract research organization (CRO) managing data for a clinical trials project. Imagine that you need to apply some variable attributes modifications to all SDTM (Study Data Tabulation Model) and/or ADaM (Analysis Data Model) datasets to bring them in compliance with the CDISC (Clinical Data Interchange Standards Consortium) requirements. Obviously, “manually” going through dozens or hundreds of data tables in different domains to identify whether or not they contain your variable of interest is not an efficient solution.

Better data management practice

This variable attributes update process can be fully automated using the Data-driven SAS macro loops approach as described in my earlier blog post. First, we need to programmatically loop through all the tables in a specified library to find the tables that contain the variable to be modified. Then, for each iteration of that data loop we use SAS macro looping to generate multiple modify statements in proc datasets with subsequent sub-statements (format, informat, label, rename).

SASHELP library

SAS provides a multitude of datasets and views in the SASHELP library. While many of them can be used in coding examples, demos, and testing, there are some that provide invaluable information about your SAS environment. Take a look at the views in the SASHELP library whose names start with the letter “V”: Vcolumn, Vlibname, Vmacro, Voption, Vtable, Vtitle, and others.

These SAS views contain a wealth of information about your SAS environment, and reflect any new changes to your SAS environment dynamically. For example, if you run the following code to create a macro variable:

%let aaa = a;

and then open the Vmacro view, you will see that macro variable listed right there at the very top.

For our purpose of looping through all the tables in a library containing a particular variable, we are going to use the Vcolumn view, as it has all the information we need – libname (Library Name), memname (Member Name), name (Column Name):

Modifying variable attributes in a SAS library

Code implementation

Having that SASHELP.VCOLUMN table view available we can easily implement our code for modifying variable attributes in all the tables of a data library. Here is the SAS code example:

libname sdtm 'c:';

data sdtm.a;
  x=1; label x='Old x-label';
run;

data sdtm.b;
  y=1; label y='Old y-label'; format y mmddyy8.;
run;

data sdtm.c;
  y=2; label y='Old y-label'; format y date9.;
run;

%macro change_var_attr (
  lib=,           /* libref,               required parameter */
  var=,           /* variable name,        required parameter */
  new_label='_',  /* new label, in quotes, optional parameter */
  new_format=_,   /* new format,           optional parameter */
  new_informat=_, /* new informat,         optional parameter */
  new_name=_      /* new variable name,    optional parameter */
  );

  /* get all tables in a library &lib containing variable &var */
  data target_tables (keep=memname);
    set sashelp.vcolumn (keep=libname memname name);
    where (upcase(libname) eq upcase("&lib")) and (upcase(name) eq upcase("&var"));
  run;

  proc datasets library=&lib;
    %let dsid = %sysfunc(open(target_tables));
    %syscall set(dsid); /* no leading ampersand for dsid with %syscall */ 
    %do %while(%sysfunc(fetch(&dsid)) eq 0);
      modify &memname;
      %if &new_label ne '_'  %then %str(label    &var = &new_label;);
      %if &new_format ne _   %then %str(format   &var &new_format;);
      %if &new_informat ne _ %then %str(informat &var &new_informat;);
      %if &new_name ne _     %then %str(rename   &var = &new_name;);
    %end;
    %let rc = %sysfunc(close(&dsid));
  run;
  quit;

%mend change_var_attr;

%change_var_attr(lib=sdtm, var=y, new_label='New label, commas allowed',
                 new_format=yymmddd10., new_name=z)

The centerpiece of the code is the change_var_attr macro. The macro has two required parameters, lib and var, and four optional parameters, new_label, new_format, new_informat, and new_name.

To get all the tables in a library &lib containing the variable &var, we query and subset the SAS view SASHELP.VCOLUMN and create a target_tables SAS dataset.

Then in PROC DATASETS, we macro-loop through the target_tables and generate as many modify statements as there are observations in the target_tables dataset, one for each table to be modified. Each modify statement is followed by one or several optional statements – label, format, informat, and rename.

The %syscall set(dsid); statement invokes the CALL SET routine, which automatically sets the value of the macro variable memname when an observation containing the memname variable is fetched (read) from the target_tables data set.

There may be other programming techniques to solve the problem of modifying variable attributes across all data sets of a data library. You are welcome to suggest your solution in the comments section below.

Bonus

For those who were patient enough to read to the end… If you need to apply modifications to a variable in tables across multiple data libraries, you can still do it with a single invocation of the change_var_attr macro. In order to do that you would need to define a new data library logically concatenating all the relevant libraries. For example, the following libname statement assigns the libref combo as a logical concatenation of both, sdtm and adam libraries:

libname combo (sdtm adam);

 
tags: modifying variables, SAS Programmers

Modifying variable attributes in all datasets of a SAS library was published on SAS Users.

November 18, 2016
 

ProblemSolvers

With fall comes cooler weather and, of course, football. Lots of football. Oftentimes there are two NFL games on that my husband wants to watch at the same time. Instead of flipping back and forth between two television stations, he can watch both games simultaneously, thanks to the picture-in-picture feature on our television. This same concept works for SAS® ODS Graphics.

Have you ever been viewing two graphs across pages, flipping back and forth between the two and wishing you could see them together? Now you can. The Graph Template Language (GTL) and PROC SGRENDER enable you to produce a graph inside of a graph, similar to the picture-in-picture feature on your television.

The Game Plan

In this example, we are going to create a graph in the upper right corner of the axis area of a larger graph. When we define the GTL, we always start with the same GTL wrapper, as is shown below. In the wrapper below, INSET is the name of the GTL definition:

proc template;
define statgraph inset;
begingraph;
 
/* insert the code that produces the graphics output */
 
endgraph;
end;
run;

For demonstration purposes, we are going to use the SAS data set Sashelp.Heart and we are going to plot the variable CHOLESTEROL. The ENTRYTITLE statement defines the title for the graph. This statement is valid within the BEGINGRAPH block or after the last ENDLAYOUT statement. The plotting statements are contained within a LAYOUT block. In our example, we have enclosed the HISTOGRAM and DENSITYPLOT plotting statements inside a LAYOUT OVERLAY block. A standard axis is displayed with the BINAXIS=FALSE option in the HISTOGRAM statement. In the PROC SGRENDER statement, we point to the template definition, INSET, using the TEMPLATE option.

proc template;
define statgraph inset;
begingraph;
entrytitle 'Framingham Heart Study';
   layout overlay;
     histogram cholesterol /  binaxis=false datatransparency=0.5;
     densityplot cholesterol /  datatransparency=0.5;
   endlayout;
endgraph;
end;
run;
 
proc sgrender data=sashelp.heart template=inset;
run;

The results are shown in Figure 1.

SAS ODS Graphics

Figure 1

Special Play

Once we have produced the graph in Figure 1, we can see that we have room to display a second graph in it, in the upper right corner of the axis area. We can insert the graph inside the axis by placing the plotting statements inside of a LAYOUT OVERLAY block within a LAYOUT GRIDDED block.  Here are the details.

In the following LAYOUT GRIDDED statement, which is located after the DENSITYPLOT statement, we define the size of the graph using the options WIDTH=300px and HEIGHT=200px. And the options HALIGN=RIGHT and VALIGN=TOP place the graph in the top right corner.

layout gridded / width=300px height=200px halign=right valign=top;

The inset graph contains two regression lines, one for Cholesterol by Diastolic, which is a dashed blue line. The solid green regression line represents Cholesterol by Systolic. The NAME option is added to each of the REGRESSIONPLOT statements in order to produce a legend. The DISCRETELEGEND specifies that the legend be drawn for both SYSTOLIC and DIASTOLIC with the NAME option. The NAME values are case sensitive. I specify the Y axis label with the LABEL option within the YAXISOPTS option in the LAYOUT OVERLAY statement. I also specify THRESHOLDMAX=1 within the LINEAROPTS option within both YAXISOPTS and XAXISOPTS to ensure that the last tick mark value includes the highest value in the data.

I added a red fill pattern of L3 to the larger graph to make it stand out more.  In addition to specifying the fill pattern with the PATTERN option within the FILLPATTERNATTRS option, you must also specify FILLPATTERN within the DISPLAY option. Valid values for the PATTERN option within the FILLPATTERNATTRS option are L1-L5, R1-R5, and X1-X5.

Note:  The FILLPATTERN option and DISPLAY option FILLPATTERN are available beginning in SAS® 9.4 TS1M1. If you are running an older version of SAS, you need to remove this syntax from the program shown below.

proc template;
define statgraph inset;
begingraph;
entrytitle 'Framingham Heart Study';
  layout overlay;
     histogram cholesterol /  binaxis=false datatransparency=0.5 
                   display=(fillpattern outline fill)       fillattrs=(color=lightred) 
                   fillpatternattrs=(pattern=l3 color=red);
     densityplot cholesterol /  datatransparency=0.5 lineattrs=(color=darkred);
  layout gridded / width=300px height=200px halign=right valign=top;
  layout overlay / yaxisopts=(label='Blood Pressure' linearopts=(thresholdmax=1))
                   xaxisopts=(linearopts=(thresholdmax=1));
     regressionplot x=cholesterol y=diastolic / lineattrs=(color=blue pattern=2) name='Diastolic';
     regressionplot x=cholesterol y=systolic / lineattrs=(color=green) name='Systolic';
     discretelegend 'Diastolic' 'Systolic' / across=2; 
  endlayout;
  endlayout;
  endlayout;
endgraph;
end;
run;
 
proc sgrender data=sashelp.heart template=inset; 
run;

The results are displayed in Figure 2.

sas-ods-graphics02

Figure 2

Now you have the tools you need to display a graph within a graph. For other tips on creating graphs with ODS Graphics and the SG procedures, check out Sanjay Matange’s blog series, Graphically Speaking, at http://blogs.sas.com/content/graphicallyspeaking/.

tags: Graph Template Language (GTL), SAS ODS, SAS Problem Solvers

Picture-in-Picture - It’s Not Just for Television Anymore was published on SAS Users.

November 15, 2016
 

If your SAS Visual Analytics report requirements include linking out to separate reports without the need to pass values, you may want to consider using images to enhance the appearance of your base report. Here are three style examples using images that you can use depending on your report design requirements and report user preference:

1.     Visually appealing
2.     Generic
3.     Screenshot of actual report.

There is no better substitute for looking at examples so here are some screenshots for you:

1.     Visually appealing

Use images in SAS Visual Analytics

2.     Generic

use-images-in-sas-visual-analytics02

3.     Screenshot of actual report

use-images-in-sas-visual-analytics03

Image selection

Using an image in your report has never been easier. You can navigate your local machine for the image and, if you want, you can also save the image in metadata. This allows other users with access to that metadata location the ability to use the same image. This is great when you want to impose consistency throughout your reports.

use-images-in-sas-visual-analytics04

use-images-in-sas-visual-analytics05

Setting link using image

To define a report link from your image, click on your image then open the Interactions tab. Then use the New drop-down menu and select the type of link you wish to define. For a Report Link or Section Link, use the Browse button to navigate the metadata and select your target. If you are linking to an External URL then enter the fully qualified URL or you can define a link to a stored process.

use-images-in-sas-visual-analytics06

Here is a breakdown of the report objects used in the main dashboard report. I also included the screenshots of my Daily, Weekly, and Monthly report examples.

Dashboard example breakdown

use-images-in-sas-visual-analytics07

Daily report example

use-images-in-sas-visual-analytics08

Weekly report example

use-images-in-sas-visual-analytics09

Monthly report example

use-images-in-sas-visual-analytics10

 

 

tags: SAS Professional Services, SAS Visual Analytics

Use images in SAS Visual Analytics to enhance your report link was published on SAS Users.

November 10, 2016
 

We all have challenges in getting an accurate and consistent view of our customers across multiple applications or sources of customer information. Suggestion-based matching is a technique found in SAS Data Quality to improve matching results for data that has arbitrary typos and incorrect spellings in it. The suggestion-based concept and benefits were described in a previous blog post. In this post, I will expand on the topic and show how to build a data job that uses suggestion-based matching in DataFlux Data Management Studio, the key component of SAS Data Quality and other SAS Data Management offerings. This article takes a simple example job to illustrate the steps needed to configure suggestion-based matching for person names.

In DataFlux Data Management Studio I first configure a Job Specific Data node to define the columns and example records that I’d like to feed into the matching process. In this example, I use a two column data table made up of Rec_ID and a Name column and sample records as shown below.

suggestion-based-matching-in-sas-data-quality

To build the suggestion-based matching feature, I have to insert and configure at least a Create Match Codes node, a Clustering Node and a Cluster Aggregation node in the data job.

suggestion-based-matching-in-sas-data-quality02

This example uses names with randomly injected typographical errors, like missing characters, additional characters and character transpositions. Please note, these types of data errors would not be matched correctly using the standard matching technique in SAS Data Quality, therefore I am leveraging the suggestion-based matching feature. The picture below shows the person names and highlights the injected errors for Ethan Baker.

suggestion-based-matching-in-sas-data-quality03

For suggestion-based matching in SAS Data Quality I need to use a match definition that supports suggestions. The current Quality Knowledge Base (QKB) for Customer Information ships with a ready to use match definition called “Name (with suggestions).” For other data types, I can easily create new definitions or expand existing definitions to support the suggestions feature as well.

To continue building my suggestion-based matching job I next need to configure the Create Match Codes node as shown in the picture below.

suggestion-based-matching-in-sas-data-quality045

In the property screen of the Create Match Codes node I select the Locale that matches my input data and pick the “Name (with Suggestions)” definition.

Next, I check Allow generation of multiple match codes per definition for each sensitivity in the property window. This enables the data job node to generate suggestions and also create an additional Match Score field as output.

suggestion-based-matching-in-sas-data-quality05

Looking at the output of the Match Codes node, we can see that we generate multiple different match codes (suggestions), and match scores for a single input (Ethan Baker). Because I selected Allow generation of multiple match codes per definition for each sensitivity, the Create Match Code node generates a match code representing the input name, plus additional match codes (suggestions) with character deletions, insertions, replacements and transpositions applied to the input name. The score value, in the Name_Match Code Score column, is an indicator for the closeness of the generated suggestion to the original input data. The more change operations are used to create the suggestion, the lower the score value. Therefore the lower the score the less likely it is that the suggestion is the true name.

Next in configuring my suggestion-based matching job is the Clustering node.

suggestion-based-matching-in-sas-data-quality06

No specific configuration is needed in the Clustering node when using it with suggestion-based matching. In my example, I only set the Name_Match Code field as the match condition and pass all input fields as output (in Additional Outputs).

suggestion-based-matching-in-sas-data-quality07

Previewing the output of the Clustering node, I can already see that the misspelled names of “Ethan Baker” are clustered correctly. But because I generated multiple suggestions for each input record, I end up with multiple clusters holding the same input records. Ethan Baker, Ethn Baker and Epthan Baker and their suggestions are assigned to clusters 0 to 7 and would also appear in single-row clusters further down the output list. It is OK at this step of the data job to have two or more clusters containing the same set of input records when using suggestion-based matching. The next node in the data job resolves this and uses the match score to determine the single best cluster.

In the properties of the Cluster Aggregation node, I set the Cluster ID, Primary Key and Score fields (which were outputs of the previous Clustering node). To determine the single best cluster, I select Cluster as the scoring method and Highest Mean as the scoring algorithm. The Cluster Aggregation node computes the mean score within each cluster. By checking Remove subclusters, I make sure only the cluster with the highest mean is output.

suggestion-based-matching-in-sas-data-quality08

With the Cluster Aggregation node configured the output looks like this:

suggestion-based-matching-in-sas-data-quality09

The final output of the Cluster Aggregation is reduced to the eight input records only. With the described set-up I successfully matched names that contain typographical errors like additional or missing characters.

As you can see above, the accuracy of your matching rules, and ultimately, your understanding of your customers, can be augmented through use of suggestion-based matching.

For more information, please refer to the product documentation:

tags: data management, DataFlux Data Management Studio, SAS Data Quality

Using suggestion-based matching in SAS Data Quality was published on SAS Users.

November 2, 2016
 

With DataFlux Data Management 2.7, the major component of SAS Data Quality and other SAS Data Management solutions, every job gets a REST API automatically created once it is moved to the Data Management Server. This is a great feature that enables us to easily call Data Management jobs from programming languages like Python. We can then involve the Quality Knowledge Base (QKB), a pre-built set of data quality rules, and do other data quality work that is impossible or challenging to do when using only Python.

calling-sas-data-quality-jobs-from-python

In order to make a RESTful call from Python, we first need to get the REST API information for our Data Management job. The best way to get this information is to go to the Data Management Server in your browser, where you’ll find respective links for:

  • Batch Jobs
  • Real-Time Data Jobs
  • Real-Time Process Jobs.

From here you can drill through to your job REST API.

Alternatively, you can use a “shortcut” to get the information by calling the job’s REST API metadata URL directly. The URL looks like this:

http://<DM Server>:<port>/<job type>/rest/jobFlowDefns/<job id>/metadata

calling-sas-data-quality-jobs-from-python02

The <job id> is simply the job name (with subdirectory and extension) Base64 encoded. This is a common method to avoid issues with illegal URL characters like: # % & * { } : < > ? / + or space. You can go to this website to Base64 encode your job name.
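Since we are already working in Python, we can also compute this encoding directly instead of using a website. Here is a minimal sketch using only the Python standard library; the job name is the one from the example that follows.

import base64

# Job name with subdirectory and extension, as deployed on the Data Management Server
job_name = "Demo/ParseAddress.ddf"

# Base64 encode the name to build the <job id> part of the REST URL
job_id = base64.b64encode(job_name.encode("utf-8")).decode("ascii")
print(job_id)   # prints: RGVtby9QYXJzZUFkZHJlc3MuZGRm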

If you have many jobs on the Data Management Server it might be quicker to use the “shortcut” instead of drilling through from the top.

Here an example to get the REST API information for the Data Management job “ParseAddress.ddf” which is in the subdirectory Demo of Real-Time Data Services on DM Server:

calling-sas-data-quality-jobs-from-python03

We Base64 encode the job name “Demo/ParseAddress.ddf” using the website mentioned above…

calling-sas-data-quality-jobs-from-python04

…and call the URL for the job’s REST API metadata:

http://DMServer:21036/SASDataMgmtRTDataJob/rest/jobFlowDefns/RGVtby9QYXJzZUFkZHJlc3MuZGRm/metadata

calling-sas-data-quality-jobs-from-python05

From here we collect the following information:

The REST API URL and Content-Type information…
calling-sas-data-quality-jobs-from-python06

…the JSON structure for input data

calling-sas-data-quality-jobs-from-python07

…which we need in this format when calling the Data Management job from Python:

{"inputs" : {"dataTable" : {"data" : [[ "sample string" ],[ "another string" ]], "metadata" : [{"maxChars" : 255, "name" : "Address", "type" : "string"}]}}}

…and the JSON structure for the data returned by the Data Management job.

calling-sas-data-quality-jobs-from-python08

When you have this information, the Python code to call the Data Management job would look like this:

calling-sas-data-quality-jobs-from-python09
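Since the code above is shown as a screenshot, here is a minimal sketch of what such a call can look like, built with the “requests” module around the ParseAddress example. The endpoint URL, the application/json content type, the sample address string and the absence of authentication are assumptions for illustration; take the exact URL, headers and credentials from the REST API metadata of your own job.

import requests

# Endpoint for the real-time data job (illustrative; take the exact URL from the job's REST metadata)
url = ("http://DMServer:21036/SASDataMgmtRTDataJob/rest/jobFlowDefns/"
       "RGVtby9QYXJzZUFkZHJlc3MuZGRm")

# Input payload following the JSON structure from the job's REST metadata
payload = {
    "inputs": {
        "dataTable": {
            "data": [["100 SAS Campus Drive, Cary, NC 27513"]],   # sample address row (hypothetical)
            "metadata": [{"maxChars": 255, "name": "Address", "type": "string"}]
        }
    }
}

# Content type as listed in the REST metadata (application/json assumed here)
headers = {"Content-Type": "application/json", "Accept": "application/json"}

# Call the Data Management job; add auth=(user, password) if your server requires credentials
data_raw = requests.post(url, json=payload, headers=headers)
data_raw.raise_for_status()

# Decode the JSON response into a dictionary and access the output rows
data_out = data_raw.json()
for row in data_out["outputs"]["dataTable"]["data"]:
    print(row)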

The output data from the Data Management job will be in data_raw. We call the built-in JSON decoder from the “requests” module to move the output into a dictionary (data_out), from which we can access the data. The structure of the dictionary follows the REST metadata. We can access the relevant output data via data_out[‘outputs’][‘dataTable’][‘data’].

calling-sas-data-quality-jobs-from-python10

calling-sas-data-quality-jobs-from-python11

The Python program will produce an output like this…

calling-sas-data-quality-jobs-from-python12

You can find more information about the DataFlux Data Management REST API here.

Calling Data Management jobs from Python is straightforward and a convenient way to augment your Python code with the more robust set of data quality rules and capabilities found in the SAS Data Quality solution.

Learn more about SAS Data Quality.

tags: data management, DataFlux Data Management Studio, open source, REST API

Calling SAS Data Quality jobs from Python was published on SAS Users.