Helmut Plinke

12月 072017
 

The primary obstacle to becoming a data-driven business is that data is not readily available, leaving valuable insights unused in data silos. To overcome this hurdle, today’s companies are creating a new role: Chief Data Officers (CDO). Responsible for unlocking insights hidden in data silos, the CDO is tasked with [...]

Are you ready for the Chief Data Officer challenge? was published on SAS Voices by Helmut Plinke

7月 312017
 

When we talk about consent management for the EU’s General Data Protection Regulation (GDPR), one of the key considerations is “consent for a purpose.” It might have been sufficient in the past to provide a form with a single generic consent check box and store the fact that consent was [...]

Addressing consent management in GDPR was published on SAS Voices by Helmut Plinke

4月 262017
 

Oracle databases from SAS ViyaSAS Data Connector to Oracle lets you easily load data from your Oracle DB into SAS Viya for advanced analytics. SAS/ACCESS Interface to Oracle (on SAS Viya) provides the required SAS Data Connector to Oracle, that should be deployed to your CAS environment. Once the below described configuration steps for SAS Data Connector to Oracle are completed, SAS Studio or Visual Data Builder will be able to directly access Oracle tables and load them into the CAS engine.

SAS Data Connector to Oracle requires Oracle client components (release 12c or later) to be installed and configured and configurations deployed to your CAS server. Here is a guide that walks you through the process of installing the Oracle client on a Linux server and configuring SAS Data Connector to Oracle on SAS Viya 3.1 (or 3.2):

Step 1: Get the Oracle Instant client software (release 12.c or later)

To find the oracle client software package open a web browser and navigate to Oracle support at:
http://www.oracle.com/technetwork/topics/linuxx86-64soft-092277.html

Download the following two packages to your CAS controller server:

  • oracle-instantclient12.1-basic-12.1.0.2.0-1.x86_64.rpm
  • oracle-instantclient12.1-sqlplus-12.1.0.2.0-1.x86_64.rpm (optional for testing only)

Step 2: Install and configure Oracle instant client

On your CAS controller server, execute the following commands to install the Oracle instant client and SQLPlus utilities.

rpm –ivh oracle-instantclient12.1-basic-12.1.0.2.0-1.x86_64.rpm
rpm –ivh oracle-instantclient12.1-sqlplus-12.1.0.2.0-1.x86_64.rpm

The Oracle client should be installed to /usr/lib/oracle/12.1/client64.
To configure the Oracle client, create a file called tnsnames.ora, for example, in the /etc/ directory. Paste the following lines with the appropriate connection parameters of your Oracle DB into the tnsnames.ora file. (Replace "your_tnsname", "your_oracle_host", "your_oracle_port" and "your_oracle_db_service_name" with parameters according to your Oracle DB implementation)

your_tnsname =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = your_oracle_host)(PORT = your_oracle_port ))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME =your_oracle_db_service_name)
    )
  )

Next you need to set environment variables for the Oracle client to work:

LD_LIBRARY_PATH: Specifies the directory of your Oracle instant client libs
PATH: Must include your Oracle instant client bin directory
TNS_ADMIN: Directory of your tnsnames.ora file
ORACLE_HOME: Location of your Oracle instant client installation

In a console window on your CAS controller Linux server, issue the following commands to set environment variables for the Oracle client: (replace the directory locations if needed)

export LD_LIBRARY_PATH=/usr/lib/oracle/12.1/client64/lib:$LD_LIBRARY_PATH
export PATH=/usr/lib/oracle/12.1/client64/bin:$PATH 
export TNS_ADMIN=/etc 
export ORACLE_HOME=/usr/lib/oracle/12.1/client64

If you installed the SQLPlus package from Oracle, you can test connectivity with the following command: (replace "your_oracle_user" and "your_tnsname" with a valid oracle user and the tnsname configured previously)

sqlplus your_oracle_user@your_tnsname

When prompted for a password, use your Oracle DB password to log on.
You should see a “SQL>” prompt and you can issue SQL queries against the Oracle DB tables to verify your DB connection. This test indicates if the Oracle instant client is successfully installed and configured.

Step 3: Configure SAS Data Connector to Oracle on SAS Viya

Next you need to configure the CAS server to use the instant client.

The recommended way is to edit the vars.yml file in your Ansible playbook and deploy the required changes to your SAS Viya cluster.

Locate the vars.yml file on your cluster deployment and change the CAS_SETTINGS section to reflect the correct environment variables needed for CAS to use the Oracle instant client:
To do so, uncomment the lines for ORACLE_HOME and LD_LIBRARY_PATH and insert the respective path for your Oracle instant client installation as shown below.

CAS Specific ####
# Anything in this list will end up in the cas.settings file
CAS_SETTINGS:
1: ORACLE_HOME=/usr/lib/oracle/12.1/client64
#3: ODBCHOME=ODBC home directory
#4: JAVA_HOME=/usr/lib/jvm/jre-1.8.0
5:LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$ORACLE_HOME/lib

Run the ansible-playbook to deploy the changes to your CAS server.
After ansible finished the update, your cas.settings file should contain the following lines:

export ORACLE_HOME=/usr/lib/oracle/12.1/client64
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$ORACLE_HOME/lib

Now you are ready to use SAS/ACCESS Interface to Oracle in SAS Viya.

Step 4: Test SAS/ACCESS Interface to Oracle in SAS Studio

Log on to SAS Studio to load data from your Oracle DB into CAS.
Execute the following SAS Code example in SAS Studio to connect to your Oracle DB and load data into CAS. Change the parameters starting with "your_" in the SAS code below according to your Oracle DB implementation.

/************************************************************************/
/*  Start a CAS session   */
/************************************************************************/
cas;
/************************************************************************/
/*  Create a Caslib for the Oracle connection   */
/************************************************************************/
caslib ORACLE datasource=(                                           
    srctype="oracle",
    uid="your_oracle_user_ID",
    pwd="your_oracle_password",
    path="//your_db_hostname:your_db_port/your_oracle_service_name",
    schema="your_oracle_schema_name" );
 
/************************************************************************/
/*  Load a table from Oracle into CAS   */
/************************************************************************/
proc casutil;
   list files incaslib="ORACLE"; 
   load casdata="your_oracle_table_name" incaslib="ORACLE" outcaslib="casuser" casout="your_cas_table_name";                   
   list tables incaslib="casuser";
quit;
/************************************************************************/
/*  Assign caslib to SAS Studio   */
/************************************************************************/
 caslib _all_ assign;

The previous example is a simple SAS program to test access to Oracle and load data from an Oracle table into CAS memory. As a result, the program loads a table in your CAS library with data from your Oracle database.

How to configure Oracle client for successful access to Oracle databases from SAS Viya was published on SAS Users.

12月 052016
 

Data quality initiatives challenge organizations because the discipline encompasses so many issues, approaches and tools. Across the board, there are four main activity areas – or pillars – that underlie any successful data quality initiative. Let’s look at what each pillar means, then consider the benefits SAS Data Management brings […]

The post How SAS supports the four pillars of a data quality initiative appeared first on The Data Roundtable.

11月 102016
 

We all have challenges in getting an accurate and consistent view of our customers across multiple applications or sources of customer information. Suggestion-based matching is a technique found in SAS Data Quality to improve matching results for data that has arbitrary typos and incorrect spellings in it. The suggestion-based concept and benefits were described in a previous blog post. In this post, I will expand on the topic and show how to build a data job that uses suggestion-based matching in DataFlux Data Management Studio, the key component of SAS Data Quality and other SAS Data Management offerings. This article takes a simple example job to illustrate the steps needed to configure suggestion-based matching for person names.

In DataFlux Data Management Studio I first configure a Job Specific Data node to define the columns and example records that I’d like to feed into the matching process. In this example, I use a two column data table made up of Rec_ID and a Name column and sample records as shown below.

suggestion-based-matching-in-sas-data-quality

To build the suggestion-based matching feature, I have to insert and configure at least a Create Match Codes node, a Clustering Node and a Cluster Aggregation node in the data job.

suggestion-based-matching-in-sas-data-quality02

This example uses names with randomly injected typographical errors, like missing characters, additional characters and character transpositions. Please note, these types of data errors would not be matched correctly using the standard matching technique in SAS Data Quality, therefore I am leveraging the suggestion-based matching feature. The picture below shows the person names and highlights the injected errors for Ethan Baker.

suggestion-based-matching-in-sas-data-quality03

For suggestion-based matching in SAS Data Quality I need to use a match definition that supports suggestions. The current Quality Knowledge Base (QKB) for Customer Information ships with a ready to use match definition called “Name (with suggestions).” For other data types, I can easily create new definitions or expand existing definitions to support the suggestions feature as well.

To continue building my suggestion-based matching job I next need to configure the Create Match Codes node as shown in the picture below.

suggestion-based-matching-in-sas-data-quality045

In the property screen of the Create Match Codes node I select the Locale that matches my input data and pick the “Name (with Suggestions)” definition.

Next, I check Allow generation of multiple match codes per definition for each sensitivity in the property window. This enables the data job node to generate suggestions and also create an additional Match Score field as output.

suggestion-based-matching-in-sas-data-quality05

Looking at the output of the Match Codes node, we can see that we generate multiple different match codes (suggestions), and match scores for a single input (Ethan Baker). Because I selected Allow generation of multiple match codes per definition for each sensitivity, the Create Match Code node generates a match code representing the input name, plus additional match codes (suggestions) with character deletions, insertions, replacements and transpositions applied to the input name. The score value, in the Name_Match Code Score column, is an indicator for the closeness of the generated suggestion to the original input data. The more change operations are used to create the suggestion, the lower the score value. Therefore the lower the score the less likely it is that the suggestion is the true name.

Next in configuring my suggestion-based matching job is the Clustering node.

suggestion-based-matching-in-sas-data-quality06

There are no specific configuration needed in the Clustering node when using it with suggestion-based matching. In my example I only set the Name_Match Code field as the match condition and I pass all input fields as output (in Additional Outputs).

suggestion-based-matching-in-sas-data-quality07

Previewing the output of the cluster node I can already see the misspelled names of “Ethan Baker” are clustered correctly. But because I generated multiple suggestions for each input record, I end up with multiple clusters holding the same input records. Ethan Baker, Ethn Baker and Epthan Baker and its suggestions are assigned to cluster 0 to 7 and would also appear in single row clusters further down the output list. It is ok at this step of the data job to have two or more clusters containing the same set of input records when using suggestion-based matching. The next node in the data job will resolve the issue and use the match score to determine the single best cluster.

In the properties of the Cluster Aggregation node I set Cluster ID, Primary Key and Score fields (which were outputs of the previous Cluster node). In order to determine the single best cluster, I select the Cluster as a scoring method and Highest Mean as scoring algorithm. The Cluster Aggregation node will compute the mean value in each cluster. By checking Remove subclusters, I make sure only the cluster with the highest mean is outputted.

suggestion-based-matching-in-sas-data-quality08

With the Cluster Aggregation node configured the output looks like this:

suggestion-based-matching-in-sas-data-quality09

The final output of the Cluster Aggregation is reduced to the eight input records only. With the described set-up I successfully matched names that contain typographical errors like additional or missing characters.

As you can see above, the accuracy of your matching rules, and ultimately, your understanding of your customers, can be augmented through use of suggestion-based matching.

For more information, please refer to the product documentation:

tags: data management, DataFlux Data Management Studio, SAS Data Quality

Using suggestion-based matching in SAS Data Quality was published on SAS Users.

8月 302016
 

Suggestion based matching in SAS Data QualityHave you ever had problems matching data that has typographical errors in it? Because of the nature of arbitrary typos and incorrect spelled words a specific matching technique is required to tackle those cases. SAS Data Quality, with its traditional, in nature deterministic matching approach is by nature not best suited for correctly matching typos such as character transpositions and missing or additional characters in words. But SAS provides a feature called suggestion based matching in SAS Data Quality especially designed for matching data with typos. Suggestion based matching provides a more probabilistic alike way towards matching. With suggestion based matching, SAS Data Quality will output multiple matchcodes based on alternative “suggestions” for a data field. Each suggestion also includes a score that reflects the closeness of the suggestion to input word.

Let's dive a little deeper.

The concept of SAS Data Quality suggestion based matching

Suggestion based matching in SAS Data Quality02Prerequisite for suggestion based matching in SAS Data Quality is a dictionary of known words along with a frequency count for each word. The matching engine will generate “suggestions” for potentially misspelled input values from the known words dictionary. The generated suggestions are made of the data input, but with spelling errors like character deletions, insertions, replacements, and transpositions. Whitespace insertion, casing, and context-dependent pronunciation can also be taken into account. For each suggestion a score, that reflects the closeness of it to the input, is calculated. The higher the match score for a suggestion, the more likely it is the true entity.

The known words dictionary is generated with the idea that the number of correctly spelled entity names are more frequent and therefore outnumber the misspelled ones. This is an important aspect of the match score calculation for the suggestions. The matching engine will generate suggestions by taking the input value and “inject” character transpositions, replacements and other typos and compare it against the known words dictionary entries. During the whole process the input value is seen as the potentially corrupted version of the true entity. By making various character based alterations to the input data, the matching engine tries to find possible candidate entities in the known words dictionary. If one of the suggestions matches a word of the known words dictionary the engine calculates a match score based on the frequency count and the changes required to create the suggestion. This concept potentially results in a list of possible matches including the true entity and other misspelled or “close enough” words identified as Suggestion based matching in SAS Data Quality03potential matches. The resulting match score can finally help to resolve the true entity.

Suggestion based matching involves more compute resource and therefore will slow down data throughput of the matching process. Still, it is a proven approach to provide match results for input data that contains character level data quality issues. To minimize performance implication, suggestion based matching is best used as a second iteration for input data that could not be matched using the standard matchcode method.

Suggestion based matching in SAS Data Quality04

tags: data management, SAS Data Quality, suggestion based matching

Improve matching results with suggestion based matching in SAS Data Quality was published on SAS Users.

8月 302016
 

Suggestion based matching in SAS Data QualityHave you ever had problems matching data that has typographical errors in it? Because of the nature of arbitrary typos and incorrect spelled words a specific matching technique is required to tackle those cases. SAS Data Quality, with its traditional, in nature deterministic matching approach is by nature not best suited for correctly matching typos such as character transpositions and missing or additional characters in words. But SAS provides a feature called suggestion based matching in SAS Data Quality especially designed for matching data with typos. Suggestion based matching provides a more probabilistic alike way towards matching. With suggestion based matching, SAS Data Quality will output multiple matchcodes based on alternative “suggestions” for a data field. Each suggestion also includes a score that reflects the closeness of the suggestion to input word.

Let's dive a little deeper.

The concept of SAS Data Quality suggestion based matching

Suggestion based matching in SAS Data Quality02Prerequisite for suggestion based matching in SAS Data Quality is a dictionary of known words along with a frequency count for each word. The matching engine will generate “suggestions” for potentially misspelled input values from the known words dictionary. The generated suggestions are made of the data input, but with spelling errors like character deletions, insertions, replacements, and transpositions. Whitespace insertion, casing, and context-dependent pronunciation can also be taken into account. For each suggestion a score, that reflects the closeness of it to the input, is calculated. The higher the match score for a suggestion, the more likely it is the true entity.

The known words dictionary is generated with the idea that the number of correctly spelled entity names are more frequent and therefore outnumber the misspelled ones. This is an important aspect of the match score calculation for the suggestions. The matching engine will generate suggestions by taking the input value and “inject” character transpositions, replacements and other typos and compare it against the known words dictionary entries. During the whole process the input value is seen as the potentially corrupted version of the true entity. By making various character based alterations to the input data, the matching engine tries to find possible candidate entities in the known words dictionary. If one of the suggestions matches a word of the known words dictionary the engine calculates a match score based on the frequency count and the changes required to create the suggestion. This concept potentially results in a list of possible matches including the true entity and other misspelled or “close enough” words identified as Suggestion based matching in SAS Data Quality03potential matches. The resulting match score can finally help to resolve the true entity.

Suggestion based matching involves more compute resource and therefore will slow down data throughput of the matching process. Still, it is a proven approach to provide match results for input data that contains character level data quality issues. To minimize performance implication, suggestion based matching is best used as a second iteration for input data that could not be matched using the standard matchcode method.

Suggestion based matching in SAS Data Quality04

tags: data management, SAS Data Quality, suggestion based matching

Improve matching results with suggestion based matching in SAS Data Quality was published on SAS Users.

12月 172015
 

In 2014, big data was on everyone’s mind. So in 2015, I expected to see data quality initiatives make a major shift toward big data. But I was surprised by a completely new requirement for data quality, which proves that the world is not all about big data – not […]

The post Regulatory requirements drove new data quality projects in 2015 appeared first on The Data Roundtable.

1月 232015
 
Utilizing big data analytics is currently one of the most promising strategies for businesses to gain competitive advantage and ensure future growth. But as we saw with “small data analytics,” the success of “big data analytics” relies heavily on the quality of its source data. In fact, when combining “small” and “big” data […]