
December 5, 2017
 

With SAS Data Management, you can set up SAS Data Remediation to manage and correct data issues. SAS Data Remediation allows user- or role-based access to data exceptions.

Last time I talked about how to register and use a Data Remediation service. This time we will look at how to use SAS Workflow together with SAS Data Remediation. SAS Workflow comes as part of SAS Data Remediation, and you can use it to control the flow of a remediation issue to ensure that the right people are involved at the appropriate steps.

SAS Workflow is very powerful and offers a lot of features to process data. You can, for example, design workflows to:

  • Process tasks (workflow steps) in parallel.
  • Follow different routes through a workflow depending on certain conditions.
  • Set a timer to automatically route to a different user if an assigned user isn’t available for a certain time.
  • Call Web Services (REST/SOAP) to perform some work automatically.
  • Send email notifications if a certain workflow step is reached.
    …and more.

Invoke an external URL via a workflow

In this article, we are going to look at a workflow feature which is specific to SAS Data Remediation.

With the latest version of Data Remediation there is a new feature that allows you to put a URL behind a workflow button and invoke it by clicking on the button in the Remediation UI. In the following we are going to take a closer look at this new feature.

When you have registered a Data Remediation service, you can design a workflow using SAS Workflow Studio and link it to the Remediation service. Workflow Studio comes with some sample workflows. It's a good idea to take one of these sample workflows as a starting point and add additional steps to it as desired.

Workflow Task

In a workflow, you have tasks. A task is a single step of a workflow and is associated with data objects, statuses, policies, and participants.

  • Data Objects are fields to store values.
  • A status transitions the workflow from one task to the next.
  • Policies are rules that are executed at a certain task event; for example, calling a Web Service at the beginning of a task or sending an email at the end of a task.
  • Participants are users or groups who can execute a task; i.e., if you are not a user assigned to a task, you can't open an issue in Data Remediation to work on it.

When you add a new task to a workflow you must connect it to another task using a status. In the Data Remediation UI, a status will show up as a Workflow Button to transition from one task to another.

Assigning a URL to Status

You can also use a status on a task to call a URL instead of transitioning to the next task. To do this, you add a status to the task but don't use it to connect to another task.

At the Open task, four statuses are assigned, but only Associate, Cancel, and Reject connect to other tasks. The Review status is not connected, so it can be used to call a URL.

Right-click the Review status and choose Edit to open a dialog box with an Attributes… button. Here, you need to add a key attribute named URL. The value of URL is the fully qualified URL to be called:

http://MyServer/showAddtionalInformation/?recordid={Field1}

The URL can take parameters in curly brackets (e.g., {Field1}) that point to the task's data objects. When the URL is called, the parameters are substituted with the corresponding data object values. This way, the URL call can be dynamic.
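Outside of Data Remediation, this substitution behaves like simple string templating. As a rough illustration only (the record ID value here is made up):

```shell
# Hypothetical example: replace the {Field1} placeholder with a data object value
url_template='http://MyServer/showAddtionalInformation/?recordid={Field1}'
Field1='REC-001042'
resolved=$(printf '%s' "$url_template" | sed "s/{Field1}/$Field1/")
echo "$resolved"
```

The Remediation UI performs this substitution for you at the moment the workflow button is clicked.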

Dynamic URL calls via parameters

When you create a Data Remediation issue you can pass up to three values from the issue record to Remediation. For example, the issue record key. When the workflow is called by Data Remediation these values are copied from Data Remediation to the workflow’s top-level fields: Field1, Field2, Field3.

To use these values in the workflow and pass them to a URL, you need to create an appropriate data object (Field1) at the task where you want to call the URL. You also need to add a policy to the task that copies the value from the top-level data object to the task data object.

This will make the value available at the task and you can use it as a parameter in the URL call as shown above.

Link workflow to Data Remediation Service

When you have finished the workflow, you can save it and upload it to the workflow server.

Once it is uploaded, you can link the workflow to the Remediation service. In Data Remediation Administration, open the remediation service and go to Issue Types. Workflows are linked to issue types. They are not compulsory, but you can link one or more workflows to an issue type, depending on your business requirements.

At Issue Types, under Templates, select your workflow and link it to an issue type. You can make workflows mandatory for certain issue types by ticking the check box “Always require a task template…” In this case, Data Remediation expects one of the nominated workflows to be used when creating a remediation issue for the appropriate issue type.

Creating a remediation issue

You can now create a remediation issue for the appropriate issue type and assign a workflow to it by submitting the workflow name via the "workflowName" field in the JSON structure when creating the remediation issue. See “Improving data quality through SAS Data Remediation” for more about creating Data Remediation issues.
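As a rough sketch, the request body might look like the following; apart from "workflowName", the field names and values here are illustrative only, so check the Data Remediation REST documentation for the exact structure:

```json
{
  "application": "DataMgmt",
  "subject": "CUSTOMER",
  "name": "Phone number format issue",
  "workflowName": "Open_Review_Workflow"
}
```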

Call a URL via workflow button

When the remediation issue is created, you can open it in SAS Data Remediation, where you will see the Review button. When you click the Review button, the URL that you assigned to the URL key attribute for the Review status is called.

By using workflows with SAS Data Remediation, you can better control the process of addressing data issues. Being able to put a URL behind a workflow button and invoke it will enhance your capabilities for solving issues and improving data quality.

Process and control Data Remediation issues using SAS Workflow was published on SAS Users.

November 29, 2017
 

The CAS procedure (PROC CAS) enables us to interact with SAS Cloud Analytic Services (CAS) from the SAS client using CASL, the scripting language of CAS. CASL supports a variety of data types, including INT, DOUBLE, STRING, TABLE, LIST, BLOB, and others. The result of a CAS action can be any of these data types. In this article, we will use the CAS procedure to explore techniques for interacting with the results produced by CAS actions.

PROC CAS enables us to run CAS actions in SAS Viya; CAS actions are at the heart of how the CAS server receives requests, performs the relevant tasks and returns results back to the user. CAS actions representing common functionality constitute CAS action sets.  Consider the CAS action set TABLE. Within the TABLE action set, we will currently find 24 different actions relevant to various tasks that can be performed on a table. For example, the COLUMNINFO action within the TABLE action set will inform the user of the contents of a CAS in-memory table, including column names, lengths, data types, and so on. Let’s take a look at the following code to understand how this works:

proc cas;
	table.columnInfo  / table='HMEQ';
	simple.summary    / table={name='HMEQ' groupby='BAD'};
run;

In the code above, the table called ‘HMEQ’ is a distributed in-memory CAS table that contains data about applicants who were granted credit for a certain home equity loan. The categorical binary-valued variable ‘BAD’ identifies a client who has either defaulted or repaid their home equity loan. Since PROC CAS is an interactive procedure, we can invoke multiple statements within a single run block. In the code above we have executed the COLUMNINFO action from the TABLE actionset and the SUMMARY action from the SIMPLE action set within the same run block. Notice that we are able to obtain a summary statistic of all the interval variables in the dataset grouped by the binary variable ‘BAD’ without having to first sort the data by that variable. Below is a snapshot of the resulting output.


Fig1: Output from table.columninfo

Fig 2: Output from executing the summary action with a BY-group option

Let’s dig into this a little deeper by considering the following statements:

proc cas;
simple.summary result=S /table={name='hmeq'};
describe S;
run;

In the code snippet above, the result of the summary action is returned to the user in a variable ‘S’, which is a dictionary. How do we know? We invoke the DESCRIBE statement to find out exactly what form this result, S, takes. The DESCRIBE statement displays the data type of a variable. Let's take a look at the log file:

The log file informs the user that ‘S’ is a dictionary with one entry of type table named “Summary.” In the above example, the column names and attributes are shown in the log. If we want to print the entire summary table or a subset, we would use the code below:

proc cas;
simple.summary result=S /table={name='hmeq'};
print s["Summary"];
print s["summary", 3:5];
run;

The first PRINT statement will fetch the entire summary table; the second will fetch rows 3 through 5. We can also use WHERE expression processing to create a new table with rows that match the WHERE expression. The output of the second PRINT statement is shown in the figure below:

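For example, a WHERE expression can be applied directly to a result table in CASL. The sketch below assumes the Summary table produced above contains a Min column (as in Fig 2); verify the column names in your own output:

```
proc cas;
simple.summary result=s /table={name='hmeq'};
/* Create a new table containing only rows where the minimum value is positive */
t = s["Summary"].where(Min > 0);
print t;
run;
```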
The result of an action could also be more complex in nature; it could be a dictionary containing dictionaries, arrays, and lists, or the result could be a list containing lists and arrays and tables etc. Therefore, it is important to understand these concepts through some simple cases. Let’s consider another example where the result is slightly more complex.

proc cas;
simple.summary result=s /table={name='hmeq', groupby='bad'};
describe s;
print s["ByGroup1.Summary", 3:5];
run;

In the example above, we are executing a summary using a BY group on a binary variable. The log shows that the result in this case is a dictionary with three entries, all of which are tables. Below is a snapshot of the log file as well as the output of PRINT statement looking at the summary for the first BY group for row 3 through 5.

If we are interested in saving the output of the summary action as a SAS data set (a sas7bdat file), we can execute the SAVERESULT statement. The code below saves the summary statistics of the first BY group as a SAS data set in the WORK library.

proc cas;
simple.summary result=s /table={name='hmeq', groupby='bad'};
describe s;
print s["ByGroup1.Summary", 3:5]; 
saveresult s["ByGroup1.Summary"] dataout=work.data;
run;

A helpful function we can often use is FINDTABLE. This function searches the given value for the first table it finds.

proc cas;
simple.summary result=s /table={name='hmeq'};
val = findtable(s);
saveresult val dataout=work.summary;
run;

In the example above, I used FINDTABLE to locate the first table produced within the dictionary S and save it in ‘val’, which is then passed directly to the SAVERESULT statement to save the output as a SAS data set.

Finally, we also have the option to save the output of a CAS action in another CAS table. The table summary1 in the code below is an in-memory CAS table that contains the output of the summary action. The code and output are shown below:

proc cas;
simple.summary /table={name='hmeq'} casout={name='mylib.summary1'}; 
fetch /table={name='mylib.summary1'};
run;

In this post, we saw several ways of interacting with the results of a CAS action using the CAS procedure. Depending on what our end goal is, we can use any of these options to either view the results or save the data for further processing.

Interacting with the results of PROC CAS was published on SAS Users.

November 16, 2017
 

Illustration by Dooder on Freepik

A few weeks ago I helped a customer set up a successful connection between SAS and a remote DB2 database using ODBC. At first, I thought this case would be as simple as plugging in a cable. Why? Because I already had a fully-functional testing environment with direct interaction using SAS/ACCESS Interface to DB2. Besides, I had other RDBMS connected via ODBC. Therefore, it was just a matter of setting up my ODBC driver manager for DB2. No big deal right? Oh boy, was I wrong! It was not easy at all. It took me two full days to make it work in my lab, and only after some in-depth research and a lot of trial and error.

DB2 can be challenging, especially when most of the documentation I found was either incomplete, outdated, scarce or even unclear. I don't want you to undergo the same frustration I felt. As a blogger and SAS user, it is my duty to make your SAS life a little easier. That's why I decided to publish this blog post to share the solution with the community. This guide provides you with the steps to take to set up a successful connection between SAS and a remote DB2 database via ODBC on Linux, without tears.

Table of Contents

Keep in mind this is an extensive guide. I documented all the technical issues I came across in my lab and provided detailed explanations to overcome them. If you are beginning your journey, I recommend going through all the steps. However, for your convenience, if you want to jump right into a specific topic, use the following table of contents:

My Environment

All commands throughout this guide are based on this table. You should replace the code with your settings when appropriate.

Prerequisites

Before we begin, make sure you meet all of these SAS and DB2 requirements:

System Requirements

It may sound obvious, but you must verify that your SAS release supports your DB2 database version. For instance, the My Environment section shows that I'm using SAS 9.4 M4, and this document indicates that I need DB2 10.5 or later. Considering I currently have DB2 11.1, I'm ready to go. To review the rest of the combinations, visit the SAS System Requirements page.

Another obvious must-have is an ODBC driver manager already installed. As you may have noticed, I use unixODBC version 2.3.1. It is always recommended to have the latest version, which at the time of writing is 2.3.4; still, you should be fine with any of the 2.3.X releases. In the past, I've seen users face difficulties with unixODBC 2.2.X.
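If you are unsure which unixODBC release you have, the driver manager ships a small utility that reports it (assuming unixODBC is on your PATH):

```shell
# Print the unixODBC version plus the locations of odbcinst.ini and odbc.ini
odbcinst -j
```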

SAS/ACCESS Interface to ODBC licensed and installed

For this connection to work, it is essential to have SAS/ACCESS Interface to ODBC licensed and installed. First, to verify that this product is licensed, run this code in a SAS session:

proc setinit noalias;
run;

In the output log look for this line:

---SAS/ACCESS Interface to ODBC
                          06JAN2018

If the product is displayed, move on to the next step. If not, look for SAS/ACCESS Interface to DB2 instead. If you have the latter, consider yourself lucky: you don't have to continue with this guide, because you can connect to DB2 directly. If you need assistance with that off-topic issue, send me a message and I'll be glad to help. If you have neither, get in touch with your sales representative.

Next, to verify that this product is installed, go to your !SASROOT/sasexe directory on your SAS server and list these files:

[sas@muse sasexe]$ ls -l *odb*
-r-xr-xr-x. 1 sas sas 630196 Nov  9  2016 sasiodbu
-r-xr-xr-x. 1 sas sas 603114 Nov  9  2016 sasioodb
-r-xr-xr-x. 1 sas sas 603114 Nov  9  2016 sasioodb_u
-r-xr-xr-x. 1 sas sas  59977 Nov  9  2016 sasodbcb
-r-xr-xr-x. 1 sas sas  59977 Nov  9  2016 sasodbcb_u
-r-xr-xr-x. 1 sas sas 102142 Nov  9  2016 sasodbcs
-r-xr-xr-x. 1 sas sas  71982 Nov  9  2016 sasodbdb
-r-xr-xr-x. 1 sas sas  71990 Nov  9  2016 sasodbdb_u
-r-xr-xr-x. 1 sas sas 202343 Nov  9  2016 sasodb
-r-xr-xr-x. 1 sas sas 201815 Nov  9  2016 sasodb_u
-r-xr-xr-x. 1 sas sas 443852 Nov  9  2016 tkeodbc.so

If you don't see the above list, the product is not installed. However, if your SAS license includes it, use your SAS depot to install it on your server.

Choosing the right DB2 client package

IBM offers five different client packages. If you don't know which package to choose, you can't go wrong with IBM Data Server Client: it's the package I use, and it includes all the client tools and libraries available. Another good alternative is IBM Data Server Runtime Client, but you will have to create your DB2 instance and user yourself. I would not recommend IBM Data Server Driver for ODBC and CLI; during my research I found that people could not get it to connect to a remote DB2 database. For more information about each client package, please visit IBM data server client and driver types.

Testing your DB2 connection outside of SAS

This test is the most critical requirement before starting the actual configuration process, and it applies to any connection you plan to configure in the future: first, you must establish a successful connection between your database client and database server outside of SAS, using your database vendor's tools. This validation is a must, regardless of your RDBMS or SAS engine.

To test your connection to your remote DB2 server, run an equivalent command on your DB2 client:

[db2inst@muse ~]$ db2 connect to GSDB user db2inst using mypassword
 
   Database Connection Information
 
 Database server        = DB2/LINUXX8664 11.1.1
 SQL authorization ID   = DB2INST
 Local database alias   = GSDB

If you received a similar output you might continue; otherwise, you will have to catalog your DB2 server as a TCP/IP node, and then catalog the database on that node. Since this is beyond the scope of this guide, please review these IBM documents: CATALOG NODE command and CATALOG DATABASE command.
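For reference, the cataloging commands usually take the following shape; the node name db2node is hypothetical, while the host, port, and database come from My Environment:

```shell
# Register the remote server as a TCP/IP node, then catalog the database on it
db2 catalog tcpip node db2node remote redhat server 50000
db2 catalog database GSDB at node db2node
db2 terminate   # refresh the directory cache
```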

Setting up unixODBC with DB2

Considering that you have tested a successful connection between your database client and database server with the db2 command, it's time to set up a new one using an ODBC driver instead. Please bear with me. Things are about to get interesting.

As mentioned at the beginning, unixODBC is my ODBC driver manager of choice. I use it for all my RDBMS in my lab to test ODBC connections. Its usage is almost effortless, but one of the downsides is the documentation. The only official document available for DB2 is: Using IBM DB2 with unixODBC. Judge for yourself.

This driver manager has two configuration files: odbcinst.ini for database drivers and odbc.ini for database definitions. Typically, with most databases, those two files are all it takes to set up an ODBC connection. DB2 plays a different game, though: it uses two extra files, db2cli.ini for DB2 definitions and an XML-based file called db2dsdriver.cfg.

For now, let's take a closer look at the contents of both unixODBC files:

odbcinst.ini

[DB2]
Description     = IBM DB2 ODBC Driver
Driver          = /opt/ibm/db2/V11.1/lib32/libdb2.so
Driver64        = /opt/ibm/db2/V11.1/lib64/libdb2.so
FileUsage       = 1
DontDLClose     = 1

odbc.ini

[db2odbc]
Driver          = DB2
Description     = DB2 ODBC connection for SAS

At this point I was feeling confident and thought the connection would work, so I tested it out with the isql command, but to my surprise, I received this error:

[db2inst@muse ~]$ isql -v db2odbc db2inst mypassword
[08001][unixODBC][IBM][CLI Driver] SQL1013N  The database alias name or database name "DB2ODBC" could not be found.  SQLSTATE=42705
 
[ISQL]ERROR: Could not SQLConnect

db2cli.ini came to the rescue

As a result, I started looking through the online documentation. After a while, I discovered the db2cli.ini file. To get you started, IBM provides a sample configuration file called db2cli.ini.sample, located in the installation_path/cfg directory. The DB2 documentation recommends creating a db2cli.ini file based on that sample file and placing it in the same location. I followed this advice and created my file with the same data source name:

[db2odbc]
Database        = GSDB
Protocol        = TCPIP
Hostname        = redhat
ServiceName     = 50000

Note: For IBM, the ServiceName keyword is the port number. The default port number for the DB2 default instance is 50000. Review the correct port with your DBA.

Having set up both unixODBC and DB2 files correctly, I was feeling lucky. Therefore, I attempted a connection again, but I received the same error. My frustration was on the rise.

db2cli validate tool is your best friend

I continued digging into the documentation until I found a powerful tool: db2cli with a particular parameter, validate. This command/parameter duo can help you verify the contents of your configuration files (db2cli.ini and db2dsdriver.cfg), list their locations, test your data source names and databases, and display invalid or duplicate keywords, among other things. It is basically your new best friend.

db2cli validate helped me identify why the DB2 ODBC driver was unable to find my data source name. I ran this command to examine my environment, and the output was eye-opening:

[db2inst@muse ~]$ db2cli validate -dsn db2odbc
 
===============================================================================
Client information for the current copy:
===============================================================================
 
Client Package Type       : IBM Data Server Client
Client Version (level/bit): DB2 v11.1.0.0 (s1606081100/64-bit)
Client Platform           : Linux/X8664
Install/Instance Path     : /opt/ibm/db2/V11.1
DB2DSDRIVER_CFG_PATH value: 
db2dsdriver.cfg Path      : /home/db2inst/sqllib/cfg/db2dsdriver.cfg
DB2CLIINIPATH value       : 
db2cli.ini Path           : /home/db2inst/sqllib/cfg/db2cli.ini
db2diag.log Path          : /home/db2inst/sqllib/db2dump/db2diag.log
 
===============================================================================
db2dsdriver.cfg schema validation for the entire file:
===============================================================================
 
Note: The validation utility could not find the configuration file 
db2dsdriver.cfg. The file is searched at 
"/home/db2inst/sqllib/cfg/db2dsdriver.cfg".
 
===============================================================================
db2cli.ini validation for data source name "db2odbc":
===============================================================================
 
Note: The validation utility could not find the configuration file db2cli.ini. 
The file is searched at "/home/db2inst/sqllib/cfg/db2cli.ini".
 
===============================================================================
db2dsdriver.cfg validation for data source name "db2odbc":
===============================================================================
 
Note: The validation utility could not find the configuration file 
db2dsdriver.cfg. The file is searched at 
"/home/db2inst/sqllib/cfg/db2dsdriver.cfg".
 
===============================================================================
The validation is completed.
===============================================================================

As you can see, the validation tool didn't find my db2cli.ini. That file contained my data source name; consequently, the DB2 ODBC driver didn't find it either. As previously indicated, I created that file in the installation_path/cfg directory, which in my case was /opt/ibm/db2/V11.1/cfg, but the DB2 ODBC driver was searching for it in /home/db2inst/sqllib/cfg. I didn't choose that path randomly; I just followed this IBM document. Nevertheless, if you encounter the same issue, the solution is fairly simple: you can either create a copy in the required path or point to the original file with a symbolic link. I chose the latter to avoid duplicates. To fix this misunderstanding, execute this command as root or with sudo:

[root@muse ~]# ln -s /opt/ibm/db2/V11.1/cfg/db2cli.ini /home/db2inst/sqllib/cfg/db2cli.ini

Now you should have this symbolic link:

[db2inst@muse ~]$ ls -l /home/db2inst/sqllib/cfg/db2cli.ini 
lrwxrwxrwx. 1 root db2iadm 33 Sep 11 19:07 /home/db2inst/sqllib/cfg/db2cli.ini -> /opt/ibm/db2/V11.1/cfg/db2cli.ini

After this fix, I used my new friend again, and my environment changed. Below is a partial output:

[db2inst@muse ~]$ db2cli validate -dsn db2odbc
===============================================================================
db2cli.ini validation for data source name "db2odbc":
===============================================================================
 
[ Keywords used for the connection ]
 
Keyword                   Value
---------------------------------------------------------------------------
DATABASE                  GSDB
PROTOCOL                  TCPIP
HOSTNAME                  redhat
SERVICENAME               50000

At last, the DB2 ODBC driver was able to find the configuration file, read its contents, and validate the keywords. My frustration started to disappear.

It was time to see what unixODBC thought about this change. So I reran my isql command:

[db2inst@muse ~]$ isql -v db2odbc db2inst mypassword
+---------------------------------------+
| Connected!                            |
|                                       |
| sql-statement                         |
| help [tablename]                      |
| quit                                  |
|                                       |
+---------------------------------------+
SQL>

It finally worked!

Do we need to use db2dsdriver.cfg?

I managed to set up a successful ODBC connection using only three files: db2cli.ini, along with odbc.ini, and odbcinst.ini. Thus, the answer is no; you don't need to use this additional DB2 configuration file.

For our scenario, I consider it optional. Previously, I pointed out that it is an XML file, as opposed to db2cli.ini, which is a plain-text file. Both can hold the same four keywords I defined. Then why does DB2 provide two files? db2cli.ini is used to configure the behavior of CLI/ODBC applications, whereas db2dsdriver.cfg can be used for those as well as for other types of applications, including PHP, Perl, and .NET, among others.
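For completeness, a db2dsdriver.cfg entry equivalent to the db2cli.ini definition above might look like the sketch below, using my environment's values; validate the exact schema with db2cli validate:

```xml
<configuration>
  <dsncollection>
    <dsn alias="db2odbc" name="GSDB" host="redhat" port="50000"/>
  </dsncollection>
  <databases>
    <database name="GSDB" host="redhat" port="50000"/>
  </databases>
</configuration>
```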

In short, this concludes the DB2 configuration process. Now it's time to cover the SAS territory.

SAS joins the party

So far, we haven't touched any SAS files. We have only tested the DB2 connection, first with the db2 command and then with unixODBC's isql command. These steps are mandatory before attempting a connection from SAS.

Getting sasenv_local ready for DB2

If you have already connected SAS to other RDBMS in the past, you know that the most important file you need to care about is sasenv_local. It is located in the !SASROOT/bin directory. The code below shows a partial output with both ODBC and DB2 variables correctly defined:

# ODBC settings
export ODBCSYSINI=/etc
 
# DB2 settings
export INSTHOME=/opt/ibm/db2/V11.1
export DB2CLIINIPATH=/opt/ibm/db2/V11.1/cfg
 
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/lib64:${INSTHOME}/lib64

Let's analyze each of the above settings. First, SAS needs to know where to locate both unixODBC configuration files. The ODBCSYSINI variable instructs SAS to look for them in the /etc directory.

Second, as a good practice, I defined an environment variable called INSTHOME to reference my DB2 installation directory. This variable is optional but highly recommended. Then, the DB2CLIINIPATH variable tells SAS the location of db2cli.ini. If you decide to use db2dsdriver.cfg instead, then replace that variable with DB2DSDRIVER_CFG_PATH.

Finally, you have to inform SAS where to find unixODBC and DB2 libraries. SAS uses the LD_LIBRARY_PATH environment variable on Linux. This variable points to /usr/lib64 because it's the default path for all unixODBC libraries. I also appended the $INSTHOME/lib64 directory, since it's the place where DB2 has all its libraries.

At this point, you might think you are all set. I don't like being pessimistic, but there is a high probability that your connection will not work. I didn't test every combination of SAS and DB2 releases, but my customer and I experienced the same error when we tested our connection with a LIBNAME statement:

[sas@muse 9.4]$ ./sas -nodms
 
  1? libname db2lib odbc user=db2inst password=mypassword datasrc=db2odbc;
 
ERROR:  An exception has been encountered.
Please contact technical support and provide them with the following
       traceback information:
 
The SAS task name is [Line Mod]
Segmentation Violation
# More lines...

Moreover, if you run this code in SAS Studio or SAS Enterprise Guide, you might not even get a response; SAS will simply crash or hang for a long time.

Fixing the Segmentation Violation error

After intense research, I was able to fix this error. We need to link two ODBC libraries that SAS provides to enable 32-bit capability. Change to your !SASROOT/sasexe directory and backup both files:

[sas@muse sasexe]$ mv sasodb sasodb.orig
[sas@muse sasexe]$ mv sasioodb sasioodb.orig

Create two symbolic links that point to the 32-bit libraries:

[sas@muse sasexe]$ ln -s sasodb_u sasodb
[sas@muse sasexe]$ ln -s sasioodb_u sasioodb

List your files, and the output should resemble the following:

[sas@muse sasexe]$ ls -l *odb*
-r-xr-xr-x. 1 sas sas 630196 Nov  9  2016 sasiodbu
lrwxrwxrwx. 1 sas sas     10 Sep 11 21:51 sasioodb -> sasioodb_u
-r-xr-xr-x. 1 sas sas 603114 Nov  9  2016 sasioodb.orig
-r-xr-xr-x. 1 sas sas 603114 Nov  9  2016 sasioodb_u
lrwxrwxrwx. 1 sas sas      8 Sep 11 21:52 sasodb -> sasodb_u
-r-xr-xr-x. 1 sas sas  59977 Nov  9  2016 sasodbcb
-r-xr-xr-x. 1 sas sas  59977 Nov  9  2016 sasodbcb_u
-r-xr-xr-x. 1 sas sas 102142 Nov  9  2016 sasodbcs
-r-xr-xr-x. 1 sas sas  71982 Nov  9  2016 sasodbdb
-r-xr-xr-x. 1 sas sas  71990 Nov  9  2016 sasodbdb_u
-r-xr-xr-x. 1 sas sas 202343 Nov  9  2016 sasodb.orig
-r-xr-xr-x. 1 sas sas 201815 Nov  9  2016 sasodb_u
-r-xr-xr-x. 1 sas sas 443852 Nov  9  2016 tkeodbc.so

Now your LIBNAME statement may work; however, unlike my customer, you might encounter this error:


November 16, 2017
 
connect sas db2 via odbc

Illustration by Dooder on Freepik

A few weeks ago I helped a customer set up a successful connection between SAS and a remote DB2 database using ODBC. At first, I thought this case would be as simple as plugging in a cable. Why? Because I already had a fully-functional testing environment with direct interaction using SAS/ACCESS Interface to DB2. Besides, I had other RDBMS connected via ODBC. Therefore, it was just a matter of setting up my ODBC driver manager for DB2. No big deal right? Oh boy, was I wrong! It was not easy at all. It took me two full days to make it work in my lab, and only after some in-depth research and a lot of trial and error.

DB2 can be challenging, especially when most of the documentation I found was either incomplete, outdated, scarce or even unclear. I don't want you to undergo the same frustration I felt. As a blogger and SAS user, it is my duty to make your SAS life a little easier. That's why I decided to publish this blog post to share the solution with the community. This guide provides you with the steps to take to set up a successful connection between SAS and a remote DB2 database via ODBC on Linux, without tears.

Table of Contents

Keep in mind this is an extensive guide. I documented all the technical issues I came across in my lab and provided detailed explanations to overcome them. If you are beginning your journey, I recommend you go through all the steps. However, for your convenience, if you want to jump right into a specific topic, use the following table of contents:

My Environment

My Environment

All commands throughout this guide are based on this table. You should replace the code with your settings when appropriate.

Prerequisites

Before we begin, make sure you meet all of these SAS and DB2 requirements:

System Requirements

It may sound obvious, but you must verify that your SAS release supports your DB2 database version. For instance, the My Environment section shows that I'm using SAS 9.4 M4. This document indicates that I need DB2 10.5 or later. Considering I currently have DB2 11.1, I'm ready to go. To review the rest of the combinations, visit the SAS System Requirements page.

Another evident must-have is an ODBC driver manager already installed. As you noticed, I use unixODBC version 2.3.1. It is always recommended to have the latest version, which nowadays is 2.3.4. Still, you should be fine if you decide to use any of the 2.3.X releases. In the past, I've seen users facing difficulties with unixODBC 2.2.X.

SAS/ACCESS Interface to ODBC licensed and installed

For this connection to work, it is essential to have SAS/ACCESS Interface to ODBC licensed and installed. First, to verify that this product is licensed, run this code in a SAS session:

proc setinit noalias;
run;

In the output log look for this line:

---SAS/ACCESS Interface to ODBC
                          06JAN2018

If the product is displayed, move on to the next step. If not, look for SAS/ACCESS Interface to DB2 instead. If you have the latter, consider yourself a lucky person. You don't have to continue with this guide because you can connect to DB2 directly. If you need assistance with this off-topic issue, send me a message, and I'll be glad to help you. If you don't have either one, you should get in touch with your sales representative.

Next, to verify that this product is installed, go to your !SASROOT/sasexe directory on your SAS server and list these files:

[sas@muse sasexe]$ ls -l *odb*
-r-xr-xr-x. 1 sas sas 630196 Nov  9  2016 sasiodbu
-r-xr-xr-x. 1 sas sas 603114 Nov  9  2016 sasioodb
-r-xr-xr-x. 1 sas sas 603114 Nov  9  2016 sasioodb_u
-r-xr-xr-x. 1 sas sas  59977 Nov  9  2016 sasodbcb
-r-xr-xr-x. 1 sas sas  59977 Nov  9  2016 sasodbcb_u
-r-xr-xr-x. 1 sas sas 102142 Nov  9  2016 sasodbcs
-r-xr-xr-x. 1 sas sas  71982 Nov  9  2016 sasodbdb
-r-xr-xr-x. 1 sas sas  71990 Nov  9  2016 sasodbdb_u
-r-xr-xr-x. 1 sas sas 202343 Nov  9  2016 sasodb
-r-xr-xr-x. 1 sas sas 201815 Nov  9  2016 sasodb_u
-r-xr-xr-x. 1 sas sas 443852 Nov  9  2016 tkeodbc.so

If you don't see the above list, the product is not installed. However, if your SAS license includes it, use your SAS depot to install it on your server.

Choosing the right DB2 client package

IBM offers five different client packages. If you don't know which package to choose, you can't go wrong with IBM Data Server Client; it's the package I use, and it includes all the available client tools and libraries. Another good alternative is IBM Data Server Runtime Client, but you will have to create your DB2 instance and user yourself. I would not recommend IBM Data Server Driver for ODBC and CLI; during my research I found that people could not get it to connect to a remote DB2 database. For more information about each client package, please visit IBM data server client and driver types.

Testing your DB2 connection outside of SAS

This test is the most critical requirement before starting the actual configuration process, and the advice applies to any connection you plan to configure in the future: first, establish a successful connection between your database client and database server outside of SAS, using your database vendor's tools. This validation is a must, regardless of your RDBMS or SAS engine.

To test your connection to your remote DB2 server, run an equivalent command on your DB2 client:

[db2inst@muse ~]$ db2 connect to GSDB user db2inst using mypassword
 
   Database Connection Information
 
 Database server        = DB2/LINUXX8664 11.1.1
 SQL authorization ID   = DB2INST
 Local database alias   = GSDB

If you received a similar output, you may continue; otherwise, you will have to catalog your DB2 server as a TCP/IP node and then catalog the database on that node. Since this is beyond the scope of this guide, please review these IBM documents: CATALOG NODE command and CATALOG DATABASE command.

Setting up unixODBC with DB2

Considering that you have tested a successful connection between your database client and database server with the db2 command, it's time to set up a new one using an ODBC driver instead. Please bear with me. Things are about to get interesting.

As mentioned at the beginning, unixODBC is my ODBC driver manager of choice. I use it for all my RDBMS in my lab to test ODBC connections. Its usage is almost effortless, but one of the downsides is the documentation. The only official document available for DB2 is: Using IBM DB2 with unixODBC. Judge for yourself.

This driver manager has two configuration files: odbcinst.ini for database drivers and odbc.ini for database definitions. With most databases, those two files are all it takes to set up an ODBC connection. DB2 plays a different game, though. It uses two extra files: db2cli.ini for DB2 definitions and an XML-based file called db2dsdriver.cfg.

For now, let's take a closer look at the contents of both unixODBC files:

odbcinst.ini

[DB2]
Description     = IBM DB2 ODBC Driver
Driver          = /opt/ibm/db2/V11.1/lib32/libdb2.so
Driver64        = /opt/ibm/db2/V11.1/lib64/libdb2.so
FileUsage       = 1
DontDLClose     = 1

odbc.ini

[db2odbc]
Driver          = DB2
Description     = DB2 ODBC connection for SAS

At this point I was feeling confident and thought the connection would work, so I tested it out with the isql command, but to my surprise, I received this error:

[db2inst@muse ~]$ isql -v db2odbc db2inst mypassword
[08001][unixODBC][IBM][CLI Driver] SQL1013N  The database alias name or database name "DB2ODBC" could not be found.  SQLSTATE=42705
 
[ISQL]ERROR: Could not SQLConnect

db2cli.ini came to the rescue

As a result, I started looking up online documentation. After a while, I discovered the existence of the db2cli.ini file. To get you started, IBM provides a sample configuration file called db2cli.ini.sample, located in the installation_path/cfg directory. The DB2 documentation recommends creating a db2cli.ini file based on that sample file and placing it in the same location. I followed their advice and created my file with the same data source name:

[db2odbc]
Database        = GSDB
Protocol        = TCPIP
Hostname        = redhat
ServiceName     = 50000

Note: For IBM, the ServiceName keyword is the port number. The default port number for the DB2 default instance is 50000. Review the correct port with your DBA.

Having set up both unixODBC and DB2 files correctly, I was feeling lucky. Therefore, I attempted a connection again, but I received the same error. My frustration was on the rise.

db2cli validate tool is your best friend

I continued digging into the documentation until I found a powerful tool: db2cli with a particular parameter: validate. This command/parameter duo can help you verify the contents of your configuration files (db2cli.ini and db2dsdriver.cfg), list their locations, test your data source names and databases, and display invalid or duplicate keywords, among other functionalities. It is basically your new best friend.

db2cli validate helped me identify why the DB2 ODBC driver was unable to find my data source name. I ran this command to examine my environment, and the output was eye-opening:

[db2inst@muse ~]$ db2cli validate -dsn db2odbc
 
===============================================================================
Client information for the current copy:
===============================================================================
 
Client Package Type       : IBM Data Server Client
Client Version (level/bit): DB2 v11.1.0.0 (s1606081100/64-bit)
Client Platform           : Linux/X8664
Install/Instance Path     : /opt/ibm/db2/V11.1
DB2DSDRIVER_CFG_PATH value: 
db2dsdriver.cfg Path      : /home/db2inst/sqllib/cfg/db2dsdriver.cfg
DB2CLIINIPATH value       : 
db2cli.ini Path           : /home/db2inst/sqllib/cfg/db2cli.ini
db2diag.log Path          : /home/db2inst/sqllib/db2dump/db2diag.log
 
===============================================================================
db2dsdriver.cfg schema validation for the entire file:
===============================================================================
 
Note: The validation utility could not find the configuration file 
db2dsdriver.cfg. The file is searched at 
"/home/db2inst/sqllib/cfg/db2dsdriver.cfg".
 
===============================================================================
db2cli.ini validation for data source name "db2odbc":
===============================================================================
 
Note: The validation utility could not find the configuration file db2cli.ini. 
The file is searched at "/home/db2inst/sqllib/cfg/db2cli.ini".
 
===============================================================================
db2dsdriver.cfg validation for data source name "db2odbc":
===============================================================================
 
Note: The validation utility could not find the configuration file 
db2dsdriver.cfg. The file is searched at 
"/home/db2inst/sqllib/cfg/db2dsdriver.cfg".
 
===============================================================================
The validation is completed.
===============================================================================

As you can see, the validation tool didn't find my db2cli.ini. That file contained my data source name; consequently, the DB2 ODBC driver didn't find it either. As previously indicated, I created that file in the installation_path/cfg directory, which in my case was /opt/ibm/db2/V11.1/cfg, but the DB2 ODBC driver was searching for it in /home/db2inst/sqllib/cfg. I didn't choose that path randomly; I just followed this IBM document. Nevertheless, if you encounter the same issue, the solution is fairly simple: you can either create a copy in the required path or point to the original file using a symbolic link. I chose the latter to avoid having duplicates. To fix this mismatch, execute this command as root or sudo:

[root@muse ~]# ln -s /opt/ibm/db2/V11.1/cfg/db2cli.ini /home/db2inst/sqllib/cfg/db2cli.ini

Now you should have this symbolic link:

[db2inst@muse ~]$ ls -l /home/db2inst/sqllib/cfg/db2cli.ini 
lrwxrwxrwx. 1 root db2iadm 33 Sep 11 19:07 /home/db2inst/sqllib/cfg/db2cli.ini -> /opt/ibm/db2/V11.1/cfg/db2cli.ini

After this fix, I used my new friend again, and my environment changed. Below is a partial output:

[db2inst@muse ~]$ db2cli validate -dsn db2odbc
===============================================================================
db2cli.ini validation for data source name "db2odbc":
===============================================================================
 
[ Keywords used for the connection ]
 
Keyword                   Value
---------------------------------------------------------------------------
DATABASE                  GSDB
PROTOCOL                  TCPIP
HOSTNAME                  redhat
SERVICENAME               50000

At last, the DB2 ODBC driver was able to find the configuration file, read its contents, and validate the keywords. My frustration started to disappear.

It was time to see what unixODBC thought about this change. So I reran my isql command:

[db2inst@muse ~]$ isql -v db2odbc db2inst mypassword
+---------------------------------------+
| Connected!                            |
|                                       |
| sql-statement                         |
| help [tablename]                      |
| quit                                  |
|                                       |
+---------------------------------------+
SQL>

It finally worked!

Do we need to use db2dsdriver.cfg?

I managed to set up a successful ODBC connection using only three files: db2cli.ini, odbc.ini, and odbcinst.ini. Thus, the answer is no; you don't need this additional DB2 configuration file.

For our scenario, I consider it optional. Previously, I pointed out that it is an XML file, as opposed to db2cli.ini, which is a plain text file. Both can hold the same four keywords I defined. Then why does DB2 provide two files? db2cli.ini is used to configure the behavior of CLI/ODBC applications, whereas db2dsdriver.cfg can be used for those as well as for other types of applications, including PHP, Perl, and .NET.
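For reference, here is roughly what an equivalent db2dsdriver.cfg could look like for my data source. I didn't use this file in my setup, so treat it as a sketch based on IBM's documented schema rather than a tested configuration; the alias, database, host, and port simply mirror the values from my db2cli.ini:

```xml
<configuration>
   <dsncollection>
      <!-- ODBC data source name pointing at the GSDB database -->
      <dsn alias="db2odbc" name="GSDB" host="redhat" port="50000"/>
   </dsncollection>
   <databases>
      <database name="GSDB" host="redhat" port="50000"/>
   </databases>
</configuration>
```

If you go this route, remember that SAS must be told about it with the DB2DSDRIVER_CFG_PATH environment variable rather than DB2CLIINIPATH.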

In short, this concludes the DB2 configuration process. Now it's time to cover the SAS territory.

SAS joins the party

So far we haven't touched any SAS files. We have only tested the DB2 connection, first using the db2 command, then using unixODBC's isql command. These are mandatory steps before attempting a connection with SAS.

Getting sasenv_local ready for DB2

If you have already connected SAS to other RDBMS in the past, you know that the most important file you need to care about is sasenv_local. It is located in the !SASROOT/bin directory. The code below shows a partial output with both ODBC and DB2 variables correctly defined:

# ODBC settings
export ODBCSYSINI=/etc
 
# DB2 settings
export INSTHOME=/opt/ibm/db2/V11.1
export DB2CLIINIPATH=/opt/ibm/db2/V11.1/cfg
 
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/lib64:${INSTHOME}/lib64

Let's analyze each of the above settings. First, SAS needs to know where to locate both unixODBC configuration files. The ODBCSYSINI variable instructs SAS to look for them in the /etc directory.

Second, as a good practice, I defined an environment variable called INSTHOME to reference my DB2 installation directory. This variable is optional but highly recommended. Then, the DB2CLIINIPATH variable tells SAS the location of db2cli.ini. If you decide to use db2dsdriver.cfg instead, then replace that variable with DB2DSDRIVER_CFG_PATH.

Finally, you have to inform SAS where to find unixODBC and DB2 libraries. SAS uses the LD_LIBRARY_PATH environment variable on Linux. This variable points to /usr/lib64 because it's the default path for all unixODBC libraries. I also appended the $INSTHOME/lib64 directory, since it's the place where DB2 has all its libraries.

At this point, you might think that you are all set. I don't like being pessimistic, but there is a high possibility that your connection may not work. I didn't test all SAS and DB2 releases, but my customer and I experienced the same error when we tested our connection with a LIBNAME statement:

[sas@muse 9.4]$ ./sas -nodms
 
  1? libname db2lib odbc user=db2inst password=mypassword datasrc=db2odbc;
 
ERROR:  An exception has been encountered.
Please contact technical support and provide them with the following
       traceback information:
 
The SAS task name is [Line Mod]
Segmentation Violation
# More lines...

Moreover, if you run this code in SAS Studio or SAS Enterprise Guide, you might not even get a response; SAS will just crash or hang for a long time.

Fixing the Segmentation Violation error

After intense research, I was able to fix this error. We need to relink two ODBC libraries that SAS provides so that SAS loads their Unicode variants (the files with the _u suffix). Change to your !SASROOT/sasexe directory and back up both files:

[sas@muse sasexe]$ mv sasodb sasodb.orig
[sas@muse sasexe]$ mv sasioodb sasioodb.orig

Create two symbolic links that point to the Unicode libraries:

[sas@muse sasexe]$ ln -s sasodb_u sasodb
[sas@muse sasexe]$ ln -s sasioodb_u sasioodb

List your files, and the output should resemble the following:

[sas@muse sasexe]$ ls -l *odb*
-r-xr-xr-x. 1 sas sas 630196 Nov  9  2016 sasiodbu
lrwxrwxrwx. 1 sas sas     10 Sep 11 21:51 sasioodb -> sasioodb_u
-r-xr-xr-x. 1 sas sas 603114 Nov  9  2016 sasioodb.orig
-r-xr-xr-x. 1 sas sas 603114 Nov  9  2016 sasioodb_u
lrwxrwxrwx. 1 sas sas      8 Sep 11 21:52 sasodb -> sasodb_u
-r-xr-xr-x. 1 sas sas  59977 Nov  9  2016 sasodbcb
-r-xr-xr-x. 1 sas sas  59977 Nov  9  2016 sasodbcb_u
-r-xr-xr-x. 1 sas sas 102142 Nov  9  2016 sasodbcs
-r-xr-xr-x. 1 sas sas  71982 Nov  9  2016 sasodbdb
-r-xr-xr-x. 1 sas sas  71990 Nov  9  2016 sasodbdb_u
-r-xr-xr-x. 1 sas sas 202343 Nov  9  2016 sasodb.orig
-r-xr-xr-x. 1 sas sas 201815 Nov  9  2016 sasodb_u
-r-xr-xr-x. 1 sas sas 443852 Nov  9  2016 tkeodbc.so

Now your LIBNAME statement may work, but (unlike my customer) you might still encounter this error:

  1? libname db2lib odbc user=db2inst password=mypassword datasrc=db2odbc;
 
ERROR: Could not load /sas/Software/SASFoundation/9.4/sasexe/sasodb (35
       images loaded)
ERROR: libodbc.so.1: cannot open shared object file: No such file or directory
ERROR: The SAS/ACCESS Interface to ODBC cannot be loaded. The SASODB   code
       appendage could not be loaded.
ERROR: Error in the LIBNAME statement.

The error is self-explanatory. SAS couldn't find the file libodbc.so.1, which SASODB needs. To display all the required shared libraries, execute the ldd command:

[sas@muse sasexe]$ ldd sasodb
        linux-vdso.so.1 =>  (0x00007ffd4efe2000)
        libodbc.so.1 => not found
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f9aa741d000)
        libcrypt.so.1 => /lib64/libcrypt.so.1 (0x00007f9aa71e6000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f9aa6fe2000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f9aa6cdf000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f9aa691e000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f9aa6708000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f9aa7882000)
        libfreebl3.so => /lib64/libfreebl3.so (0x00007f9aa6504000)

Indeed, that library was not found, so to fix this missing dependency, create a symbolic link using root or sudo:

[root@muse ~]# ln -s /lib64/libodbc.so.2.0.0 /lib64/libodbc.so.1

Creating a library that works

Now that the previous errors are gone, run a final test using the SAS command line to assign a library and then SAS Studio to display some data in a nice-looking way:

[sas@muse 9.4]$ ./sas -nodms
  1? libname db2lib odbc user=db2inst password=mypassword datasrc=db2odbc;
 
NOTE: Libref DB2LIB was successfully assigned as follows: 
      Engine:        ODBC 
      Physical Name: db2odbc

Note: Displaying actual data is an essential last step because if you are pointing to an incorrect DB2 ODBC driver (such as libdb2o.so instead of libdb2.so), you can still assign a library successfully, but you won't see any tables.

In SAS Studio I assigned the same library and ran a small query to retrieve some data:

proc print data=db2lib.department_lookup;
    var department_code department_en;
run;

Sample data in SAS Studio

Wrapping Up

I know it's a lot to digest. However, some users may not need to follow all the steps to succeed, whereas others starting from scratch may need to perform every step carefully. Whatever your situation might be, I hope you find this guide useful and easy to follow. Did you struggle like me? Share your experience in the comments below, or if you have any problems, don't hesitate to contact me.

Connecting SAS to a DB2 database via ODBC without tears was published on SAS Users.

November 4, 2017
 

Internet of Things for dementia

Dementia describes different brain disorders that trigger a loss of brain function. These conditions are all usually progressive and eventually severe. Alzheimer's disease is the most common type of dementia, affecting 62 percent of those diagnosed. Other types include vascular dementia, affecting 17 percent of those diagnosed, and mixed dementia, affecting 10 percent.

Dementia Statistics

There are 850,000 people with dementia in the UK, with numbers set to rise to over 1 million by 2025. This will soar to 2 million by 2051. 225,000 will develop dementia this year, that’s one every three minutes. 1 in 6 people over the age of 80 have dementia. 70 percent of people in care homes have dementia or severe memory problems. There are over 40,000 people under 65 with dementia in the UK. More than 25,000 people from black, Asian and minority ethnic groups in the UK are affected.

Cost of treating dementia

Two-thirds of the cost of dementia is paid by people with dementia and their families. Unpaid carers supporting someone with dementia save the economy £11 billion a year. Dementia is one of the main causes of disability later in life, ahead of cancer, cardiovascular disease and stroke (statistics can be obtained from here - Alzheimer's Society webpage). Tackling dementia requires a lot of resources and support from the UK government, which is battling to find funding to support the NHS (National Health Service). During 2010–11, the NHS will need to contribute £2.3bn ($3.8bn) of the £5bn of public sector efficiency savings, where the highest savings are expected primarily from PCTs (Primary care trusts). In anticipation of tough times ahead, it is in the interest of PCTs to obtain evidence-based knowledge of the use of their services (e.g. accident & emergency, inpatients, outpatients, etc.) based on regions and patient groups in order to reduce inequalities in health outcomes, improve matching of supply and demand, and most importantly reduce costs generated by their various services.

Currently in the UK, general practice (GP) doctors deliver primary care services by providing treatment and drug prescriptions and where necessary patients are referred to specialists, such as for outpatient care, which is provided by local hospitals (or specialised clinics). However, general practitioners (GPs) are limited in terms of size, resources, and the availability of the complete spectrum of care within the local community.

Solution in sight for dementia patients

There is a need to prevent, or at least delay, costly long-term care of dementia patients in nursing homes. Using wearables, monitors, sensors and other devices, the NHS in Surrey is collaborating with research centres to generate 'Internet of Things' data to monitor the health of dementia patients from the comfort of their own homes. The information from these devices will help people take more control over their own health and wellbeing, with the insights and alerts enabling health and social care staff to deliver more responsive and effective services. (More project details can be obtained here.)

Particle filtering

One method that could be used to analyse the IoT-generated data is particle filtering. IoT data fits naturally within a Bayesian framework, and this method is very robust, combining historical and real-time data in order to make better decisions.

In Bayesian statistics, we often have prior knowledge of the phenomenon or application being modelled. This allows us to formulate a Bayesian model, that is, a prior distribution for the unknown quantities and likelihood functions relating these quantities to the observations (Doucet et al., 2008). As new evidence becomes available, we are often interested in updating our current knowledge, that is, the posterior distribution. Using a State Space Model (SSM), we can apply Bayesian methods to time series data. The strength of these methods, however, lies in the ability to define the SSM parameters appropriately; otherwise our model will perform poorly. Many SSMs suffer from non-linearity and non-Gaussian assumptions, making the maximum likelihood difficult to obtain when using standard methods. The classical inference method for nonlinear dynamic systems is the extended Kalman filter (EKF), which is based on linearization of the state and trajectories (e.g. Johnson et al., 2008). The EKF has been successfully applied to many non-linear filtering problems. However, the EKF is known to fail if the system exhibits substantial nonlinearity and/or if the state and measurement noise are significantly non-Gaussian.

An alternative method, which gives a good approximation even when the posterior distribution is non-Gaussian, is a simulation-based method called Monte Carlo. This method is based upon drawing observations from the distribution of the variable of interest and simply calculating the empirical estimate of the expectation.
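As a toy illustration of that idea (my own sketch, not part of the original study), the expectation of any function of a random variable can be approximated by averaging that function over simulated draws:

```python
import random

def mc_expectation(draw, g, n=100_000, seed=42):
    """Approximate E[g(X)] by the empirical mean of g over n draws of X."""
    rng = random.Random(seed)
    return sum(g(draw(rng)) for _ in range(n)) / n

# Example: for X ~ N(0, 1), E[X^2] equals the variance, i.e. 1.
est = mc_expectation(lambda rng: rng.gauss(0.0, 1.0), lambda x: x * x)
print(est)  # approximately 1.0
```

The estimate converges at a rate of order 1/sqrt(n) regardless of the shape of the distribution, which is what makes Monte Carlo attractive for non-Gaussian posteriors.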

To apply these methods to time series data, where observations arrive in sequential order, performing inference on-line becomes imperative, hence the general term sequential Monte Carlo (SMC). SMC encompasses a range of algorithms used for approximate filtering and smoothing, and among them is particle filtering. In most of the literature it has become a general tradition to present particle filtering as SMC; however, it is very important to note this distinction. Particle filtering is simply a simulation-based algorithm used to approximate complicated posterior distributions. It combines sequential importance sampling (SIS) with an additional resampling step. SMC methods are very flexible, easy to implement, parallelizable and applicable in very general settings. The advent of cheap and formidable computational power, in conjunction with some recent developments in applied statistics, engineering and probability, has stimulated many advancements in this field (Cappe et al. 2007). Computational simplicity, in the form of not having to store all the data, is also an advantage of SMC over MCMC (Markov Chain Monte Carlo).
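To make the SIS-plus-resampling recipe concrete, here is a minimal bootstrap particle filter I sketched for a toy linear-Gaussian state space model. The model, noise levels and particle count are invented for illustration; a real IoT application would plug in its own transition and observation models:

```python
import math
import random

rng = random.Random(0)

# Toy state space model: x_t = 0.9 * x_{t-1} + v_t,  y_t = x_t + w_t,
# with v_t, w_t ~ N(0, 0.5^2). Simulate a hidden state path and observations.
T, N = 50, 1000                      # time steps, number of particles
true_states, observations = [], []
x = 0.0
for _ in range(T):
    x = 0.9 * x + rng.gauss(0.0, 0.5)
    true_states.append(x)
    observations.append(x + rng.gauss(0.0, 0.5))

# Bootstrap particle filter: propagate through the transition model (the SIS
# proposal), weight each particle by the observation likelihood, then resample.
particles = [0.0] * N
estimates = []
for y in observations:
    particles = [0.9 * p + rng.gauss(0.0, 0.5) for p in particles]
    weights = [math.exp(-0.5 * ((y - p) / 0.5) ** 2) for p in particles]
    total = sum(weights)
    weights = [w / total for w in weights]
    estimates.append(sum(w * p for w, p in zip(weights, particles)))  # posterior mean
    particles = rng.choices(particles, weights=weights, k=N)          # resampling step

rmse = math.sqrt(sum((e - s) ** 2 for e, s in zip(estimates, true_states)) / T)
print(f"filter RMSE: {rmse:.2f}")   # the filtered estimates track the hidden state
```

Resampling at every step, as done here, is the simplest scheme; in practice one often resamples only when the effective sample size drops below a threshold, to reduce the extra variance that resampling introduces.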

References

[1] National Health Service England, http://www.nhs.uk/NHSEngland/aboutnhs/Pages/Authoritiesandtrusts.aspx (accessed 18 August 2009).

[2] Pincus SM. Approximate entropy as a measure of system complexity. Proc Natl Acad Sci USA 88: 2297–2301, 1991

[3] Pincus SM and Goldberger AL. Physiological time-series analysis: what does regularity quantify?  Am J Physiol Heart Circ Physiol 266: H1643–H1656, 1994

[4] Cappe, O.,S. Godsill, E. Moulines(2007).An overview of existing methods and recent advances in sequential Monte Carlo. Proceedings of the IEEE. Volume 95, No 5, pp 899-924.

[5] Doucet, A., A. M. Johansen(2008). A Tutorial on Particle Filtering and Smoothing: Fifteen years Later.

[6] Johansen, A. M. (2009).SMCTC: Sequential Monte Carlo in C++. Journal of Statistical Software. Volume 30, issue 6.

[7] Rasmussen and Z.Ghahramani (2003). Bayesian Monte Carlo. In S. Becker and K. Obermayer, editors, Advances in Neural Information Processing Systems, volume 15.

[8] Osborne A. M.,Duvenaud D., GarnettR.,Rasmussen, C.E., Roberts, C.E.,Ghahramani, Z. Active Learning of Model Evidence Using Bayesian Quadrature.

Scaling Internet of Things for dementia using Particle filters was published on SAS Users.

11月 042017
 

Internet of Things for dementiaDementia describes different brain disorders that trigger a loss of brain function. These conditions are all usually progressive and eventually severe. Alzheimer's disease is the most common type of dementia, affecting 62 percent of those diagnosed. Other types of dementia include; vascular dementia affecting 17 percent of those diagnosed, mixed dementia affecting 10 percent of those diagnosed.

Dementia Statistics

There are 850,000 people with dementia in the UK, with numbers set to rise to over 1 million by 2025. This will soar to 2 million by 2051. 225,000 will develop dementia this year, that’s one every three minutes. 1 in 6 people over the age of 80 have dementia. 70 percent of people in care homes have dementia or severe memory problems. There are over 40,000 people under 65 with dementia in the UK. More than 25,000 people from black, Asian and minority ethnic groups in the UK are affected.

Cost of treating dementia

Two-thirds of the cost of dementia is paid by people with dementia and their families. Unpaid careers supporting someone with dementia save the economy £11 billion a year. Dementia is one of the main causes of disability later in life, ahead of cancer, cardiovascular disease and stroke (Statistic can be obtained from here - Alzheimer’s society webpage). To tackle dementia requires a lot of resources and support from the UK government which is battling to find funding to support NHS (National Health Service). During 2010–11, the NHS will need to contribute £2.3bn ($3.8bn) of the £5bn of public sector efficiency savings, where the highest savings are expected primarily from PCTs (Primary care trusts). In anticipation for tough times ahead, it is in the interest of PCTs to obtain evidence-based knowledge of the use of their services (e.g. accident & emergency, inpatients, outpatients, etc.) based on regions and patient groups in order to reduce the inequalities in health outcomes, improve matching of supply and demand, and most importantly reduce costs generated by its various services.

Currently in the UK, general practice (GP) doctors deliver primary care services by providing treatment and drug prescriptions and where necessary patients are referred to specialists, such as for outpatient care, which is provided by local hospitals (or specialised clinics). However, general practitioners (GPs) are limited in terms of size, resources, and the availability of the complete spectrum of care within the local community.

Solution in sight for dementia patients

There is the need to prevent or avoid delay for costly long-term care of dementia patients in nursing homes. Using wearables, monitors, sensors and other devices, NHS based in Surrey is collaborating with research centres to generate ‘Internet of Things’ data to monitor the health of dementia patients at the comfort of staying at home. The information from these devices will help people take more control over their own health and wellbeing, with the insights and alerts enabling health and social care staff to deliver more responsive and effective services. (More project details can be obtained here.)

Particle filtering

One class of methods that could be used to analyse the IoT-generated data is particle filtering. IoT data fall naturally within the Bayesian framework. These methods are very robust and combine historical and real-time data in order to make better decisions.

In Bayesian statistics, we often have prior knowledge or information about the phenomenon or application being modelled. This allows us to formulate a Bayesian model, that is, a prior distribution for the unknown quantities and likelihood functions relating these quantities to the observations (Doucet et al., 2008). As new evidence becomes available, we are often interested in updating our current knowledge, or posterior distribution. Using a state space model (SSM), we can apply Bayesian methods to time series data. The strength of these methods, however, lies in our ability to define the SSM parameters appropriately; otherwise the model will perform poorly. Many SSMs suffer from non-linearity and non-Gaussian assumptions, making the maximum likelihood difficult to obtain using standard methods. The classical inference method for nonlinear dynamic systems is the extended Kalman filter (EKF), which is based on linearization of the state and trajectories (e.g. Johnson et al., 2008). The EKF has been successfully applied to many non-linear filtering problems. However, the EKF is known to fail if the system exhibits substantial nonlinearity and/or if the state and measurement noise are significantly non-Gaussian.
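As a toy illustration of this sequential updating (a hedged Python sketch, not the project's actual model), consider a conjugate Beta-Bernoulli setup: a Beta prior on the probability that a daily sensor reading is anomalous, updated one observation at a time. The data and parameter values here are hypothetical.

```python
# Toy illustration (hypothetical data) of sequential Bayesian updating:
# a Beta prior on the probability that a daily sensor reading is anomalous,
# updated one Bernoulli observation at a time.
alpha, beta = 1.0, 1.0                     # uniform Beta(1, 1) prior

observations = [0, 0, 1, 0, 1, 1, 0, 0]    # 1 = anomalous reading

for y in observations:
    # Conjugacy: the posterior after each observation is again a Beta,
    # so updating our current knowledge is just two counter increments.
    alpha += y
    beta += 1 - y

posterior_mean = alpha / (alpha + beta)    # Beta(4, 6) posterior
print(posterior_mean)  # → 0.4
```

Because the posterior after each record has the same form as the prior, the update is cheap enough to run as each new IoT observation arrives.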

An alternative approach, which gives a good approximation even when the posterior distribution is non-Gaussian, is a simulation-based method called Monte Carlo. It is based on drawing observations from the distribution of the variable of interest and simply calculating the empirical estimate of the expectation.
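The idea can be sketched in a few lines of Python (an illustrative example, not tied to the project's data): to approximate E[g(X)], draw samples of X and average g over them.

```python
import random

random.seed(42)

# Plain Monte Carlo: approximate E[g(X)] for X ~ N(0, 1) with g(x) = x**2
# (true value 1.0) by averaging g over simulated draws.
draws = [random.gauss(0.0, 1.0) for _ in range(100_000)]
estimate = sum(x * x for x in draws) / len(draws)

print(estimate)  # close to the true value of 1.0
```

The error of the estimate shrinks at the usual 1/sqrt(n) Monte Carlo rate, regardless of whether the underlying distribution is Gaussian.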

To apply these methods to time series data, where observations arrive in sequential order, performing inference on-line becomes imperative, hence the general term sequential Monte Carlo (SMC). SMC encompasses a range of algorithms used for approximate filtering and smoothing. Among these is particle filtering. In much of the literature it has become a tradition to present particle filtering as synonymous with SMC; however, it is important to note the distinction. Particle filtering is simply a simulation-based algorithm used to approximate complicated posterior distributions. It combines sequential importance sampling (SIS) with an additional resampling step. SMC methods are very flexible, easy to implement, parallelizable and applicable in very general settings. The advent of cheap and formidable computational power, in conjunction with recent developments in applied statistics, engineering and probability, has stimulated many advancements in this field (Cappe et al., 2007). Computational simplicity, in the form of not having to store all the data, is an additional advantage of SMC over MCMC (Markov chain Monte Carlo).
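A minimal bootstrap particle filter (sequential importance sampling plus resampling) can be sketched as follows. This is an illustrative Python toy on a hypothetical 1-D random-walk state observed with Gaussian noise; the model, noise scales and particle count are assumptions, not the project's actual configuration.

```python
import math
import random

random.seed(1)

# Bootstrap particle filter (SIS + resampling) for a hypothetical 1-D
# random-walk state observed with Gaussian noise:
#   x_t = x_{t-1} + process noise,   y_t = x_t + measurement noise.
N = 2000                        # number of particles
PROC_SD, MEAS_SD = 0.5, 1.0     # assumed noise scales, for illustration only

def simulate(T):
    xs, ys, x = [], [], 0.0
    for _ in range(T):
        x += random.gauss(0.0, PROC_SD)
        xs.append(x)
        ys.append(x + random.gauss(0.0, MEAS_SD))
    return xs, ys

def particle_filter(ys):
    particles = [0.0] * N
    estimates = []
    for y in ys:
        # 1. Propagate every particle through the state equation.
        particles = [p + random.gauss(0.0, PROC_SD) for p in particles]
        # 2. Importance weights from the Gaussian measurement likelihood.
        weights = [math.exp(-0.5 * ((y - p) / MEAS_SD) ** 2) for p in particles]
        total = sum(weights)
        weights = [w / total for w in weights]
        # 3. Filtered mean, then multinomial resampling against degeneracy.
        estimates.append(sum(w * p for w, p in zip(weights, particles)))
        particles = random.choices(particles, weights=weights, k=N)
    return estimates

true_states, observations = simulate(50)
filtered = particle_filter(observations)
rmse = math.sqrt(sum((f - x) ** 2 for f, x in zip(filtered, true_states)) / 50)
print(rmse)  # noticeably below the raw measurement noise of 1.0
```

The resampling step is what distinguishes particle filtering from plain SIS: it concentrates particles where the measurement likelihood is high, keeping weight degeneracy under control as observations stream in.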

References

[1] National Health Service England. http://www.nhs.uk/NHSEngland/aboutnhs/Pages/Authoritiesandtrusts.aspx (accessed 18 August 2009).

[2] Pincus, S.M. (1991). Approximate entropy as a measure of system complexity. Proc Natl Acad Sci USA 88: 2297–2301.

[3] Pincus, S.M. and Goldberger, A.L. (1994). Physiological time-series analysis: what does regularity quantify? Am J Physiol Heart Circ Physiol 266: H1643–H1656.

[4] Cappe, O., Godsill, S. and Moulines, E. (2007). An overview of existing methods and recent advances in sequential Monte Carlo. Proceedings of the IEEE, Volume 95, No. 5, pp. 899–924.

[5] Doucet, A. and Johansen, A.M. (2008). A Tutorial on Particle Filtering and Smoothing: Fifteen Years Later.

[6] Johansen, A.M. (2009). SMCTC: Sequential Monte Carlo in C++. Journal of Statistical Software, Volume 30, Issue 6.

[7] Rasmussen, C.E. and Ghahramani, Z. (2003). Bayesian Monte Carlo. In S. Becker and K. Obermayer, editors, Advances in Neural Information Processing Systems, Volume 15.

[8] Osborne, M.A., Duvenaud, D., Garnett, R., Rasmussen, C.E., Roberts, S.J. and Ghahramani, Z. Active Learning of Model Evidence Using Bayesian Quadrature.

Scaling Internet of Things for dementia using Particle filters was published on SAS Users.

10月 272017
 

When loading data into CAS using PROC CASUTIL, you have two choices on how the table can be loaded:  session-scope or global-scope.  This is controlled by the PROMOTE option in the PROC CASUTIL statement.

Session-scope loaded

proc casutil;
   load casdata="model_table.sas7bdat" incaslib="ryloll"
        outcaslib="otcaslib" casout="model_table";
run;

Global-scope loaded

proc casutil;
   load casdata="model_table.sas7bdat" incaslib="ryloll"
        outcaslib="otcaslib" casout="model_table" promote;
run;

 

Remember session-scope tables can only be seen by a single CAS session and are dropped from CAS when that session is terminated, while global-scope tables can be seen publicly and will not be dropped when the CAS session is terminated.

But what happens if I want to create a new table for modeling by partitioning an existing table and adding a new partition column? Will the new table be session-scoped or global-scoped? To find out, I have a global-scoped table called MODEL_TABLE that I want to partition based on my response variable Event. I will use PROC PARTITION and call my new table MODEL_TABLE_PARTITIONED.

proc partition data=OTCASLIB.MODEL_TABLE partind samppct=30;
	by Event;
	output out=OTCASLIB.model_table_partitioned;
run;

 

After I created my new table, I executed the following code to determine its scope. Notice that the Promoted Table value is set to No on my new table MODEL_TABLE_PARTITIONED which means it’s session-scoped.

proc casutil;
     list tables incaslib="otcaslib";
run;

 


How can I promote my table to global-scoped?  Because PROC PARTITION doesn’t provide me with an option to promote my table to global-scope, I need to execute the following PROC CASUTIL code to promote my table to global-scope.

proc casutil;
     promote casdata="MODEL_TABLE_PARTITIONED"
     incaslib="OTCASLIB" outcaslib="OTCASLIB" casout="MODEL_TABLE_PARTITIONED";
run;

 

I know what you’re thinking.  Why do I have to execute a PROC CASUTIL every time I need my output to be seen publicly in CAS?  That’s not efficient.  There has to be a better way!

Well there is, by using CAS Actions.  Remember, when working with CAS in SAS Viya, SAS PROCs are converted to CAS Actions and CAS Actions are at a more granular level, providing more options and parameters to work with.

How do I figure out what CAS Action syntax was used when I execute a SAS PROC?  Using the PROC PARTITION example from earlier, I can execute the following code after my PROC PARTITION completes to see the CAS Action syntax that was previously executed.

proc cas;
     history;
run;

 

This command will return a lot of output, but if I look for lines that start with the word “action,” I can find the CAS Actions that were executed.  In the output, I can see the following CAS action was executed for PROC PARTITION:

action sampling.stratified / table={name='MODEL_TABLE', caslib='OTCASLIB', groupBy={{name='Event'}}}, samppct=30, partind=true, output={casOut={name='MODEL_TABLE_PARTITIONED', caslib='OTCASLIB', replace=true}, copyVars='ALL'};

 

To partition my MODEL_TABLE using a CAS Action, I would execute the following code.

proc cas;
  sampling.stratified / 
    table={name='MODEL_TABLE', caslib='OTCASLIB', groupBy={name='Event'}}, 
    samppct=30, 
    partind=true, 
    output={casOut={name='embu_partitioned', caslib='OTCASLIB'}, copyVars='ALL'};
run;

 

If I look up the sampling.stratified syntax in the SAS Viya documentation, I can see that the casOut parameter also accepts a promote option. Adding promote=true creates the output table as global-scope directly:

proc cas;
  sampling.stratified / 
    table={name='MODEL_TABLE', caslib='OTCASLIB', groupBy={name='Event'}}, 
    samppct=30, 
    partind=true, 
    output={casOut={name='embu_partitioned', caslib='OTCASLIB', promote=true}, copyVars='ALL'};
run;

 

So, what did we learn from this exercise?  We learned that when we create a table in CAS from a SAS PROC, the default scope will be session and to change the scope to global we would need to promote it through a PROC CASUTIL statement.  We also learned how to see the CAS Actions that were executed by SAS PROCs and how we can write code in CAS Action form to give us more control.

I hope this exercise helps you when working with CAS.

Thanks.

Tips and tricks to promote CAS tables from session-scope to global-scope was published on SAS Users.

10月 212017
 

There are many compelling reasons existing SAS users might want to start integrating SAS Viya into their SAS9 programs and applications.  For me, it comes down to ease-of-use, speed, and faster time-to-value.  With the ability to traverse the (necessarily iterative) analytics lifecycle faster than before, we are now able to generate output quicker – better supporting vital decision-making in a reduced timeframe.   In addition to the positive impacts this can have on productivity, it can also change the way we look at current business challenges and how we design possible solutions.

Earlier this year I wrote about how SAS Viya provides a robust analytics environment to handle all of your big data processing needs.  Since then, I’ve been involved in testing the new SAS Viya 3.3 software that will be released near the end of 2017 and found some additional advantages I think warrant attention.  In this article, I rank order the main advantages of SAS Viya processing and new capabilities coming to SAS Viya 3.3 products.  While the new SAS Viya feature list is too long to list everything individually, I’ve put together the top reasons why you might want to start taking advantage of SAS Viya capabilities of the SAS platform.

1.     Multi-threaded everything, including the venerable DATA-step

In SAS Viya, everything that can run multi-threaded does.  This is the single most important aspect of the SAS Viya architecture for existing SAS customers.  As part of this new holistic approach to data processing, SAS has enabled the highly flexible DATA step to run multi-threaded, requiring very little modification of code in order to begin taking advantage of this significant new capability (more on that in a soon-to-be-released blog).  Migrating to SAS Viya is especially important in those cases where long-running jobs consist of very long DATA steps that act as processing bottlenecks because of older single-threading configurations.

2.     No sorting necessary!

While not 100% true, most sort routines can be removed from your existing SAS programs.  Ask yourself the question: “What portion of my runtimes are due strictly to sorting?”  The answer is likely around 10-25%, maybe more.  In general, the concept of sorting goes away with in-memory processing.  SAS Viya does its own internal memory shuffling as a replacement.  The SAS Viya CAS engine takes care of partitioning and organizing the data so you don’t have to.  So, take those sorts out of your existing code!

3.     VARCHAR informat (plus other “variable-blocking” informats/formats)

Not available in SAS 9.4, the VARCHAR informat/format allows you to store byte information without having to allocate room for blank spaces.  Because storage for columnar (input) values varies by row, you have the potential to achieve an enormous amount of (blank space) savings, which is especially important if you are using expensive (fast) disk storage space.  This represents a huge value in terms of potential data storage size reduction.
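The potential savings are easy to see with a back-of-the-envelope calculation. The column width and average value length below are hypothetical, purely to illustrate the arithmetic:

```python
# Hypothetical back-of-the-envelope estimate: a CHAR(100) comment column
# always occupies 100 bytes per row, while VARCHAR storage grows with the
# actual value length (per-value length overhead ignored for simplicity).
rows = 10_000_000
fixed_width = 100            # CHAR(100): short values are blank-padded
avg_actual_length = 18       # assumed average comment length in bytes

fixed_bytes = rows * fixed_width
varchar_bytes = rows * avg_actual_length
savings_pct = 100 * (fixed_bytes - varchar_bytes) / fixed_bytes

print(savings_pct)  # → 82.0, i.e. 82% less storage in this scenario
```

The shorter the typical value relative to the declared column width, the larger the saving, which is why VARCHAR matters most for sparse free-text columns on expensive fast disk.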

4.     Reduced I/O in the form of data reads and writes from Hive/HDFS and Teradata to CAS memory

SAS Viya can leverage Hive/HDFS and Teradata platforms by loading (lifting) data up and writing data back down in parallel using CAS pooled memory.  Data I/O, namely reading data from disk and converting it into a SAS binary format needed for processing, is the single most limiting factor of SAS 9.4.  Once you speed up your data loading, especially for extremely large data sets, you will be able to generate faster time to results for all analyses and projects.

5.     Persisted data can stay in memory to support multiple users or processing steps

Similar to SAS LASR, CAS can be structured to persist large data sets in memory, indefinitely.  This allows users to access the same data at the same time and eliminates redundancy and repetitive I/O, potentially saving valuable compute cycles.  Essentially, you can load the data once and then as many people (or processing steps) can reuse it as many times as needed thereafter.

6.     State-of-the-art Machine Learning (ML) techniques (including Gradient Boosting, Random Forest, Support Vector Machines, Factorization Machines, Deep Learning and NLP analytics)

All the most popular ML techniques are represented giving you the flexibility to customize model tournaments to include those techniques most appropriate for your given data and problem set.  We also provide assessment capabilities, thus saving you valuable time to get the types of information you need to make valid model comparisons (like ROC charts, lift charts, etc.) and pick your champion models.  We do not have extreme Gradient Boosting, Factorization Machines, or a specific Assessment procedure in SAS 9.4.  Also, GPU processing is supported in SAS Viya 3.3 for Deep Neural Networks and Convolutional Neural Networks (this was not available previously).

7.     In-memory TRANSPOSE

The task of transposing data amounts to about 80% of any model building exercise, since predictive analytics requires a specialized data set called a ‘one-row-per-subject’ Analytic Base Table (ABT).  SAS Viya allows you to transpose in a fraction of the time that it used to take to develop the critical ABT outputs.  A phenomenal time-saver procedure that now runs entirely multi-threaded, in-memory.

8.     APIs!!!

The ability to code from external interfaces gives coders the flexibility they need in today’s fast-moving programming world.  SAS Viya supports native language bindings for Lua, Java, Python and R.  This means, for example, that you can launch SAS processes from a Jupyter Notebook while staying within a Python coding environment.  SAS also provides a REST API for use in data science and IT departments.

9.     Improved model build and deployment options

The core SAS Viya machine learning techniques support auto-tuning.  SAS has the most effective hyper-parameter search and optimization routines, allowing data scientists to arrive at the correct algorithm settings with higher probability and speed, giving them better answers with less effort.  And because ML scoring code output is significantly more complex, SAS Viya Data Mining and Machine Learning allows you to deploy compact binary score files (called Astore files) into databases to help facilitate scoring.  These binary files do not require compilation and can be pushed to ESP-supported edge analytics.  Additionally, training within event streams is being examined for a future release.

10.    Tons of new SAS visual interface advantages

A.     Less coding – SAS Viya acts as a code generator, producing batch code for repeatability and score code for easier deployment.  Both batch code and score code can be produced in a variety of formats, including SAS, Java, and Python.

B.     Improved data integration between SAS Viya visual analytics products – you can now edit your data in-memory and pass it effortlessly through to reporting, modeling, text, and forecasting applications (new tabs in a single application interface).

C.     Ability to compare modeling pipelines – now data scientists can compare champion models from any number of pipelines (think of SAS9 EM projects or data flows) they’ve created.

D.     Best practices and white box templates – once only available as part of SAS 9 Rapid Predictive Modeler, Model Studio now gives you easy access to basic, intermediate and advanced model templates.

E.     Reusable components – Users can save their best work (including pipelines and individual nodes) and share it with others.  Collaborating is easier than ever.

11.    Data flexibility

You can load big data without having all that data fit into memory.  Before in HPA or LASR engines, the memory environment had to be sized exactly to fit all the data.  That prior requirement has been removed using CAS technology – a really nice feature.

12.    Overall consolidation and consistency

SAS Viya seeks to standardize on common algorithms and techniques provided within every analytic technique so that you don’t get different answers when attempting to do things using alternate procedures or methods. For instance, our deployment of Stochastic Gradient Descent is now the same in every technique that uses that method.  Consistency also applies to the interfaces, as SAS Viya attempts to standardize the look-and-feel of various interfaces to reduce your learning curve when using a new capability.

The net result of these Top 12 advantages is that you have access to state-of-the-art technology, jobs finish faster, and you ultimately get faster time-to-value.  While this idea has been articulated in some of the above points, it is important to re-emphasize because SAS Viya benefits, when added together, result in higher throughputs of work, a greater flexibility in terms of options, and the ability to keep running when other systems would have failed.  You just have a much greater efficiency/productivity level when using SAS Viya as compared to before.  So why not use it?

Learn more about SAS Viya.
Tutorial Library: An introduction to SAS Viya programming for SAS 9 programmers.
Blog: Adding SAS Viya to your SAS 9 programming toolbox.

Top 12 Advantages of SAS Viya was published on SAS Users.

10月 122017
 

With SAS Data Management, you can setup SAS Data Remediation to manage and correct data issues. SAS Data Remediation allows user- or role-based access to data exceptions.

When a data issue is discovered it can be sent automatically or manually to a remediation queue where it can be corrected by designated users.

Let’s look how to setup a remediation service and how to send issue records to Data Remediation.

Register the remediation service.

To register a remediation service in SAS Data Remediation, we go to the Data Remediation Administrator and select “Add New Client Application.”

Under Properties we supply an ID, which can be the name of the remediation service as long as it is unique, and a Display name, which is the name showing in the Remediation UI.

Under the Subject Area tab, we can register different subject categories for this remediation service.  When calling the remediation service, we can categorize remediation issues by setting different subject areas. We can, for example, use the Subject Area to point to different data quality dimensions like Completeness, Uniqueness, Validity, Accuracy and Consistency.

Under the Issue Types tab, we can register issue categories. This enables us to categorize the different remediation issues, for example by pointing to the affected part of the record, like Name, Address or Phone Number.

At Task Templates/Select Templates, we can set a workflow to be used for each issue type. You can design your own workflow using SAS Workflow Studio, or you can use a prepared workflow that comes with Data Remediation. You need to make sure that the desired workflow is loaded onto the Workflow Server to link it to the Data Remediation service. Workflows are not mandatory in SAS Data Remediation, but they will improve the efficiency of the remediation process.

Saving the remediation service will make it available to be called.

Sending issues to Data Remediation.

When you process data and have identified issues that you want to send to Data Remediation, you can either call Data Remediation immediately from the job where you process the data, or store the issue records in a table first and then, in a second step, create remediation records via a Data Management job.

To send records to Data Remediation, you can call the remediation REST API from the HTTP Request node in a Data Management job.

Remediation REST API

The REST API expects a JSON structure supplying all required information:

{
	"application": "mandatory",
	"subjectArea": "mandatory",
	"name": "mandatory",
	"description": "",
	"userDefinedFieldLabels": {
		"1": "",
		"2": "",
		"3": ""
	},
	"topics": [{
		"url": "",
		"name": "",
		"userDefinedFields": {
			"1": "",
			"2": "",
			"3": ""
		},
		"key": "",
		"issues": [{
			"name": "mandatory",
			"importance": "",
			"note": "",
			"assignee": {
				"name": ""
			},
			"workflowName": "",
			"dueDate": "",
			"status": ""
		}]
	}]
}

 


In a Data Management job, you can create the JSON structure in an Expression node and use field substitution to pass in the necessary values from the issue records. The expression code could look like this:

REM_APPLICATION= "Customer Record"
REM_SUBJECT_AREA= "Completeness"
REM_PACKAGE_NAME= "Data Correction"
REM_PACKAGE_DESCRIPTION= "Mon-Result: " &formatdate(today(),"DD MM YY") 
REM_URL= "http://myserver/Sourcesys/#ID=" &record_id
REM_ITEM_NAME= "Mobile phone number missing"
REM_FIELDLABEL_1= "Source System"
REM_FIELD_1= "CRM"
REM_FIELDLABEL_2= "Record ID"
REM_FIELD_2= record_id
REM_FIELDLABEL_3= "-"
REM_FIELD_3= ""
REM_KEY= record_id
REM_ISSUE_NAME= "Phone Number"
REM_IMPORTANCE= "high"
REM_ISSUE_NOTE= "Violated data quality rule phone: 4711"
REM_ASSIGNEE= "Ben"
REM_WORKFLOW= "Customer Tag"
REM_DUE_DATE= "2018-11-01"
REM_STATUS= "open"
 
JSON_REQUEST= '
{
  "application":"' &REM_APPLICATION &'",
  "subjectArea":"' &REM_SUBJECT_AREA &'",
  "name":"' &REM_PACKAGE_NAME &'",
  "description":"' &REM_PACKAGE_DESCRIPTION &'",
  "userDefinedFieldLabels": {
    "1":"' &REM_FIELDLABEL_1 &'",
    "2":"' &REM_FIELDLABEL_2 &'",
    "3":"' &REM_FIELDLABEL_3 &'"
  },
  "topics": [{
    "url":"' &REM_URL &'",
    "name":"' &REM_ITEM_NAME &'",
    "userDefinedFields": {
      "1":"' &REM_FIELD_1 &'",
      "2":"' &REM_FIELD_2 &'",
      "3":"' &REM_FIELD_3 &'"
    },
    "key":"' &REM_KEY &'",
    "issues": [{
      "name":"' &REM_ISSUE_NAME &'",
      "importance":"' &REM_IMPORTANCE &'",
      "note":"' &REM_ISSUE_NOTE &'",
      "assignee": {
        "name":"' &REM_ASSIGNEE &'"
      },
      "workflowName":"' &REM_WORKFLOW &'",
      "dueDate":"' &REM_DUE_DATE &'",
      "status":"' &REM_STATUS &'"
    }]
  }]
}'

 

Tip: You could also write a global function to generate the JSON structure.
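For readers who prefer to see the payload outside the Expression-node syntax, here is a hedged Python sketch of the same structure. The server address and all field values are placeholders copied from the example above, and the POST helper is only defined here, not called:

```python
import json
import urllib.request

# Hypothetical Python equivalent of the Expression-node code: build the
# remediation payload as a dict and POST it to the Data Remediation REST API.
# The server URL and all field values are placeholders, not real endpoints.
record_id = "4711"

payload = {
    "application": "Customer Record",
    "subjectArea": "Completeness",
    "name": "Data Correction",
    "description": "Mon-Result: 01 11 18",
    "userDefinedFieldLabels": {"1": "Source System", "2": "Record ID", "3": "-"},
    "topics": [{
        "url": f"http://myserver/Sourcesys/#ID={record_id}",
        "name": "Mobile phone number missing",
        "userDefinedFields": {"1": "CRM", "2": record_id, "3": ""},
        "key": record_id,
        "issues": [{
            "name": "Phone Number",
            "importance": "high",
            "note": "Violated data quality rule phone: 4711",
            "assignee": {"name": "Ben"},
            "workflowName": "Customer Tag",
            "dueDate": "2018-11-01",
            "status": "open",
        }],
    }],
}

def send_to_remediation(body: dict) -> None:
    # POST the JSON body; run this only against a real Data Remediation server.
    req = urllib.request.Request(
        "http://server:port/SASDataRemediation/rest/groups",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)

print(payload["topics"][0]["issues"][0]["name"])  # → Phone Number
```

Building the payload as a dict and serializing with json.dumps avoids the quoting mistakes that string concatenation invites.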

After creating the JSON structure, you can invoke the web service to create remediation records. In the HTTP Request node, you call the web service as follows:

Address:  http://[server]:[port]/SASDataRemediation/rest/groups
Method: post
Input Field: The variable containing the JSON structure, i.e. JSON_REQUEST
Output Field: A field to take the output from the web service. You can use the New button to create a field and set its size to 1000.
Under Security… you can set a defined user and password to access Data Remediation.
In the HTTP Request node’s advanced settings, set the WSCP_HTTP_CONTENT_TYPE option to application/json.


You can now execute the Data Management job to create the remediation records in SAS Data Remediation.

Improving data quality through SAS Data Remediation was published on SAS Users.

7月 182017
 

In the digital world, where billions of customers are making trillions of visits across multi-channel marketing environments, big data has drawn researchers’ attention all over the world. Customers leave behind a huge trail of data in digital channels, and given exploding data volumes it is becoming extremely difficult to find the right data to help make the right decisions.

This can be a big issue for brands. Traditional databases have not been efficient enough to capture the sheer amount of information or complexity of datasets we accumulate on the web, on social media and other places.

A leading consulting firm, for example, reports that one of its clients has 35 million customers and 9 million unique visitors daily on its website, leaving a huge amount of shopper information every second. The right tools for segmenting this large amount of data to help target marketing activities are not readily available. To make matters more complicated, this data can be structured or unstructured, making traditional methods of analysing data unsuitable.

Tackling market segmentation

Market segmentation is a process by which market researchers identify key attributes about customers and potential customers which can be used to create distinct target market groups. Without a market segmentation base, advertising and sales can lose large amounts of money targeting the wrong set of customers.

Some known methods of segmenting consumer markets include geographical segmentation, demographical segmentation, behavioural segmentation, multi-variable account segmentation and others. Common statistical approaches to segmenting various markets include:

  • Clustering algorithms such as K-Means clustering
  • Statistical mixture models such as Latent Class Analysis
  • Ensemble approaches such as Random Forests

Most of these methods assume the number of clusters to be known, which in reality is never the case. There are several approaches to estimating the number of clusters; however, strong evidence about the quality of the resulting clusters does not exist.
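The point about having to fix the number of clusters up front is visible in even the simplest K-Means implementation. The following is an illustrative pure-Python sketch on synthetic 1-D customer-spend data, not production segmentation code; note that k=2 must be supplied by the analyst:

```python
import random

random.seed(0)

# Minimal K-Means sketch on synthetic 1-D customer-spend data. Note that
# k must be chosen up front: exactly the assumption flagged above.
low_spenders = [random.gauss(20, 3) for _ in range(50)]
high_spenders = [random.gauss(80, 5) for _ in range(50)]
data = low_spenders + high_spenders

def kmeans_1d(points, k, iters=20):
    centroids = random.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to its cluster mean.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

centroids = kmeans_1d(data, k=2)
print([round(c) for c in centroids])  # roughly [20, 80]
```

Re-running with k=3 or k=4 would happily split these same two groups further, which is why the choice of k needs external justification rather than coming from the algorithm itself.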

To add to the above issues, clusters can be domain specific, meaning they are built to solve particular domain problems such as:

  • Delimitation of species of plants or animals in biology.
  • Medical classification of diseases.
  • Discovery and segmentation of settlements and periods in archaeology.
  • Image segmentation and object recognition.
  • Social stratification.
  • Market segmentation.
  • Efficient organization of data bases for search queries.

There are also quite general tasks for which clustering is applied in many subject areas:

  • Exploratory data analysis looking for “interesting patterns” without prescribing any specific interpretation, potentially creating new research questions and hypotheses.
  • Information reduction and structuring of sets of entities from any subject area for simplification, effective communication, or effective access/action such as complexity reduction for further data analysis, or classification systems.
  • Investigating the correspondence of a clustering in specific data with other groupings or characteristics, either hypothesized or derived from other data.

Depending on the application, what is meant by a “cluster” may differ a lot, and the cluster definition and methodology have to be adapted to the specific aim of clustering in the application of interest.

Van Mechelen et al. (1993) set out objective characteristics that “true clusters” should possess, which include the following:

  • Within-cluster dissimilarities should be small.
  • Between-cluster dissimilarities should be large.
  • Clusters should be fitted well by certain homogeneous probability models such as the Gaussian or a uniform distribution on a convex set, or by linear, time series or spatial process models.
  • Members of a cluster should be well represented by its centroid.
  • The dissimilarity matrix of the data should be well represented by the clustering (i.e., by the ultrametric induced by a dendrogram, or by defining a binary metric “in same cluster/in different clusters”).
  • Clusters should be stable.
  • Clusters should correspond to connected areas in data space with high density.
  • The areas in data space corresponding to clusters should have certain characteristics (such as being convex or linear).
  • It should be possible to characterize the clusters using a small number of variables.
  • Clusters should correspond well to an externally given partition or values of one or more variables that were not used for computing the clustering.
  • Features should be approximately independent within clusters.
  • The number of clusters should be low.
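The first two criteria can be checked directly on any candidate clustering by comparing average dissimilarities. The following Python sketch uses made-up 1-D values purely to illustrate the computation:

```python
from itertools import combinations

# Checking the first two criteria on a toy 1-D grouping: the mean
# within-cluster dissimilarity should be small relative to the mean
# between-cluster dissimilarity. The values are illustrative only.
clusters = {"A": [1.0, 1.2, 0.8], "B": [9.0, 9.5, 8.7]}

def mean_within(cl):
    dists = [abs(a - b) for pts in cl.values() for a, b in combinations(pts, 2)]
    return sum(dists) / len(dists)

def mean_between(cl):
    (_, xs), (_, ys) = cl.items()
    dists = [abs(a - b) for a in xs for b in ys]
    return sum(dists) / len(dists)

within, between = mean_within(clusters), mean_between(clusters)
print(within < between)  # → True
```

Indices such as the silhouette coefficient formalize exactly this within-versus-between comparison on a per-point basis.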

Reference

  1. Van Mechelen, I., Hampton, J., Michalski, R.S., Theuns, P. (1993). Categories and Concepts: Theoretical Views and Inductive Data Analysis. Academic Press, London.

What are the characteristics of “true clusters?” was published on SAS Users.