Tech

11月 212017
 

SAS Viya provides a robust, scalable, cloud-ready, distributed runtime engine. This engine is driven by CAS (Cloud Analytic Services), providing fast processing for many data management techniques that run distributive, i.e. using all threads on all defined compute nodes.

Why

PROC APPEND is a common technique used in SAS processes. This technique will concatenate two data sets together. However, PROC APPEND will produce an ERROR if the target CAS table exists prior to the PROC APPEND.

Simulating PROC APPEND

Figure 1. SAS Log with the PROC APPEND ERROR message

Now what?

How

To explain how to simulate PROC APPEND we first need to create two CAS tables. The first CAS table is named CASUSER.APPEND_TARGET. Notice the variables table, row and variable in figure 2.

Figure 2. Creating the CAS table we need to append rows to

The second CAS table is called CASUSER.TABLE_TWO and in figure 3 we can review the variables table, row, and variable.

Figure 3. Creating the table with the rows that we need to append to the existing CAS table

To simulate PROC APPEND we will use a DATA Step. Notice on line 77 in figure 4 we will overwrite an existing CAS table, i.e. CASUSER.APEND_TARGET. On Line 78, we see the first table on the SET statement is CASUSER.APPEND_TARGET, followed by CASUSER.TABLE_TWO. When this DATA Step runs, all of the rows in CASUSER.APPEND_TARGET will be processed first, followed by the rows in CASUSER.TABLE_TWO. Also, note we are not limited to two tables on the SET statement with DATA Step; we can use as many as we need to solve our business problems.

Figure 4. SAS log validating the DATA Step ran in CAS i.e. distributed

The result table created by the DATA Step is shown in figure 5.

Figure 5. Result table from simulating PROC APPEND using a DATA Step

Conclusion

SAS Viya’s CAS processing allows us to stage data for downstream consumption by leveraging robust SAS programming techniques that run distributed, i.e. fast. PROC APPEND is a common procedure used in SAS processes. To simulate PROC APPEND when using CAS tables as source and target tables to the procedure use DATA Step.

How to simulate PROC APPEND in CAS was published on SAS Users.

11月 202017
 

Many SAS users have inquired about SAS Cloud Analytic Services’ (CAS) Distributed Network File System (Learn more about CAS.)

The “NFS” in “DNFS”

Let’s start at the beginning. The “NFS” in DNFS stands for “Network File System” and refers to the ability to share files across a network. As the picture below illustrates, a network file system lets numerous remote hosts access another host’s files.

Understanding DNFS

NFS

There are numerous network file system protocols that can be used for file sharing – e.g. CIFS, Ceph, Lustre – but the most common on Linux is NFS. While NFS is the dominant file-sharing protocol, the “NFS” part of the DNFS does not correspond to the NFS protocol. Currently all the DNFS supported file systems are based on NFS, but DNFS may support file systems based on other protocols in the future. So, it’s best to think of the “NFS” part of “DNFS” as a generic “network file system” (clustered file system) and not the specific NFS protocol.

The “D” in “DNFS”

The “D” in DNFS stands for “Distributed” but it does not refer to the network file system. By definition, that is already distributed since the file system is external to the machines accessing it. The “Distributed” in DNFS refers to CAS’ ability to use a network file system in a massively parallel way. With a supported file system mounted identically on each CAS node, CAS can access (both write) the file system’s CSV and SASHDAT files from every worker node in parallel.

This parallel file access is not an attribute of the file system, it is a capability of CAS. By definition, network file systems facilitate access at the file level, not the block level. With DNFS, CAS actively manages parallel block level I/O to the network file system, making sure file seek locations and block I/O operations do not overlap, etc.

DNFS

 

DNFS as CAS Backing Store

Not only can CAS perform multi-machine parallel I/O from network file systems, it can also memory-map NFS SASHDAT files directly into CAS. Thus, SASHDAT files on DNFS act as both the CASlib data source as well as the virtual memory “backing store,” often negating the need for CAS to utilize memory mapping (mmap()).

Note 1: Data transformations on load, such as row filtering and field selection, as well as encryption can trigger CAS_DISK_CACHE usage. Since the data must be transformed (subset and/or decrypted), CAS copies the transformed data into CAS_DISK_CACHE to support CAS processing.

Note 2: It is possible to use DNFS atop an encrypting file system or hardware storage device. Here, the HDAT blocks are stored encrypted but transmitted to the CAS I/O layer decrypted. Assuming no other transformations, CAS_DISK_CACHE will not be used in this scenario.

DNFS Memory Mapping

Performance Considerations

DNFS-based CAS loading will only be as fast as the slowest component involved. The chosen NFS architecture (hardware and CAS connectivity) should support I/O throughput commensurate with the CAS installation and in-line with the implementation’s service level agreements. So, while NetApp ONTAP clustering architecture. A different file system technology might look a little different but the same basic ideas will apply.

DNFS w/ Multi Machine File System

As described earlier, CAS manages the parallel I/O operations. Requests from CAS are sent to the appliance and handled by the NFS metadata server. The storage device implementing the NFS protocol points CAS DNFS to the proper file and block locations on the NFS data servers which pass the data to the CAS worker nodes directly.

Understanding DNFS was published on SAS Users.

11月 182017
 

Have you heard?  The ODS Destination for PowerPoint Has a New Option

It’s true.  The ODS destination for PowerPoint now has the STARTPAGE= option, which provides you with greater control and flexibility when creating presentations.

Added to the ODS POWERPOINT statement in SAS® 9.4TS1M4, the STARTPAGE= option enables you to force the creation of a new slide between procedures and between ODS layout containers.  Inserting a slide break between layout containers is one of the most impactful ways that you can use this option.

A new layout container does not automatically trigger a new slide within the presentation.  A new slide is started when the current slide is full.  That is the default, but the new STARTPAGE= option gives you the ability to start a new slide between containers even if the current slide is not full.

Examples

Shown below are four procedures placed within three layout containers.

  • The first PROC ODSTEXT step is placed in the first layout container.  Its purpose is to generate a slide of text, with that text roughly centered in the slide.
  • The second PROC ODSTEXT step is in the second container.  Its purpose is to provide useful information about the table and graph.
  • The PROC TABULATE and SGPLOT steps make up the third container.  They are the results of the analysis and, as such, need to be displayed side by side.

Default Behavior of ODS POWERPOINT, Without STARTPAGE=

Let’s look at the default behavior.  In this example, the STARTPAGE= option is not used.

ods powerpoint file='example1.pptx' options(backgroundimage="saslogo_pptx.png");
title;
ods layout gridded x=10% y=25%;
proc odstext;
	p "Have you heard?" /style=[just=c fontsize=42pt color=RoyalBlue];
	p "The STARTPAGE= option has been added to the ODS POWERPOINT statement!" /style=[just=c fontsize=24pt];
run;
ods layout end;
 
ods layout gridded rows=1 columns=1;
   ods region;
   proc odstext;
      p 'Table Shows Total Runs and Hits for Each League';
      p 'The Graph Contains One Bubble for Each Player.  The Size of Each Bubble Represents the Magnitude of the RBIs.';
   run;
ods layout end;
 
ods graphics / width=4.5in height=4in;
ods layout gridded columns=2 column_widths=(47% 47%) column_gutter=1pct;
   ods region;
   proc tabulate data=sashelp.baseball;
      class league;
      var nruns nhits nrbi;
      tables league='', sum='Totals'*(nruns nhits)*f=comma12.;
   run;
 
   ods region;
   proc sgplot data=sashelp.baseball;
      bubble x=nhits y=nruns size=nrbi/ group=league transparency=.3;
   run;
ods layout end;
ods powerpoint close;

Here is the resulting slide output:

ODS Destination for PowerPoint

Those results are not what we hoped they would be.  The output from the second PROC ODSTEXT step, which is to provide information about the table and graph, is on the first slide.  So is the graph!!!  And the graph does not look good because it is the wrong size.  The table is by itself on the second slide.  This is not the desired output at all.

Here Is Where STARTPAGE= Helps!

In this example, an ODS POWERPOINT statement with the STARTPAGE= option is added.  It is placed after the ODS LAYOUT END statement for the first container.

ods powerpoint file='example2.pptx' options(backgroundimage="saslogo_pptx.png");
title;
ods layout gridded x=10% y=25%;
proc odstext;
	p "Have you heard?" /style=[just=c fontsize=42pt color=RoyalBlue];
	p "The STARTPAGE= option has been added to the ODS POWERPOINT statement!" /style=[just=c fontsize=24pt];
run;
ods layout end;
 
ods powerpoint startpage=now; /* <---- Triggers a new slide */
 
ods layout gridded rows=1 columns=1;
   ods region;
   proc odstext;
      p 'Table Shows Total Runs and Hits for Each League';
      p 'The Graph Contains One Bubble for Each Player.  The Size of Each Bubble Represents the Magnitude of the RBIs.';
   run;
ods layout end;
 
ods graphics / width=4.5in height=4in;
ods layout gridded columns=2 column_widths=(47% 47%) column_gutter=1pct;
   ods region;
   proc tabulate data=sashelp.baseball;
      class league;
      var nruns nhits nrbi;
      tables league='', sum='Totals'*(nruns nhits)*f=comma12.;
   run;
 
   ods region;
   proc sgplot data=sashelp.baseball;
      bubble x=nhits y=nruns size=nrbi/ group=league transparency=.3;
   run;
ods layout end;
ods powerpoint close;

The STARTPAGE= option gave us exactly what we need.  The first slide contains just the text (from the first layout container).  The second slide contains more text along with the table and graph (from the second and third layout containers).

Use It Wisely!

The most important thing to know about using the STARTPAGE= option with layout containers is that it has to be placed between containers.  It cannot be placed within a container.  So you cannot put the statement between the ODS LAYOUT GRIDDED and ODS LAYOUT END statements.

For more information about ODS destination for PowerPoint and all of its options, visit this The Dynamic Duo: ODS Layout and the ODS Destination for PowerPoint.  Take a peek at it for more examples of using ODS Layout with the ODS destination for PowerPoint.

The ODS Destination for PowerPoint Has a New Option was published on SAS Users.

11月 162017
 
connect sas db2 via odbc

Illustration by Dooder on Freepik

A few weeks ago I helped a customer set up a successful connection between SAS and a remote DB2 database using ODBC. At first, I thought this case would be as simple as plugging in a cable. Why? Because I already had a fully-functional testing environment with direct interaction using SAS/ACCESS Interface to DB2. Besides, I had other RDBMS connected via ODBC. Therefore, it was just a matter of setting up my ODBC driver manager for DB2. No big deal right? Oh boy, was I wrong! It was not easy at all. It took me two full days to make it work in my lab, and only after some in-depth research and a lot of trial and error.

DB2 can be challenging, especially when most of the documentation I found was either incomplete, outdated, scarce or even unclear. I don't want you to undergo the same frustration I felt. As a blogger and SAS user, it is my duty to make your SAS life a little easier. That's why I decided to publish this blog post to share the solution with the community. This guide provides you with the steps to take to set up a successful connection between SAS and a remote DB2 database via ODBC on Linux, without tears.

Table of Contents

Keep in mind this is an extensive guide. I documented all the technical issues I came across in my lab, and also provided detailed explanations to overcome them. If you are beginning your journey, I recommend you to go through all the steps. However, for your convenience, if you want to jump right into a specific topic, use the following table of contents:

My Environment

My Environment

All commands throughout this guide are based on this table. You should replace the code with your settings when appropriate.

Prerequisites

Before we begin, make sure you meet all of these SAS and DB2 requirements:

System Requirements

It may sound obvious, but you must verify that your SAS release supports your DB2 database version. For instance, My Environment section shows that I'm using SAS 9.4 M4. This document recommends that I need DB2 10.5 or later. Considering I currently have DB2 11.1, I'm ready to go. To review the rest of the combinations visit the SAS System Requirements page.

Another evident must-have is an ODBC driver manager already installed. As you noticed, I use unixODBC version 2.3.1. It is always recommended to have the latest version, which nowadays is 2.3.4. Still, you should be fine if you decide to use any of the 2.3.X releases. In the past, I've seen users facing difficulties with unixODBC 2.2.X.

SAS/ACCESS Interface to ODBC licensed and installed

For this connection to work, it is essential to have SAS/ACCESS Interface to ODBC licensed and installed. First, to verify that this product is licensed, run this code in a SAS session:

proc setinit no alias;
run;

In the output log look for this line:

---SAS/ACCESS Interface to ODBC
                          06JAN2018

If the product is displayed, move on to the next step. If not, look for SAS/ACCESS Interface to DB2 instead. If you have the latter, consider yourself a lucky person. You don't have to continue with this guide because you can connect to DB2 directly. If you need assistance with this off-topic issue, send me a message, and I'll be glad to help you. If you don't have either one, you should get in touch with your sales representative.

Next, to verify that this product is installed, go to your !SASROOT/sasexe directory on your SAS server and list these files:

[sas@muse sasexe]$ ls -l *odb*
-r-xr-xr-x. 1 sas sas 630196 Nov  9  2016 sasiodbu
-r-xr-xr-x. 1 sas sas 603114 Nov  9  2016 sasioodb
-r-xr-xr-x. 1 sas sas 603114 Nov  9  2016 sasioodb_u
-r-xr-xr-x. 1 sas sas  59977 Nov  9  2016 sasodbcb
-r-xr-xr-x. 1 sas sas  59977 Nov  9  2016 sasodbcb_u
-r-xr-xr-x. 1 sas sas 102142 Nov  9  2016 sasodbcs
-r-xr-xr-x. 1 sas sas  71982 Nov  9  2016 sasodbdb
-r-xr-xr-x. 1 sas sas  71990 Nov  9  2016 sasodbdb_u
-r-xr-xr-x. 1 sas sas 202343 Nov  9  2016 sasodb
-r-xr-xr-x. 1 sas sas 201815 Nov  9  2016 sasodb_u
-r-xr-xr-x. 1 sas sas 443852 Nov  9  2016 tkeodbc.so

If you don't see the above list, then the product is not installed. Although, if your SAS license includes it, use your SAS depot to install it on your server.

Choosing the right DB2 client package

IBM offers five different client packages. If you don't know which package to choose, you can't go wrong with IBM Data Server Client, it's the package I use, it includes all the client tools and libraries available. Another good alternative is IBM Data Server Runtime Client, but you will have to create your DB2 instance and user yourself. I would not recommend IBM Data Server Driver for ODBC and CLI. During my research I found that people could not get it to connect to a remote DB2 database. For more information about each client package, please visit IBM data server client and driver types.

Testing your DB2 connection outside of SAS

This test is the most critical requirement before starting with the actual configuration process. Take my advice for any connection you plan to configure in the future. Firstly, you must establish a successful connection between your database client and database server outside of SAS, using your database vendor tools. This validation is a must, regardless of your RDBMS or SAS engine.

To test your connection to your remote DB2 server, run an equivalent command on your DB2 client:

[db2inst@muse ~]$ db2 connect to GSDB user db2inst using mypassword
 
   Database Connection Information
 
 Database server        = DB2/LINUXX8664 11.1.1
 SQL authorization ID   = DB2INST
 Local database alias   = GSDB

If you received a similar output you might continue; otherwise, you will have to catalog your DB2 server as a TCP/IP node, and then catalog the database on that node. Since this is beyond the scope of this guide, please review these IBM documents: CATALOG NODE command and CATALOG DATABASE command.

Setting up unixODBC with DB2

Considering that you have tested a successful connection between your database client and database server with the db2 command, it's time to set up a new one using an ODBC driver instead. Please bear with me. Things are about to get interesting.

As mentioned at the beginning, unixODBC is my ODBC driver manager of choice. I use it for all my RDBMS in my lab to test ODBC connections. Its usage is almost effortless, but one of the downsides is the documentation. The only official document available for DB2 is: Using IBM DB2 with unixODBC. Judge for yourself.

This driver manager has two configuration files: odbcinst.ini for database drivers and odbc.ini for database definitions. Typically with most databases, those two files are all that it takes to set up an ODBC connection. DB2 plays a different game though. It uses two extra files: db2cli.ini for DB2 definitions and an XML based file called db2dsdriver.cfg.

For now, let's take a closer look at the contents of both unixODBC files:

odbcinst.ini

[DB2]
Description     = IBM DB2 ODBC Driver
Driver          = /opt/ibm/db2/V11.1/lib32/libdb2.so
Driver64        = /opt/ibm/db2/V11.1/lib64/libdb2.so
FileUsage       = 1
DontDLClose     = 1

odbc.ini

[db2odbc]
Driver          = DB2
Description     = DB2 ODBC connection for SAS

At this point I was feeling confident and thought the connection would work, so I tested it out with the isql command, but to my surprise, I received this error:

[db2inst@muse ~]$ isql -v db2odbc db2inst mypassword
[08001][unixODBC][IBM][CLI Driver] SQL1013N  The database alias name or database name "DB2ODBC" could not be found.  SQLSTATE=42705
 
[ISQL]ERROR: Could not SQLConnect

db2cli.ini came to the rescue

As a result, I started looking up online documentation. After a while, I discovered the existence of the db2cli.ini file. To get you started, IBM provides a sample configuration file called db2cli.ini.sample, located in the installation_path/cfg directory. The DB2 documentation recommends creating a db2cli.ini file based on that sample file and place it in the same location. I followed their advice, and created my file with the same data source name:

[db2odbc]
Database        = GSDB
Protocol        = TCPIP
Hostname        = redhat
ServiceName     = 50000

Note: For IBM, the ServiceName keyword is the port number. The default port number for the DB2 default instance is 50000. Review the correct port with your DBA.

Having set up both unixODBC and DB2 files correctly, I was feeling lucky. Therefore, I attempted a connection again, but I received the same error. My frustration was on the rise.

db2cli validate tool is your best friend

I continued digging into the documentation until I found a powerful tool: db2cli with a particular parameter: validate. This command/parameter duo can help you verify the contents of your configuration files (db2cli.ini and db2dsdriver.cfg), list their location, test your data source names and databases, display invalid/duplicate keywords, among other functionalities. It basically is your new best friend.

db2cli validate helped me identify why the DB2 ODBC driver was unable to find my data source name. I ran this command to examine my environment, and the output was eye-opening:

[db2inst@muse ~]$ db2cli validate -dsn db2odbc
 
===============================================================================
Client information for the current copy:
===============================================================================
 
Client Package Type       : IBM Data Server Client
Client Version (level/bit): DB2 v11.1.0.0 (s1606081100/64-bit)
Client Platform           : Linux/X8664
Install/Instance Path     : /opt/ibm/db2/V11.1
DB2DSDRIVER_CFG_PATH value: 
db2dsdriver.cfg Path      : /home/db2inst/sqllib/cfg/db2dsdriver.cfg
DB2CLIINIPATH value       : 
db2cli.ini Path           : /home/db2inst/sqllib/cfg/db2cli.ini
db2diag.log Path          : /home/db2inst/sqllib/db2dump/db2diag.log
 
===============================================================================
db2dsdriver.cfg schema validation for the entire file:
===============================================================================
 
Note: The validation utility could not find the configuration file 
db2dsdriver.cfg. The file is searched at 
"/home/db2inst/sqllib/cfg/db2dsdriver.cfg".
 
===============================================================================
db2cli.ini validation for data source name "db2odbc":
===============================================================================
 
Note: The validation utility could not find the configuration file db2cli.ini. 
The file is searched at "/home/db2inst/sqllib/cfg/db2cli.ini".
 
===============================================================================
db2dsdriver.cfg validation for data source name "db2odbc":
===============================================================================
 
Note: The validation utility could not find the configuration file 
db2dsdriver.cfg. The file is searched at 
"/home/db2inst/sqllib/cfg/db2dsdriver.cfg".
 
===============================================================================
The validation is completed.
===============================================================================

As you can see, the validation tool didn't find my db2cli.ini. That file contained my data source name; consequently, the DB2 ODBC driver didn't find it either. As previously indicated, I created that file in the installation_path/cfg directory, which it was: /opt/ibm/db2/V11.1/cfg, but the DB2 ODBC driver was searching for it in: /home/db2inst/sqllib/cfg. I didn't choose that path randomly; I just followed this IBM document. Nevertheless, if you encounter the same issue, the solution is fairly simple. You can either create a copy in the required path or point to the original file using a symbolic link. I chose the latter to avoid having duplicates. To fix this misunderstanding just execute this command as root or sudo:

[root@muse ~]# ln -s /opt/ibm/db2/V11.1/cfg/db2cli.ini /home/db2inst/sqllib/cfg/db2cli.ini

Now you should have this symbolic link:

[db2inst@muse ~]$ ls -l /home/db2inst/sqllib/cfg/db2cli.ini 
lrwxrwxrwx. 1 root db2iadm 33 Sep 11 19:07 /home/db2inst/sqllib/cfg/db2cli.ini -> /opt/ibm/db2/V11.1/cfg/db2cli.ini

After this fix, I used my new friend again, and my environment changed. Below is a partial output:

[db2inst@muse ~]$ db2cli validate -dsn db2odbc
===============================================================================
db2cli.ini validation for data source name "db2odbc":
===============================================================================
 
[ Keywords used for the connection ]
 
Keyword                   Value
---------------------------------------------------------------------------
DATABASE                  GSDB
PROTOCOL                  TCPIP
HOSTNAME                  redhat
SERVICENAME               50000

At last, the DB2 ODBC driver was able to find the configuration file, read its contents, and validate the keywords. My frustration started to disappear.

It was time to see what unixODBC thought about this change. So I reran my isql command:

[db2inst@muse ~]$ isql -v db2odbc db2inst mypassword
+---------------------------------------+
| Connected!                            |
|                                       |
| sql-statement                         |
| help [tablename]                      |
| quit                                  |
|                                       |
+---------------------------------------+
SQL>

It finally worked!

Do we need to use db2dsdriver.cfg?

I managed to set up a successful ODBC connection using only three files: db2cli.ini, along with odbc.ini, and odbcinst.ini. Thus, the answer is no; you don't need to use this additional DB2 configuration file.

For our scenario, I consider it as optional. Previously, I pointed out that it was an XML file, as opposed to db2cli.ini, which is a text file. Both can have the same four keywords I defined. Then why does DB2 provide two files? db2cli.ini is used to configure the behavior of CLI/ODBC applications, whereas db2dsdriver.cfg can be used for those as well, and for other types of applications including PHP, Perl, .NET, among others.

In short, this concludes the DB2 configuration process. Now it's time to cover the SAS territory.

SAS joins the party

So far we haven't touched any SAS files. We have only tested the DB2 connection first using the db2 command, then using unixODBC's isql command. These are mandatory steps before a successful connection with SAS.

Getting sasenv_local ready for DB2

If you have already connected SAS to other RDBMS in the past, you know that the most important file you need to care about is sasenv_local. It is located in the !SASROOT/bin directory. The code below shows a partial output with both ODBC and DB2 variables correctly defined:

# ODBC settings
export ODBCSYSINI=/etc
 
# DB2 settings
export INSTHOME=/opt/ibm/db2/V11.1
export DB2CLIINIPATH=/opt/ibm/db2/V11.1/cfg
 
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/lib64:${INSTHOME}/lib64

Let's analyze each of the above settings. First, SAS needs to know where to locate both unixODBC configuration files. The ODBCSYSINI variable instructs SAS to look for them in the /etc directory.

Second, as a good practice, I defined an environment variable called INSTHOME to reference my DB2 installation directory. This variable is optional but highly recommended. Then, the DB2CLIINIPATH variable tells SAS the location of db2cli.ini. If you decide to use db2dsdriver.cfg instead, then replace that variable with DB2DSDRIVER_CFG_PATH.

Finally, you have to inform SAS where to find unixODBC and DB2 libraries. SAS uses the LD_LIBRARY_PATH environment variable on Linux. This variable points to /usr/lib64 because it's the default path for all unixODBC libraries. I also appended the $INSTHOME/lib64 directory, since it's the place where DB2 has all its libraries.

At this point, you might think that you are all set. I don't like being pessimistic, but there is a high possibility that your connection may not work. I didn't test all SAS and DB2 releases, but my customer and I experienced the same error when we tested our connection with a LIBNAME statement:

[sas@muse 9.4]$ ./sas -nodms
 
  1? libname db2lib odbc user=db2inst password=mypassword datasrc=db2odbc;
 
ERROR:  An exception has been encountered.
Please contact technical support and provide them with the following
       traceback information:
 
The SAS task name is [Line Mod]
Segmentation Violation
# More lines...

Moreover, if you run this code in SAS Studio or SAS Enterprise Guide, you might not even get a response, SAS will just crash or hang for a long time.

Fixing the Segmentation Violation error

After intense research, I was able to fix this error. We need to link two ODBC libraries that SAS provides to enable 32-bit capability. Change to your !SASROOT/sasexe directory and backup both files:

[sas@muse sasexe]$ mv sasodb sasodb.orig
[sas@muse sasexe]$ mv sasioodb sasioodb.orig

Create two symbolic links that point to the 32-bit libraries:

[sas@muse sasexe]$ ln -s sasodb_u sasodb
[sas@muse sasexe]$ ln -s sasioodb_u sasioodb

List your files, and the output should resemble the following:

[sas@muse sasexe]$ ls -l *odb*
-r-xr-xr-x. 1 sas sas 630196 Nov  9  2016 sasiodbu
lrwxrwxrwx. 1 sas sas     10 Sep 11 21:51 sasioodb -> sasioodb_u
-r-xr-xr-x. 1 sas sas 603114 Nov  9  2016 sasioodb.orig
-r-xr-xr-x. 1 sas sas 603114 Nov  9  2016 sasioodb_u
lrwxrwxrwx. 1 sas sas      8 Sep 11 21:52 sasodb -> sasodb_u
-r-xr-xr-x. 1 sas sas  59977 Nov  9  2016 sasodbcb
-r-xr-xr-x. 1 sas sas  59977 Nov  9  2016 sasodbcb_u
-r-xr-xr-x. 1 sas sas 102142 Nov  9  2016 sasodbcs
-r-xr-xr-x. 1 sas sas  71982 Nov  9  2016 sasodbdb
-r-xr-xr-x. 1 sas sas  71990 Nov  9  2016 sasodbdb_u
-r-xr-xr-x. 1 sas sas 202343 Nov  9  2016 sasodb.orig
-r-xr-xr-x. 1 sas sas 201815 Nov  9  2016 sasodb_u
-r-xr-xr-x. 1 sas sas 443852 Nov  9  2016 tkeodbc.so

Now your LIBNAME statement may work, but unlike my customer, you can also encounter this error:

  1? libname db2lib odbc user=db2inst password=mypassword datasrc=db2odbc;
 
ERROR: Could not load /sas/Software/SASFoundation/9.4/sasexe/sasodb (35
       images loaded)
ERROR: libodbc.so.1: cannot open shared object file: No such file or directory
ERROR: The SAS/ACCESS Interface to ODBC cannot be loaded. The SASODB   code
       appendage could not be loaded.
ERROR: Error in the LIBNAME statement.

The error is self-explanatory. SAS couldn't find the file libodbc.so.1, which SASODB needs. To display all the required shared libraries, execute the ldd command:

[sas@muse sasexe]$ ldd sasodb
        linux-vdso.so.1 =>  (0x00007ffd4efe2000)
        libodbc.so.1 => not found
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f9aa741d000)
        libcrypt.so.1 => /lib64/libcrypt.so.1 (0x00007f9aa71e6000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f9aa6fe2000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f9aa6cdf000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f9aa691e000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f9aa6708000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f9aa7882000)
        libfreebl3.so => /lib64/libfreebl3.so (0x00007f9aa6504000)

Indeed, that library was not found, so to fix this missing dependency, create a symbolic link using root or sudo:

[root@muse ~]# ln -s /lib64/libodbc.so.2.0.0 /lib64/libodbc.so.1

Creating a library that works

Now that the previous errors are gone, run a final test using the SAS command line to assign a library and then SAS Studio to display some data in a nice-looking way:

[sas@muse 9.4]$ ./sas -nodms
  1? libname db2lib odbc user=db2inst password=mypassword datasrc=db2odbc;
 
NOTE: Libref DB2LIB was successfully assigned as follows: 
      Engine:        ODBC 
      Physical Name: db2odbc

Note: Displaying actual data is an essential last step because if you are pointing to an incorrect DB2 ODBC driver (such as libdb2o.so instead of libdb2.so), you can still assign a library successfully, but you won't see any tables.

In SAS Studio I assigned the same library and ran a small query to retrieve some data:

proc print data=db2lib.department_lookup;
    var department_code department_en;
run;

Sample data in SAS Studio

Wrapping Up

I know it's a lot to digest. However, some users may not need to follow all steps to succeed, whereas others starting from scratch may require performing all steps carefully. Whatever your situation might be, I hope you find this guide useful and easy to follow. Did you struggle like me? Share your experience in the comments below or if you have any problems, don't hesitate to contact me.

Connecting SAS to a DB2 database via ODBC without tears was published on SAS Users.

11月 162017
 
connect sas db2 via odbc

Illustration by Dooder on Freepik

A few weeks ago I helped a customer set up a successful connection between SAS and a remote DB2 database using ODBC. At first, I thought this case would be as simple as plugging in a cable. Why? Because I already had a fully-functional testing environment with direct interaction using SAS/ACCESS Interface to DB2. Besides, I had other RDBMS connected via ODBC. Therefore, it was just a matter of setting up my ODBC driver manager for DB2. No big deal right? Oh boy, was I wrong! It was not easy at all. It took me two full days to make it work in my lab, and only after some in-depth research and a lot of trial and error.

DB2 can be challenging, especially when most of the documentation I found was either incomplete, outdated, scarce or even unclear. I don't want you to undergo the same frustration I felt. As a blogger and SAS user, it is my duty to make your SAS life a little easier. That's why I decided to publish this blog post to share the solution with the community. This guide provides you with the steps to take to set up a successful connection between SAS and a remote DB2 database via ODBC on Linux, without tears.

Table of Contents

Keep in mind this is an extensive guide. I documented all the technical issues I came across in my lab, and also provided detailed explanations to overcome them. If you are beginning your journey, I recommend you to go through all the steps. However, for your convenience, if you want to jump right into a specific topic, use the following table of contents:

My Environment

My Environment

All commands throughout this guide are based on this table. You should replace the code with your settings when appropriate.

Prerequisites

Before we begin, make sure you meet all of these SAS and DB2 requirements:

System Requirements

It may sound obvious, but you must verify that your SAS release supports your DB2 database version. For instance, My Environment section shows that I'm using SAS 9.4 M4. This document recommends that I need DB2 10.5 or later. Considering I currently have DB2 11.1, I'm ready to go. To review the rest of the combinations visit the SAS System Requirements page.

Another evident must-have is an ODBC driver manager already installed. As you noticed, I use unixODBC version 2.3.1. It is always recommended to have the latest version, which nowadays is 2.3.4. Still, you should be fine if you decide to use any of the 2.3.X releases. In the past, I've seen users facing difficulties with unixODBC 2.2.X.

SAS/ACCESS Interface to ODBC licensed and installed

For this connection to work, it is essential to have SAS/ACCESS Interface to ODBC licensed and installed. First, to verify that this product is licensed, run this code in a SAS session:

proc setinit no alias;
run;

In the output log look for this line:

---SAS/ACCESS Interface to ODBC
                          06JAN2018

If the product is displayed, move on to the next step. If not, look for SAS/ACCESS Interface to DB2 instead. If you have the latter, consider yourself a lucky person. You don't have to continue with this guide because you can connect to DB2 directly. If you need assistance with this off-topic issue, send me a message, and I'll be glad to help you. If you don't have either one, you should get in touch with your sales representative.

Next, to verify that this product is installed, go to your !SASROOT/sasexe directory on your SAS server and list these files:

[sas@muse sasexe]$ ls -l *odb*
-r-xr-xr-x. 1 sas sas 630196 Nov  9  2016 sasiodbu
-r-xr-xr-x. 1 sas sas 603114 Nov  9  2016 sasioodb
-r-xr-xr-x. 1 sas sas 603114 Nov  9  2016 sasioodb_u
-r-xr-xr-x. 1 sas sas  59977 Nov  9  2016 sasodbcb
-r-xr-xr-x. 1 sas sas  59977 Nov  9  2016 sasodbcb_u
-r-xr-xr-x. 1 sas sas 102142 Nov  9  2016 sasodbcs
-r-xr-xr-x. 1 sas sas  71982 Nov  9  2016 sasodbdb
-r-xr-xr-x. 1 sas sas  71990 Nov  9  2016 sasodbdb_u
-r-xr-xr-x. 1 sas sas 202343 Nov  9  2016 sasodb
-r-xr-xr-x. 1 sas sas 201815 Nov  9  2016 sasodb_u
-r-xr-xr-x. 1 sas sas 443852 Nov  9  2016 tkeodbc.so

If you don't see the above list, then the product is not installed. Although, if your SAS license includes it, use your SAS depot to install it on your server.

Choosing the right DB2 client package

IBM offers five different client packages. If you don't know which package to choose, you can't go wrong with IBM Data Server Client, it's the package I use, it includes all the client tools and libraries available. Another good alternative is IBM Data Server Runtime Client, but you will have to create your DB2 instance and user yourself. I would not recommend IBM Data Server Driver for ODBC and CLI. During my research I found that people could not get it to connect to a remote DB2 database. For more information about each client package, please visit IBM data server client and driver types.

Testing your DB2 connection outside of SAS

This test is the most critical requirement before starting with the actual configuration process. Take my advice for any connection you plan to configure in the future. Firstly, you must establish a successful connection between your database client and database server outside of SAS, using your database vendor tools. This validation is a must, regardless of your RDBMS or SAS engine.

To test your connection to your remote DB2 server, run an equivalent command on your DB2 client:

[db2inst@muse ~]$ db2 connect to GSDB user db2inst using mypassword
 
   Database Connection Information
 
 Database server        = DB2/LINUXX8664 11.1.1
 SQL authorization ID   = DB2INST
 Local database alias   = GSDB

If you received a similar output you might continue; otherwise, you will have to catalog your DB2 server as a TCP/IP node, and then catalog the database on that node. Since this is beyond the scope of this guide, please review these IBM documents: CATALOG NODE command and CATALOG DATABASE command.

Setting up unixODBC with DB2

Considering that you have tested a successful connection between your database client and database server with the db2 command, it's time to set up a new one using an ODBC driver instead. Please bear with me. Things are about to get interesting.

As mentioned at the beginning, unixODBC is my ODBC driver manager of choice. I use it for all my RDBMS in my lab to test ODBC connections. Its usage is almost effortless, but one of the downsides is the documentation. The only official document available for DB2 is: Using IBM DB2 with unixODBC. Judge for yourself.

This driver manager has two configuration files: odbcinst.ini for database drivers and odbc.ini for database definitions. Typically with most databases, those two files are all that it takes to set up an ODBC connection. DB2 plays a different game though. It uses two extra files: db2cli.ini for DB2 definitions and an XML based file called db2dsdriver.cfg.

For now, let's take a closer look at the contents of both unixODBC files:

odbcinst.ini

[DB2]
Description     = IBM DB2 ODBC Driver
Driver          = /opt/ibm/db2/V11.1/lib32/libdb2.so
Driver64        = /opt/ibm/db2/V11.1/lib64/libdb2.so
FileUsage       = 1
DontDLClose     = 1

odbc.ini

[db2odbc]
Driver          = DB2
Description     = DB2 ODBC connection for SAS

At this point I was feeling confident and thought the connection would work, so I tested it out with the isql command, but to my surprise, I received this error:

[db2inst@muse ~]$ isql -v db2odbc db2inst mypassword
[08001][unixODBC][IBM][CLI Driver] SQL1013N  The database alias name or database name "DB2ODBC" could not be found.  SQLSTATE=42705
 
[ISQL]ERROR: Could not SQLConnect

db2cli.ini came to the rescue

As a result, I started looking up online documentation. After a while, I discovered the existence of the db2cli.ini file. To get you started, IBM provides a sample configuration file called db2cli.ini.sample, located in the installation_path/cfg directory. The DB2 documentation recommends creating a db2cli.ini file based on that sample file and place it in the same location. I followed their advice, and created my file with the same data source name:

[db2odbc]
Database        = GSDB
Protocol        = TCPIP
Hostname        = redhat
ServiceName     = 50000

Note: For IBM, the ServiceName keyword is the port number. The default port number for the DB2 default instance is 50000. Review the correct port with your DBA.

Having set up both unixODBC and DB2 files correctly, I was feeling lucky. Therefore, I attempted a connection again, but I received the same error. My frustration was on the rise.

db2cli validate tool is your best friend

I continued digging into the documentation until I found a powerful tool: db2cli with a particular parameter: validate. This command/parameter duo can help you verify the contents of your configuration files (db2cli.ini and db2dsdriver.cfg), list their location, test your data source names and databases, display invalid/duplicate keywords, among other functionalities. It basically is your new best friend.

db2cli validate helped me identify why the DB2 ODBC driver was unable to find my data source name. I ran this command to examine my environment, and the output was eye-opening:

[db2inst@muse ~]$ db2cli validate -dsn db2odbc
 
===============================================================================
Client information for the current copy:
===============================================================================
 
Client Package Type       : IBM Data Server Client
Client Version (level/bit): DB2 v11.1.0.0 (s1606081100/64-bit)
Client Platform           : Linux/X8664
Install/Instance Path     : /opt/ibm/db2/V11.1
DB2DSDRIVER_CFG_PATH value: 
db2dsdriver.cfg Path      : /home/db2inst/sqllib/cfg/db2dsdriver.cfg
DB2CLIINIPATH value       : 
db2cli.ini Path           : /home/db2inst/sqllib/cfg/db2cli.ini
db2diag.log Path          : /home/db2inst/sqllib/db2dump/db2diag.log
 
===============================================================================
db2dsdriver.cfg schema validation for the entire file:
===============================================================================
 
Note: The validation utility could not find the configuration file 
db2dsdriver.cfg. The file is searched at 
"/home/db2inst/sqllib/cfg/db2dsdriver.cfg".
 
===============================================================================
db2cli.ini validation for data source name "db2odbc":
===============================================================================
 
Note: The validation utility could not find the configuration file db2cli.ini. 
The file is searched at "/home/db2inst/sqllib/cfg/db2cli.ini".
 
===============================================================================
db2dsdriver.cfg validation for data source name "db2odbc":
===============================================================================
 
Note: The validation utility could not find the configuration file 
db2dsdriver.cfg. The file is searched at 
"/home/db2inst/sqllib/cfg/db2dsdriver.cfg".
 
===============================================================================
The validation is completed.
===============================================================================

As you can see, the validation tool didn't find my db2cli.ini. That file contained my data source name; consequently, the DB2 ODBC driver didn't find it either. As previously indicated, I created that file in the installation_path/cfg directory, which it was: /opt/ibm/db2/V11.1/cfg, but the DB2 ODBC driver was searching for it in: /home/db2inst/sqllib/cfg. I didn't choose that path randomly; I just followed this IBM document. Nevertheless, if you encounter the same issue, the solution is fairly simple. You can either create a copy in the required path or point to the original file using a symbolic link. I chose the latter to avoid having duplicates. To fix this misunderstanding just execute this command as root or sudo:

[root@muse ~]# ln -s /opt/ibm/db2/V11.1/cfg/db2cli.ini /home/db2inst/sqllib/cfg/db2cli.ini

Now you should have this symbolic link:

[db2inst@muse ~]$ ls -l /home/db2inst/sqllib/cfg/db2cli.ini 
lrwxrwxrwx. 1 root db2iadm 33 Sep 11 19:07 /home/db2inst/sqllib/cfg/db2cli.ini -> /opt/ibm/db2/V11.1/cfg/db2cli.ini

After this fix, I used my new friend again, and my environment changed. Below is a partial output:

[db2inst@muse ~]$ db2cli validate -dsn db2odbc
===============================================================================
db2cli.ini validation for data source name "db2odbc":
===============================================================================
 
[ Keywords used for the connection ]
 
Keyword                   Value
---------------------------------------------------------------------------
DATABASE                  GSDB
PROTOCOL                  TCPIP
HOSTNAME                  redhat
SERVICENAME               50000

At last, the DB2 ODBC driver was able to find the configuration file, read its contents, and validate the keywords. My frustration started to disappear.

It was time to see what unixODBC thought about this change. So I reran my isql command:

[db2inst@muse ~]$ isql -v db2odbc db2inst mypassword
+---------------------------------------+
| Connected!                            |
|                                       |
| sql-statement                         |
| help [tablename]                      |
| quit                                  |
|                                       |
+---------------------------------------+
SQL>

It finally worked!

Do we need to use db2dsdriver.cfg?

I managed to set up a successful ODBC connection using only three files: db2cli.ini, along with odbc.ini, and odbcinst.ini. Thus, the answer is no; you don't need to use this additional DB2 configuration file.

For our scenario, I consider it as optional. Previously, I pointed out that it was an XML file, as opposed to db2cli.ini, which is a text file. Both can have the same four keywords I defined. Then why does DB2 provide two files? db2cli.ini is used to configure the behavior of CLI/ODBC applications, whereas db2dsdriver.cfg can be used for those as well, and for other types of applications including PHP, Perl, .NET, among others.

In short, this concludes the DB2 configuration process. Now it's time to cover the SAS territory.

SAS joins the party

So far we haven't touched any SAS files. We have only tested the DB2 connection first using the db2 command, then using unixODBC's isql command. These are mandatory steps before a successful connection with SAS.

Getting sasenv_local ready for DB2

If you have already connected SAS to other RDBMS in the past, you know that the most important file you need to care about is sasenv_local. It is located in the !SASROOT/bin directory. The code below shows a partial output with both ODBC and DB2 variables correctly defined:

# ODBC settings
export ODBCSYSINI=/etc
 
# DB2 settings
export INSTHOME=/opt/ibm/db2/V11.1
export DB2CLIINIPATH=/opt/ibm/db2/V11.1/cfg
 
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/lib64:${INSTHOME}/lib64

Let's analyze each of the above settings. First, SAS needs to know where to locate both unixODBC configuration files. The ODBCSYSINI variable instructs SAS to look for them in the /etc directory.

Second, as a good practice, I defined an environment variable called INSTHOME to reference my DB2 installation directory. This variable is optional but highly recommended. Then, the DB2CLIINIPATH variable tells SAS the location of db2cli.ini. If you decide to use db2dsdriver.cfg instead, then replace that variable with DB2DSDRIVER_CFG_PATH.

Finally, you have to inform SAS where to find unixODBC and DB2 libraries. SAS uses the LD_LIBRARY_PATH environment variable on Linux. This variable points to /usr/lib64 because it's the default path for all unixODBC libraries. I also appended the $INSTHOME/lib64 directory, since it's the place where DB2 has all its libraries.

At this point, you might think that you are all set. I don't like being pessimistic, but there is a high possibility that your connection may not work. I didn't test all SAS and DB2 releases, but my customer and I experienced the same error when we tested our connection with a LIBNAME statement:

[sas@muse 9.4]$ ./sas -nodms
 
  1? libname db2lib odbc user=db2inst password=mypassword datasrc=db2odbc;
 
ERROR:  An exception has been encountered.
Please contact technical support and provide them with the following
       traceback information:
 
The SAS task name is [Line Mod]
Segmentation Violation
# More lines...

Moreover, if you run this code in SAS Studio or SAS Enterprise Guide, you might not even get a response, SAS will just crash or hang for a long time.

Fixing the Segmentation Violation error

After intense research, I was able to fix this error. We need to link two ODBC libraries that SAS provides to enable 32-bit capability. Change to your !SASROOT/sasexe directory and backup both files:

[sas@muse sasexe]$ mv sasodb sasodb.orig
[sas@muse sasexe]$ mv sasioodb sasioodb.orig

Create two symbolic links that point to the 32-bit libraries:

[sas@muse sasexe]$ ln -s sasodb_u sasodb
[sas@muse sasexe]$ ln -s sasioodb_u sasioodb

List your files, and the output should resemble the following:

[sas@muse sasexe]$ ls -l *odb*
-r-xr-xr-x. 1 sas sas 630196 Nov  9  2016 sasiodbu
lrwxrwxrwx. 1 sas sas     10 Sep 11 21:51 sasioodb -> sasioodb_u
-r-xr-xr-x. 1 sas sas 603114 Nov  9  2016 sasioodb.orig
-r-xr-xr-x. 1 sas sas 603114 Nov  9  2016 sasioodb_u
lrwxrwxrwx. 1 sas sas      8 Sep 11 21:52 sasodb -> sasodb_u
-r-xr-xr-x. 1 sas sas  59977 Nov  9  2016 sasodbcb
-r-xr-xr-x. 1 sas sas  59977 Nov  9  2016 sasodbcb_u
-r-xr-xr-x. 1 sas sas 102142 Nov  9  2016 sasodbcs
-r-xr-xr-x. 1 sas sas  71982 Nov  9  2016 sasodbdb
-r-xr-xr-x. 1 sas sas  71990 Nov  9  2016 sasodbdb_u
-r-xr-xr-x. 1 sas sas 202343 Nov  9  2016 sasodb.orig
-r-xr-xr-x. 1 sas sas 201815 Nov  9  2016 sasodb_u
-r-xr-xr-x. 1 sas sas 443852 Nov  9  2016 tkeodbc.so

Now your LIBNAME statement may work, but unlike my customer, you can also encounter this error:

  1? libname db2lib odbc user=db2inst password=mypassword datasrc=db2odbc;
 
ERROR: Could not load /sas/Software/SASFoundation/9.4/sasexe/sasodb (35
       images loaded)
ERROR: libodbc.so.1: cannot open shared object file: No such file or directory
ERROR: The SAS/ACCESS Interface to ODBC cannot be loaded. The SASODB   code
       appendage could not be loaded.
ERROR: Error in the LIBNAME statement.

The error is self-explanatory. SAS couldn't find the file libodbc.so.1, which SASODB needs. To display all the required shared libraries, execute the ldd command:

[sas@muse sasexe]$ ldd sasodb
        linux-vdso.so.1 =>  (0x00007ffd4efe2000)
        libodbc.so.1 => not found
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f9aa741d000)
        libcrypt.so.1 => /lib64/libcrypt.so.1 (0x00007f9aa71e6000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f9aa6fe2000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f9aa6cdf000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f9aa691e000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f9aa6708000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f9aa7882000)
        libfreebl3.so => /lib64/libfreebl3.so (0x00007f9aa6504000)

Indeed, that library was not found, so to fix this missing dependency, create a symbolic link using root or sudo:

[root@muse ~]# ln -s /lib64/libodbc.so.2.0.0 /lib64/libodbc.so.1

Creating a library that works

Now that the previous errors are gone, run a final test using the SAS command line to assign a library and then SAS Studio to display some data in a nice-looking way:

[sas@muse 9.4]$ ./sas -nodms
  1? libname db2lib odbc user=db2inst password=mypassword datasrc=db2odbc;
 
NOTE: Libref DB2LIB was successfully assigned as follows: 
      Engine:        ODBC 
      Physical Name: db2odbc

Note: Displaying actual data is an essential last step because if you are pointing to an incorrect DB2 ODBC driver (such as libdb2o.so instead of libdb2.so), you can still assign a library successfully, but you won't see any tables.

In SAS Studio I assigned the same library and ran a small query to retrieve some data:

proc print data=db2lib.department_lookup;
    var department_code department_en;
run;

Sample data in SAS Studio

Wrapping Up

I know it's a lot to digest. However, some users may not need to follow all steps to succeed, whereas others starting from scratch may require performing all steps carefully. Whatever your situation might be, I hope you find this guide useful and easy to follow. Did you struggle like me? Share your experience in the comments below or if you have any problems, don't hesitate to contact me.

Connecting SAS to a DB2 database via ODBC without tears was published on SAS Users.

11月 162017
 

As a SAS Viya user, you may be wondering whether it is possible to execute data append and data update concurrently to a global Cloud Analytic Services (CAS) table from two or more CAS sessions. (Learn more about CAS.) How would this impact the report view while data append or data update is running on a global CAS table? These questions are even more important for those using the programming interface to load and update data in CAS. This post discusses data append, data update, and concurrency in CAS.

Two or more CAS sessions can simultaneously submit a data append and data update process to a CAS table, but only one process at a time can run against the same CAS table. The multiple append and update processes execute in serial, one after another, never running in a concurrent fashion. Whichever CAS session is first to acquire the write lock on a global CAS table prevails, appending or updating the data first. The other append and update processes must wait in a queue to acquire the write lock.

During the data append process, the appended data is not available to end users or reports until all rows are inserted and committed into the CAS table. While data append is running, users can still render reports against the CAS table using the original data, but excluding the appended rows.

Similarly, during the data update process, the updated data is not available to users or reports until the update process is complete. However, CAS lets you render reports using the original (non-updated) data, as the CAS table is always available for the read process. During the data update process, CAS makes additional copies into memory of the to-be-updated blocks containing rows in order to perform the update statement. Once the update process is complete, the additional and now obsolete copies of blocks, are removed from CAS. Data updates to a global CAS table is an expensive operation in terms of CPU and memory usage. You have to factor in the additional overhead memory or CAS_CACHE space to support the updates. The space requirement depends on the number of rows being affected by the update process.

At any given time, there could be only one active write process (append/update) against a global CAS table. However, there could be many concurrent active read processes against a global CAS table. A global CAS table is always available for read processes, even when an append or update process is running on the same CAS table.

The following log example describes two simultaneous CAS sessions executing data appends to a CAS table. Both append processes were submitted to CAS with a gap of a few seconds. Notice the execution time for the second CAS session MYSESSION1 is double the time that it took the first CAS session to append the same size of data to the CAS table. This shows that both appends were executing one after another. The amount of memory used and the CAS_CACHE location also shows that both processes were running one after another in a serial fashion.

Log from simultaneous CAS session MYSESSION submitting APPEND

58 proc casutil ;
NOTE: The UUID '1411b6f2-e678-f546-b284-42b6907260e9' is connected using session MYSESSION.
59 load data=mydata.big_prdsale
60 outcaslib="caspath" casout="big_PRDSALE" append ;
NOTE: MYDATA.BIG_PRDSALE was successfully added to the "caspath" caslib as "big_PRDSALE".
61 quit ;
NOTE: PROCEDURE CASUTIL used (Total process time):
real time 49.58 seconds
cpu time 5.05 seconds

Log from simultaneous CAS session MYSESSION1 submitting APPEND

58 proc casutil ;
NOTE: The UUID 'a20a246e-e0cc-da4d-8691-c0ff2a222dfd' is connected using session MYSESSION1.
59 load data=mydata.big_prdsale1
60 outcaslib="caspath" casout="big_PRDSALE" append ;
NOTE: MYDATA.BIG_PRDSALE1 was successfully added to the "caspath" caslib as "big_PRDSALE".
61 quit ;
NOTE: PROCEDURE CASUTIL used (Total process time):
real time 1:30.33
cpu time 4.91 seconds

 

When the data append process from MYSESSION1 was submitted alone (no simultaneous process), the execution time is around the same as for the first session MYSESSION. This also shows that when two simultaneous append processes were submitted against the CAS table, one was waiting for the other to finish. At one time, only one process was running the data APPEND action to the CAS table (no concurrent append).

Log from a lone CAS session MYSESSION1 submitting APPEND

58 proc casutil ;
NOTE: The UUID 'a20a246e-e0cc-da4d-8691-c0ff2a222dfd' is connected using session MYSESSION1.
59 load data=mydata.big_prdsale1
60 outcaslib="caspath" casout="big_PRDSALE" append ;
NOTE: MYDATA.BIG_PRDSALE1 was successfully added to the "caspath" caslib as "big_PRDSALE".
61 quit ;
NOTE: PROCEDURE CASUTIL used (Total process time):
real time 47.63 seconds
cpu time 4.94 seconds

 

The following log example describes two simultaneous CAS sessions submitting data updates on a CAS table. Both update processes were submitted to CAS in a span of a few seconds. Notice the execution time for the second CAS session MYSESSION1 is double the time it took the first session to update the same number of rows. The amount of memory used and the CAS_CACHE location also shows that both processes were running one after another in a serial fashion. While the update process was running, memory and CAS_CACHE space increased, which suggests that the update process makes copies of to-be-updated data rows/blocks. Once the update process is complete, the space usage in memory/CAS_CACHE returned to normal.

When the data UPDATE action from MYSESSION1 was submitted alone (no simultaneous process), the execution time is around the same as for the first CAS session.

Log from a simultaneous CAS session MYSESSION submitting UPDATE

58 proc cas ;
59 table.update /
60 set={
61 {var="actual",value="22222"},
62 {var="country",value="'FRANCE'"}
63 },
64 table={
65 caslib="caspath",
66 name="big_prdsale",
67 where="index in(10,20,30,40,50,60,70,80,90,100 )"
68 }
69 ;
70 quit ;
NOTE: Active Session now MYSESSION.
{tableName=BIG_PRDSALE,rowsUpdated=86400}
NOTE: PROCEDURE CAS used (Total process time):
real time 4:37.68
cpu time 0.05 seconds

 

Log from a simultaneous CAS session MYSESSION1 submitting UPDATE

57 proc cas ;
58 table.update /
59 set={
60 {var="actual",value="22222"},
61 {var="country",value="'FRANCE'"}
62 },
63 table={
64 caslib="caspath",
65 name="big_prdsale",
66 where="index in(110,120,130,140,150,160,170,180,190,1100 )"
67 }
68 ;
69 quit ;
NOTE: Active Session now MYSESSION1.
{tableName=BIG_PRDSALE,rowsUpdated=86400}
NOTE: PROCEDURE CAS used (Total process time):
real time 8:56.38
cpu time 0.09 seconds

 

The following memory usage snapshot from one of the CAS nodes describes the usage of memory before and during the CAS table update. Notice the values for “used” and “buff/cache” columns before and during the CAS table update.

Memory usage on a CAS node before starting a CAS table UPDATE

Memory usage on a CAS node during CAS table UDPATE

Summary

When simultaneous data append and data update requests are submitted against a global CAS table from two or more CAS sessions, they execute in a serial fashion (no concurrent process execution). To execute data updates on a CAS table, you need an additional overhead memory/CAS_CACHE space. While the CAS table is going through the data append or data update process, the CAS table is still accessible to rendering reports.

Concurrent data append and update to a global CAS table was published on SAS Users.

11月 142017
 

SAS Visual Analytics 7.4 has added the support for date parameters. Recall from my first post,  Using parameters in SAS Visual Analytics, a parameter is a variable whose value can be changed at any time by the report viewer and referenced by other report objects. These objects can be a calculated item, aggregated measure, filter, rank or display rule. And remember, every time the parameter is changed, the corresponding objects are updated to reflect that change.

Here is my updated table that lists the supported control objects and parameter types for SAS Visual Analytics 7.4. The type of parameter is required to match the type of data that is assigned to the control.
Notice that SAS Visual Analytics 7.4 has also introduced the support for multiple value selection control objects. I’ll address these in another blog.

Using Date Parameters in your SAS Visual Analytics Reports

Let’s look at an example of a SAS Visual Analytics Report using date parameters. In this fictitious report, we have been given the requirements that the user wants to pick two independent date periods for comparison. This is not the same requirement as filtering the report between a start and end date. This report requirement is such that a report user can pick two independent months in the source data to be able to analyze the change in Expense magnitude for different aggregation levels, such as Region, Product Line and Product.

In this example, we will compare two different Month,Year periods. This could easily be two different Quarter,Year or Week,Year periods; depending on the report requirements, these same steps can be applied.

In this high level breakdown, you can see in red I will create two date parameters from data driven drop-down lists. From these parameter values, I will create two calculated data items, shown in purple, and one aggregated measure that will be used in three different report objects, shown in green.

Here are the steps:

1.     Create the date parameters.

2.     Add the control objects to the report and assign roles.

3.     Create the dependent data items, i.e. the calculated data items and aggregated measure.

4.     Add the remaining report objects to the canvas and assign roles.

Step 1: Create the date parameters

First we will need to create the date parameters that will hold the values made by the report viewers. From the Data Pane, use the drop-down menu and select New Parameter….

Then create your first parameter as shown below. Give it a name.

Next, select minimum and maximum values allowed for this parameter. I used the min and max available in my data source, but you could select a more narrow range if you wanted to restrict the users to only have access to portions of the data, just so long as the values are in your data source since, in this example, we will use the data source to populate the available values in the drop-down list.

Then select a current value, this will serve as the default value that will populate when a user first opens the report.

Finally, select the format in which you want your data item to be formatted. I selected the same format as my underlying data item I will be using to populate the drop-down list.

Notice how your new parameters will now be available from your Data Pane.

Step 2: Add the control objects to the report and assign roles

Next, drag and drop the drop-down list control objects onto the report canvas. In this example, we are not using the Report or Section Prompt areas since I do not want to filter the objects in the report or section automatically. Instead, I am using these prompt values and storing them in a parameter. I will then use those values to create new calculated data items and an aggregated measure.

Once your control objects are in the report canvas, then use the Roles Pane to assign the data items to the roles. As you can see from the screenshot, we are using the Date by Month data item to seed the values of the drop-down list by assigning it to the Category role, this data item is in our data source.

Then we are going to assign our newly created parameters, Period1Parameter and Period2Parameter to the Parameter role. This will allow us to use the value selected in our calculations.

Step 3: Create the dependent data items, i.e. the calculated data items and aggregated measure

Now we are free to use our parameters as we like. In this example, I am prompting the report viewer for two values: Period 1 and Period 2 which are the two periods the user would like compared in this report. So, we will need to create two calculated data times from a single column in our source data. Since we want to display these as columns next to each other in a crosstab object and use them for an aggregated measure, this technique can be used.

Calculated Data Item: Period 1 Expenses

From the Data Pane, use the drop-down menu and select New Calculated Item…. Then use the editor to create this expression: If the Date by Month for this data row equals the parameter value selected for Period 1, then return the Expenses; else return 0.

Calculated Data Item: Period 2 Expenses

Repeat this using the Period2Parameter in the expression.

Aggregated Measure: Period Difference

Next, we want to calculate the difference between the two user selected Period Expenses. To do this, we will need to create an aggregated measure which will evaluate based on the report object’s role assignments. In other words, it will be calculated “on-the-fly” based on the visualization.

Similar to the calculated data items, use the Data Pane and from the drop-down menu select New Aggregated Measure…. Use the editor to create this expression. Notice that we are using our newly created calculated data items we defined using the parameter values. This expression does not use the parameter value directly, but indirectly through the calculated data item.

Step 4: Add the remaining report objects to the canvas and assign roles

No that we have:

  • our Control Objects to capture the user input.,
  • the Parameters to store the values.,
  • and the Calculated Data Items and Aggregated Measure created…

we can add our report objects to the canvas and assign our roles.

You can see I used all three new measures in the crosstab object. I used the aggregated measure in the bar chart and treemap but notice the different aggregation levels. There is even a hierarchy assigned to the treemap category role. This Period Difference aggregated measure calculation is done dynamically and will evaluate for each visualization with its unique role assignments, even while navigating up and down the hierarchy.

Here are some additional screenshots of different period selections.

In this first screenshot you can see the parallel period comparison between December 2010 and 2011.

In these next two screenshots, we are looking at the Thanksgiving Black Friday month of November. We are comparing the two years 2010 and 2011 again. Here we see that the Board Product from the Game Product Line is bright blue indicating an increase in magnitude of Expenses in the most recent period, Nov2011.

By double clicking on Board in the treemap, we are taken to the next level of the hierarchy, Product Description, where we see a the largest magnitude of Expenses is coming from Backgammon and Bob Board Games.

In these final two screenshots we are comparing consecutive periods, November 2011 with December 2011. We can see from the bar chart easily the Region and Product Line where there is the greatest increase in Expenses.

I’ve configured a brush interaction between all three visualizations so that when I select the tallest bar it will highlight the corresponding data values in the crosstab and treemap.

Conclusion

Now you can use date parameters in your Visual Analytics Reports. There are several applications of this feature and this is only one way you can use parameters to drive business intelligence. Using this technique to create columns based on a user selected value is great when you need to compare values when your source data isn’t structured in this manner.

Using Date Parameters in your SAS Visual Analytics Reports was published on SAS Users.

11月 072017
 

SAS Tips and TricksThere is certainly no shortage of terrific tips and tricks in various SAS blogs from some of our most distinguished SAS in-house experts. But, there's another group of equally qualified experts who don't often get to share their expertise on this channel: our customers. So, I went on a quest to get the inside scoop from various SAS users, polling Friends of SAS members to get their feedback on their favorite SAS tips.

We asked a few of these Friends of SAS members who are regular SAS users to share with us their top SAS tips and tricks for improving performance or something they wished they had known earlier in their SAS career. Based on that, we got a wide range of tips and tricks from a number of different SAS users – ranging from novice to expert and across various industries and product users. Check out some of them below:

FUNCTIONS

Functions are either built into SAS itself or you can write your own customized code that act in the same manner, all of which help in analyzing and processing data. There are a variety of function categories that include mathematical, date and time, character, truncation, and miscellaneous. Using functions makes us more efficient, and we don’t have to re-invent the wheel every time we want to figure something out. With this being said, some of our regular SAS users have a thing or two to say about dealing with functions that may help you out:

“Before you program any complex code, look for a SAS function that will do the task for you.”
     - John Ladds, Past President, OASUS

“Insert a line break in a concatenated string, such as: manylines = catx('0a'x,a,b,c);”
     - Aroop Ghosh, Principal Consultant, Webtalk Communications

“Use the lag function to create time related variables, for example, in time punch data”
     - Yolanda, Analyst, TD

“A good trick that I have recently learnt [sic]which can make the code less wordier is using the functions IFN and IFC as an alternative to IF THEN ELSE statements in conditional processing.”
     - Sunny Giroti, Master of Business Analytics Candidate, Schulich School of Business

“IFN can be used in place of IF THEN ELSE to shorten code”
     - Neil Menezes, Senior Business Anlyst, CTFS

“Ron Cody’s link from SAS.COM. It has many SAS function examples.”
     - John Lam, CIBC

SYNTAX/SHORTCUTS/EFFICIENCIES

You know what they say: time is money. So for a SAS programmer, finding shortcuts and ways to work more efficiently and faster are important to get a job done quicker. Here are a few ways SAS users think can make your life easy while working with SAS:

“Use missover to ensure no records are skipped when reading in a file”
     - Scott Bellefeuille, IT Solutions Developer (Merchant Services), TD Bank

“Pressing keys 'Ctrl'+'/' to comment out a line of code.”
     - Bunce Leung, Execution Manager, RBC

“Variable Lists - being able to refer to variables using double dashes to indicate all variables between first and last in a dataset is super useful for many procs. The later versions of being able to use the prefix and colon to indicate all datasets with a prefix is a great shortcut as well.”
     - Fareeza Khurshed, Manager (Statistical Services), Alberta Treasury Board and Finance

“I like to use PERL in SAS for finding stuff in character variables.”
     - Peter Timusk, Statistics Officer, Statistics Canada

“Title "SAS can give you an Inheritance". Have an ODBC driver on your local PC but not on a remote server? No problem. Use rsubmit with the inheritlib option. Your remote server will now inherit the ODBC driver and be able to access a database you thought you could only reach with your PC.”
     - Horst Wolter, Manager, TD Bank

“If you want to speed the processing of your program. Run your join statements on the "work" library. It is must faster.”
     - Estela Tavares, Economist, Statistics Canada

“When dealing with probability, can logistic be used in all cases? Trick Q - as A is N0. What about the times, probability is 0 and 1. What if the data is heavily distributed on 1s and 0s.”
     - Mukul Pandey, Student Business Analytics, Schulich School of Business

“Proc tabulate can perform descriptive statistics better than proc freq and proc means.”
     - Taha Azizi, Senior Business Insight Analyst, TD

Your turn

Were any of these tips and tricks useful? Do you use them already? What are some of your top SAS tips and tricks? Please be sure to share in the comments below!

Looking for more tips and tricks? Check out this video featuring six Canadian SAS programmers, including a few Friends of SAS members, who share some of their favourite SAS programming tips.

About Friends of SAS

If you’re not familiar with Friends of SAS, it is an exclusive online community available only to our Canadian SAS customers and partners to recognize and show our appreciation for their affinity to SAS. Members complete activities called 'challenges' and earn points that can be redeemed for rewards. There are opportunities to build powerful connections, gain privileged access to SAS resources and events, and boost your learning and development of SAS all in a fun environment.

Interested in learning more about Friends of SAS? Feel free to email myself at Natasha.Ulanowski@sas.com or Martha.Casanova@sas.com with any questions or more details.

SAS tips and tricks: Users-tell-all edition was published on SAS Users.

11月 042017
 

Internet of Things for dementiaDementia describes different brain disorders that trigger a loss of brain function. These conditions are all usually progressive and eventually severe. Alzheimer's disease is the most common type of dementia, affecting 62 percent of those diagnosed. Other types of dementia include; vascular dementia affecting 17 percent of those diagnosed, mixed dementia affecting 10 percent of those diagnosed.

Dementia Statistics

There are 850,000 people with dementia in the UK, with numbers set to rise to over 1 million by 2025. This will soar to 2 million by 2051. 225,000 will develop dementia this year, that’s one every three minutes. 1 in 6 people over the age of 80 have dementia. 70 percent of people in care homes have dementia or severe memory problems. There are over 40,000 people under 65 with dementia in the UK. More than 25,000 people from black, Asian and minority ethnic groups in the UK are affected.

Cost of treating dementia

Two-thirds of the cost of dementia is paid by people with dementia and their families. Unpaid careers supporting someone with dementia save the economy £11 billion a year. Dementia is one of the main causes of disability later in life, ahead of cancer, cardiovascular disease and stroke (Statistic can be obtained from here - Alzheimer’s society webpage). To tackle dementia requires a lot of resources and support from the UK government which is battling to find funding to support NHS (National Health Service). During 2010–11, the NHS will need to contribute £2.3bn ($3.8bn) of the £5bn of public sector efficiency savings, where the highest savings are expected primarily from PCTs (Primary care trusts). In anticipation for tough times ahead, it is in the interest of PCTs to obtain evidence-based knowledge of the use of their services (e.g. accident & emergency, inpatients, outpatients, etc.) based on regions and patient groups in order to reduce the inequalities in health outcomes, improve matching of supply and demand, and most importantly reduce costs generated by its various services.

Currently in the UK, general practice (GP) doctors deliver primary care services by providing treatment and drug prescriptions and where necessary patients are referred to specialists, such as for outpatient care, which is provided by local hospitals (or specialised clinics). However, general practitioners (GPs) are limited in terms of size, resources, and the availability of the complete spectrum of care within the local community.

Solution in sight for dementia patients

There is the need to prevent or avoid delay for costly long-term care of dementia patients in nursing homes. Using wearables, monitors, sensors and other devices, NHS based in Surrey is collaborating with research centres to generate ‘Internet of Things’ data to monitor the health of dementia patients at the comfort of staying at home. The information from these devices will help people take more control over their own health and wellbeing, with the insights and alerts enabling health and social care staff to deliver more responsive and effective services. (More project details can be obtained here.)

Particle filtering

One method that could be used to analyse the IoT generated data is particle filtering methods. IoT dataset naturally fails within Bayesian framework. This method is very robust and account for the combination of historical and real-time data in order to make better decision.

In Bayesian statistics, we often have a prior knowledge or information of the phenomenon/application been modelled. This allows us to formulate a Bayesian model, that is, prior distribution, for the unknown quantities and likelihood functions relating these quantities to the observations (Doucet et al., 2008). As new evidence becomes available, we are often interested in updating our current knowledge or posterior distribution. Using State Space Model (SSM), we are able to apply Bayesian methods to time series dataset. The strength of these methods however lies in the ability to properly define our SSM parameters appropriately; otherwise our model will perform poorly. Many SSMs suffers from non-linearity and non-Gaussian assumptions making the maximum likelihood difficult to obtain when using standard methods. The classical inference methods for nonlinear dynamic systems are the extended Kalman filter (EKF) which is based on linearization of a state and trajectories (e.g. Johnson et al., 2008 ). The EKF have been successfully applied to many non-linear filtering problems. However, the EKF is known to fail if the system exhibits substantial nonlinearity and/or if the state and the measurement noise are significantly non-Gaussian.

An alternative method which gives a good approximation even when the posterior distribution is non-Gaussian is a simulation based method called Monte Carlo. This method is based upon drawing observations from the distribution of the variable of interest and simply calculating the empirical estimate of the expectation.

To apply these methods to time series data where observation arrives in sequential order, performing inference on-line becomes imperative hence the general term sequential Monte Carlo (SMC). A SMC method encompasses range of algorithms which are used for approximate filtering and smoothing. Among this method is particle filtering. In most literature, it has become a general tradition to present particle filtering as SMC, however, it is very important to note this distinction. Particle filtering is simply a simulation based algorithm used to approximate complicated posterior distributions. It combines sequential importance sampling (SIS) with an addition resampling step. SMC methods are very flexible, easy to implement, parallelizable and applicable in very general settings. The advent of cheap and formidable computational power in conjunction with some recent developments in applied statistics, engineering and probability, have stimulated many advancements in this field (Cappe et al. 2007). Computational simplicity in the form of not having to store all the data is also an additional advantage of SMC over MCMC (Markov Chain Monte Carlo).

References

[1] National Health Service England, http://www.nhs.uk/NHSEngland/aboutnhs/Pages/Authoritiesandtrusts.aspx (accessed 18 August 2009).

[2] Pincus SM. Approximate entropy as a measure of system complexity. Proc Natl Acad Sci USA 88: 2297–2301, 1991

[3] Pincus SM and Goldberger AL. Physiological time-series analysis: what does regularity quantify?  Am J Physiol Heart Circ Physiol 266: H1643–H1656, 1994

[4] Cappe, O.,S. Godsill, E. Moulines(2007).An overview of existing methods and recent advances in sequential Monte Carlo. Proceedings of the IEEE. Volume 95, No 5, pp 899-924.

[5] Doucet, A., A. M. Johansen(2008). A Tutorial on Particle Filtering and Smoothing: Fifteen years Later.

[6] Johansen, A. M. (2009).SMCTC: Sequential Monte Carlo in C++. Journal of Statistical Software. Volume 30, issue 6.

[7] Rasmussen and Z.Ghahramani (2003). Bayesian Monte Carlo. In S. Becker and K. Obermayer, editors, Advances in Neural Information Processing Systems, volume 15.

[8] Osborne A. M.,Duvenaud D., GarnettR.,Rasmussen, C.E., Roberts, C.E.,Ghahramani, Z. Active Learning of Model Evidence Using Bayesian Quadrature.

Scaling Internet of Things for dementia using Particle filters was published on SAS Users.

11月 042017
 

Internet of Things for dementiaDementia describes different brain disorders that trigger a loss of brain function. These conditions are all usually progressive and eventually severe. Alzheimer's disease is the most common type of dementia, affecting 62 percent of those diagnosed. Other types of dementia include; vascular dementia affecting 17 percent of those diagnosed, mixed dementia affecting 10 percent of those diagnosed.

Dementia Statistics

There are 850,000 people with dementia in the UK, with numbers set to rise to over 1 million by 2025. This will soar to 2 million by 2051. 225,000 will develop dementia this year, that’s one every three minutes. 1 in 6 people over the age of 80 have dementia. 70 percent of people in care homes have dementia or severe memory problems. There are over 40,000 people under 65 with dementia in the UK. More than 25,000 people from black, Asian and minority ethnic groups in the UK are affected.

Cost of treating dementia

Two-thirds of the cost of dementia is paid by people with dementia and their families. Unpaid careers supporting someone with dementia save the economy £11 billion a year. Dementia is one of the main causes of disability later in life, ahead of cancer, cardiovascular disease and stroke (Statistic can be obtained from here - Alzheimer’s society webpage). To tackle dementia requires a lot of resources and support from the UK government which is battling to find funding to support NHS (National Health Service). During 2010–11, the NHS will need to contribute £2.3bn ($3.8bn) of the £5bn of public sector efficiency savings, where the highest savings are expected primarily from PCTs (Primary care trusts). In anticipation for tough times ahead, it is in the interest of PCTs to obtain evidence-based knowledge of the use of their services (e.g. accident & emergency, inpatients, outpatients, etc.) based on regions and patient groups in order to reduce the inequalities in health outcomes, improve matching of supply and demand, and most importantly reduce costs generated by its various services.

Currently in the UK, general practice (GP) doctors deliver primary care services by providing treatment and drug prescriptions and where necessary patients are referred to specialists, such as for outpatient care, which is provided by local hospitals (or specialised clinics). However, general practitioners (GPs) are limited in terms of size, resources, and the availability of the complete spectrum of care within the local community.

Solution in sight for dementia patients

There is the need to prevent or avoid delay for costly long-term care of dementia patients in nursing homes. Using wearables, monitors, sensors and other devices, NHS based in Surrey is collaborating with research centres to generate ‘Internet of Things’ data to monitor the health of dementia patients at the comfort of staying at home. The information from these devices will help people take more control over their own health and wellbeing, with the insights and alerts enabling health and social care staff to deliver more responsive and effective services. (More project details can be obtained here.)

Particle filtering

One method that could be used to analyse the IoT generated data is particle filtering methods. IoT dataset naturally fails within Bayesian framework. This method is very robust and account for the combination of historical and real-time data in order to make better decision.

In Bayesian statistics, we often have a prior knowledge or information of the phenomenon/application been modelled. This allows us to formulate a Bayesian model, that is, prior distribution, for the unknown quantities and likelihood functions relating these quantities to the observations (Doucet et al., 2008). As new evidence becomes available, we are often interested in updating our current knowledge or posterior distribution. Using State Space Model (SSM), we are able to apply Bayesian methods to time series dataset. The strength of these methods however lies in the ability to properly define our SSM parameters appropriately; otherwise our model will perform poorly. Many SSMs suffers from non-linearity and non-Gaussian assumptions making the maximum likelihood difficult to obtain when using standard methods. The classical inference methods for nonlinear dynamic systems are the extended Kalman filter (EKF) which is based on linearization of a state and trajectories (e.g. Johnson et al., 2008 ). The EKF have been successfully applied to many non-linear filtering problems. However, the EKF is known to fail if the system exhibits substantial nonlinearity and/or if the state and the measurement noise are significantly non-Gaussian.

An alternative method which gives a good approximation even when the posterior distribution is non-Gaussian is a simulation based method called Monte Carlo. This method is based upon drawing observations from the distribution of the variable of interest and simply calculating the empirical estimate of the expectation.

To apply these methods to time series data where observation arrives in sequential order, performing inference on-line becomes imperative hence the general term sequential Monte Carlo (SMC). A SMC method encompasses range of algorithms which are used for approximate filtering and smoothing. Among this method is particle filtering. In most literature, it has become a general tradition to present particle filtering as SMC, however, it is very important to note this distinction. Particle filtering is simply a simulation based algorithm used to approximate complicated posterior distributions. It combines sequential importance sampling (SIS) with an addition resampling step. SMC methods are very flexible, easy to implement, parallelizable and applicable in very general settings. The advent of cheap and formidable computational power in conjunction with some recent developments in applied statistics, engineering and probability, have stimulated many advancements in this field (Cappe et al. 2007). Computational simplicity in the form of not having to store all the data is also an additional advantage of SMC over MCMC (Markov Chain Monte Carlo).

References

[1] National Health Service England, http://www.nhs.uk/NHSEngland/aboutnhs/Pages/Authoritiesandtrusts.aspx (accessed 18 August 2009).

[2] Pincus SM. Approximate entropy as a measure of system complexity. Proc Natl Acad Sci USA 88: 2297–2301, 1991

[3] Pincus SM and Goldberger AL. Physiological time-series analysis: what does regularity quantify?  Am J Physiol Heart Circ Physiol 266: H1643–H1656, 1994

[4] Cappe, O.,S. Godsill, E. Moulines(2007).An overview of existing methods and recent advances in sequential Monte Carlo. Proceedings of the IEEE. Volume 95, No 5, pp 899-924.

[5] Doucet, A., A. M. Johansen(2008). A Tutorial on Particle Filtering and Smoothing: Fifteen years Later.

[6] Johansen, A. M. (2009).SMCTC: Sequential Monte Carlo in C++. Journal of Statistical Software. Volume 30, issue 6.

[7] Rasmussen and Z.Ghahramani (2003). Bayesian Monte Carlo. In S. Becker and K. Obermayer, editors, Advances in Neural Information Processing Systems, volume 15.

[8] Osborne A. M.,Duvenaud D., GarnettR.,Rasmussen, C.E., Roberts, C.E.,Ghahramani, Z. Active Learning of Model Evidence Using Bayesian Quadrature.

Scaling Internet of Things for dementia using Particle filters was published on SAS Users.