sas programming

Peek at your data using VBScript, OLE DB, and the SAS local data provider

 ole db, sas programming, SAS tips  Peek at your data using VBScript, OLE DB, and the SAS local data provider已关闭评论
11月 042010
 
Let's pretend for a moment that you don't have SAS installed on all of your office computers. But you've got some great content locked away inside SAS data sets. Is there a way to get to the content of that data, without having to push the data through a SAS tier?

There is a way! All that you need is the SAS OLE DB Local Data Provider and some programming know-how. You can download the provider from SAS. And the know-how? The basic recipes are in the SAS 9.2 Providers for OLE DB Cookbook. (There's a version for SAS 9.1 as well.)

Note: there are limitations when using the Local Data Provider. Because there is no SAS session in this mix, you don't get the benefit of SAS data services. For example, it can't handle your user-defined formats. You might sacrifice precision for some numeric values. You don't have the full cross-platform support that SAS provides.

Here is an example script that verifies that the provider is installed, opens a data set file, and reports on the row count and column names/types. It's a very simple example that doesn't get very fancy and doesn't include proper error checking, but I hope that it shows you the possibilities.

To run the example on your Windows machine:

  1. Copy the program into a local file (for example, c:\datatest\ShowColumns.vbs).
  2. Change the filename and path variables in the script to point to data files on your machine.
  3. Open a command prompt window.
  4. Run the example program with cscript.exe. For example:

    cscript.exe c:\datatest\ShowColumns.vbs
     
    Note that this can work with the 32-bit or 64-bit versions of the SAS OLE DB providers. If you are on a 64-bit machine and you want to use the 32-bit provider, be sure to run the 32-bit version of cscript.exe:

    %windir%\SysWow64\cscript.exe c:\datatest\ShowColumns.vbs
     
Here's the program:

' Change these to report on your data set
' Change path to the directory where your data set resides
' Change filename to the root name of the sas7bdat file (no extension)
path = "C:\Program Files\SAS\EnterpriseGuide\4.3\Sample\Data"
filename = "Candy_Sales_Summary"

' Check registry for SAS Local Provider
Set WSHShell = CreateObject("WScript.Shell")
clsID = WSHShell.RegRead("HKCR\SAS.LocalProvider\CLSID\")
WScript.Echo "DIAGNOSTICS: SAS.LocalProvider CLSID is " & clsID
inProcServer = WSHShell.RegRead("HKCR\CLSID\"  & clsID & "\InprocServer32\")
WScript.Echo "DIAGNOSTICS: Registered InprocServer32 DLL is " & inProcServer

' Constants for ADO calls
Const adOpenDynamic = 2
Const adLockOptimistic = 3
Const adCmdTableDirect = 512

' Instantiate the provider object
Set obConnection = CreateObject("ADODB.Connection")
Set obRecordset = CreateObject("ADODB.Recordset")

obConnection.Provider = "SAS.LocalProvider"
obConnection.Properties("Data Source") = path
obConnection.Open
obRecordset.Open filename, obConnection, adOpenDynamic, adLockOptimistic, adCmdTableDirect

 'Report on Fields in this data set
WScript.Echo ""
 WScript.Echo "Opened data " & filename & ", Record count: " & obRecordset.RecordCount
 For Each Field In obRecordset.Fields
   If Field.Type = 5 Then pType = "Numeric"
   If Field.Type = 200 Then pType = "Character"
   WScript.Echo Field.Name & " " & pType
 Next

obRecordset.Close
obConnection.Close
 
Output from my example:

Microsoft (R) Windows Script Host Version 5.8
Copyright (C) Microsoft Corporation. All rights reserved.

DIAGNOSTICS: SAS.LocalProvider CLSID is {0221264D-F909-4872-B7E6-B108D3FC8E8B}
DIAGNOSTICS: Registered InprocServer32 DLL is C:\Program Files\SAS\SharedFiles\SASProvidersForOLEDB\sasafloc0902.dll

Opened data Candy_Sales_Summary, Record count: 15000
Name Character
Region Character
OrderID Numeric
ProdID Numeric
Customer Numeric
Type Character
Product Character
Category Character
Subcategory Character
Retail_Price Numeric
Units Numeric
Discount Character
Date Numeric
Fiscal_Year Character
Fiscal_Quarter Character
Fiscal_Month_Num Numeric
Sale_Amount Numeric
 
10月 232010
 
The question came up on the SAS Enterprise Guide discussion forum: which do you prefer, List Report Wizard (PROC REPORT) or Summary Tables (PROC TABULATE)? And as with most SAS-related questions, the proper response is: "it depends." If you put these two PROCs in the ring with a Google Fight, PROC REPORT is a clear winner. Cynthia@SAS delivered a very thoughtful answer. She first carefully separated out the "point-and-click" versus "code-it-yourself" concerns, highlighting the capabilities and limitations of the tasks within SAS Enterprise Guide. Then, for the people who want follow a programming approach, Cynthia offers these insights:
But for the beginner, I think the biggest issue that I've heard about (and it surprised me) was whether they wanted the box area in the upper left hand corner of the table. For some folks, it doesn't matter that the BOX is there -- but I've had some students switch to PROC REPORT just because they didn't like the big empty box -- seriously -- the BOX area was a deal breaker for them -- they wanted the look that they got with PROC REPORT. The second issue that I've heard about is changing the code -- some people find the PROC REPORT code easier to understand than the TABULATE code -- if and only if they have to change the code. The TABULATE code is more abbreviated and "algebra-like" with all the * and = in the TABLE statement. Even though the REPORT code is more verbose -- the existence of a COLUMN statement makes it clear which variable will come first on the report row, which second, which third, etc, etc. And then the explicit DEFINE statement makes it easy to see what statistic will get used for a numeric variable, the usage for a category variable (ORDER, GROUP or ACROSS).
So, as you can see, sometimes it actually does come down to a "BOXing" match. <groan...sorry>
10月 092010
 
Yesterday, I was in the #raganSAS audience as David Pogue told me What's New and What's Next in the world of technology. David is a great presenter, and he really had the audience engaged as he talked about augmented reality, his world according to Twitter, and an iPhone app that comes pretty close to teaching the world to sing in perfect harmony (plus a cheater app that helps the world to sing like T. Pain).

On the world-harmony-for-profit theme, he shared information about web sites such as Kiva.org that facilitate microfinancing around the world. There are other microfinance sites that help people closer to home (for us in the USA), but as Pogue said, only Kiva.org can give you that "rosy glow" when you know you're helping people in developing countries.

Kiva.org opens financial doors for people who might not have another source of funding; but it also presents a platform rich in data for analysis and reporting. The folks at Kiva.org support web services that allow you to build applications that reference the data that they collect. They also offer "data snapshots": downloadable versions of all of the data they have on the loans, loan recipients, and the lenders who participate.

If you could get this data into SAS, what insights could you glean? What cool stats could you produce? What stories could you tell with charts and plots?

So, now we come to your homework assignment...if you choose to accept it. I've already done the grunt work of writing a SAS program that transforms the raw data (from its XML format) into SAS data sets. I've even written a sample step that produces a simple chart based on the current data.

My plot with SGPANEL

What can you do with this data using SAS? There are two data sets: lenders (over 400,000 records) and loans (over 165,000 records). They contain columns relating to geography (location of lenders and loan recipients), quantity (how many loans, what amounts), categories (loan purpose/industry, gender of recipient), and time (when the loan was granted/funded). You can read about the data on Kiva.org, and then create interesting reports using SAS.

Bonus assignment: can you improve my SAS program that pulls the data into SAS? I promise you: there is lots of room for optimization. (If I held off of this post until I perfected it, it would be ready for World Statistics Day 2011.) My implementation uses the XML libname engine, DATA step, and PROC SQL. It could be more automated (download the zip file with FILENAME URL, extract and process) and more efficient (faster appends, perhaps joining and summarizing for easier analysis). The program encounters a few errors when it runs, probably due to character encoding in the XML data. What would you do differently?

Here's how you can get started:

  • Download my SAS program and XML map files from this ZIP file here (small, just about 3K).
  • Extract the ZIP file to a new folder that your SAS session can access as the Kiva "root" folder (example: "C:\public\Kiva" or "/u/userid/Kiva").
  • Download the data snapshot from Kiva.org (big, about 150MB ZIP file). You need the XML format (not the JSON format).
  • Extract the data snapshot files into your Kiva "root" folder.
  • Modify my kivaProgram.sas file to set the Kiva data root folder, and set the number of loan XML files and lender XML files (as described in the comments in the program).

(By the way, I wrote this program entirely using SAS Enterprise Guide 4.3. So I know that you can run it from there, or within whatever SAS 9.2 environment you have access to.)

What better way to celebrate World Statistics Day than to compute some statistics for the world? Post your experiences back here in the comments, or use sasCommunity.org to share more details and post the link.

What to do IN BETWEEN dates

 data step, in operator, sas programming, SAS tips  What to do IN BETWEEN dates已关闭评论
9月 302010
 
I saw a suggestion arrive from a SAS customer who would like to see the IN operator extended to allow ranges of date values. For example, you can currently write a program that checks for values IN a collection or range of numbers:

data check;
  if x in (1:10) then result="match";
run;
 
This matches on the set of numbers 1 through 10, inclusive. The customer would like to see something like this supported, to match on the dates that fall within a given range.

data check;
   if x in ('01JAN2010'd : '01FEB2010'd) then result="match";
run;
 
It's a great suggestion. But in the meantime, you can satisfy this simple example by using the shorthand for the AND operator:

data check;
  if '01JAN2010'd <= x <='01FEB2010'd then result="match";
run;
 
There are many other ways to "skin this cat", including using PROC SQL and the BETWEEN condition. What's your favorite?