data sets

February 20, 2013
 

So let’s get down to it: what exactly do we mean by data in SAS? We hear all about “big data”, tiny data, datasets, metadata, OLAP, and relational data, but what do I really need to know when it comes to SAS?

If we dig deep into the annals of SAS history, we have the SAS “dataset”. Then came the need to read data from source systems (remember VSAM?), and, of course, SAS’ ability to write to lots and lots of target systems such as files (Excel, CSV, delimited), databases (Oracle, DB2, Teradata) and unusual places (FTP, SMTP, WebDAV). For an experienced perspective on some of these unusual data sources, take a look at a paper by Steven First who takes us down memory lane.

I promised in my last post to start with an overview of how SAS “eats” data and what SAS cares about when processing data. So let’s start with what we mean by data.

How data is represented in SAS

Look at any spreadsheet and you’ll begin to understand the basics. We have rows and columns, and on the columns, we have additional information such as how the column is formatted (for example, date versus dollar versus lots of decimal points). Moving even higher, we have knowledge about the spreadsheet such as when it was created, how many records (or rows) it has, when it was last updated and by whom. We may also realize that there are lots of spreadsheets in our workbook. This simplistic description of a spreadsheet translates roughly to what we see in SAS. Here are the correspondences, working our way from the Excel file down to the details:

  • The Excel workbook (the file) maps to the SAS library.
  • An individual spreadsheet maps to a SAS dataset (table).
  • Rows map to observations (records); columns map to variables.
  • Column formatting maps to SAS formats and the other variable attributes (name, type, length, label).
  • Workbook properties (created date, last updated, record count) map to dataset attributes.

The spreadsheet analogy breaks down a bit when we deal with more complex structures like relationships (integrity constraints and foreign keys), dataset auditing, hierarchies and alternative SAS data structures like multidimensional databases (OLAP), Scalable Performance Data Server (SPDS) and its baby brother, Scalable Performance Data Engine (SPDE), generation data groups (collection of generation data sets), indexes and compression. But for now, we will keep it simple.

Referencing data

In SAS, when we read or write data, we generally do so by referencing the directory (through a LIBNAME statement) or the file directly (FILENAME statement). As we highlighted in the summary above, the library is a collection of tables that exist in a directory on our system. In UNIX/Linux this library might look like:
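Here is a hypothetical listing (the file names are invented for illustration; each SAS table is a .sas7bdat file):

$ ls /userdata/project/mydata
demographics.sas7bdat
postassessments.sas7bdat
preassessments.sas7bdat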

Similarly, we can see the same library on Windows.

And finally, we see the same directory through SAS Enterprise Guide.

Since Version 9, SAS allows us to read and write files created on multiple operating systems through the use of something they call Cross-Environment Data Access, or CEDA. This feature makes data accessible when you move it from one system to another. If you require access to data that is not supported by CEDA, then you’ll have to move it around using some of the SAS utilities like one of these procedures: COPY, UPLOAD, DOWNLOAD, CPORT, CIMPORT or MIGRATE. For a nice summary of common migration questions, take a look at Diane Olson’s paper.
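As a quick illustration, moving a library between hosts with CPORT and CIMPORT looks roughly like this (the librefs and paths are invented for the example):

/* on the source system: write a transport file */
LIBNAME mydata "/userdata/project/mydata";
FILENAME tranfile "/tmp/mydata.cpo";
PROC CPORT LIBRARY=mydata FILE=tranfile;
RUN;

/* on the target system: read the transport file back in */
LIBNAME newlib "/userdata/project/imported";
FILENAME tranfile "/tmp/mydata.cpo";
PROC CIMPORT LIBRARY=newlib INFILE=tranfile;
RUN;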

But I digress. So let’s go back to our screenshots above.

Assigning libnames and engines

If we want to reference this location, we assign a library reference (called LIBNAME) so SAS knows where this is and what engine it is going to use to read the file. The concept of an engine is pretty important in SAS as this entity provides the set of instructions to SAS for what we need to do in order to properly read (or write) the data. Here are a few examples of LIBNAME statements using various engines:
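These are representative statements (the paths and connection details are invented for illustration):

LIBNAME mydata BASE "/userdata/project/mydata";          /* explicit BASE engine */
LIBNAME archive V6 "/userdata/project/archive";          /* read older V6 data sets */
LIBNAME results EXCEL "/userdata/project/results.xls";   /* Excel workbook */
LIBNAME dw ORACLE user=scott password=tiger path=orapath schema=sales;   /* Oracle */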

Note in the first case, we have an explicit engine called BASE that tells SAS that this is a SAS library. The engine is smart enough to figure out which version of SAS to use based on the files contained in the directory. If you omit an engine, SAS assumes that the BASE engine is being used.

When we issue the LIBNAME statement, SAS can then address tables in this library by name. All references to SAS tables use what is referred to as a two-level name. For example, if I want to reference the table postassessments in the directory above, I prefix the table name with the library where it resides: mydata.postassessments

Most procedures in SAS (known as PROCs) use this form to reference the data being processed. If I wanted to print the data, for example, I would issue the following statements:

PROC PRINT data=mydata.postassessments;
run;

The difference between data and metadata

So this gets us to the point where we know how to reference data so that we can process it accordingly. In SAS Foundation (which includes the core SAS products such as BASE, STAT, GRAPH and so on), we write programs that utilize this library (LIBNAME) reference.

Up to this point, we have made the assumption that we will explicitly reference data in our program. Similarly, we make the assumption that whoever knows this path (or details in the LIBNAME statement) can use those same directives to issue their own LIBNAME statement.

Another huge assumption that we are making is that the data residing beneath that LIBNAME reference is accessible to the user running the program. That is, when I submit the following LIBNAME statement, I need to be able, at a minimum, to read the directory and the physical file that is being referenced:

LIBNAME mydata BASE "/userdata/project/mydata";

This is one of the fundamental propositions of SAS data security: the physical operating system security takes precedence over SAS security that you may layer on top of the file. SAS mechanisms for security include password protection on the dataset, encryption, application level security and metadata security.
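For example, a dataset password and encryption are applied with simple data set options (a quick sketch; the password value is a placeholder):

/* anyone opening this table must now supply PW= */
DATA mydata.scores (ENCRYPT=YES PW=secret1);
   SET mydata.postassessments;
RUN;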

I promised in my last post that I wouldn’t step into the fray of metadata, but I lied. Next time, I will explore the concept of data as seen through the eyes of modern SAS clients such as SAS Enterprise Guide and Web Report Studio. There we will examine the main differences between accessing data directly (as we saw above with the LIBNAME statement) and accessing it through the metadata server. Soon thereafter, we will dive deep into external databases and how SAS processes data using various database engines.

Until then, remember – “Happy Data = Happy SAS = Happy Users = Happy Admin”! See, it really is all about the data!

--greg

tags: data sets, SAS Administrators
January 21, 2013
 

Sometimes I need to "disassemble" a SAS data set into a DATA step program. It's kind of like creating a "freeze-dried" version of the data that you can carry around and use anywhere, re-hydrating it in the SAS session where you next need it.

Some example uses for this:

  • Build standalone program examples that include the data "inline", so that you don't have to supply a SAS7BDAT file and a SAS program file together to make it work.
  • Create a portable version of the data that you can carry from one SAS environment to another and use easily, without having to worry about platform encoding differences. (Of course, you can also use PROC MIGRATE for this job.)
  • Capture the structure of the data, with the columns, formats, labels, and lengths -- even if you don't need all of the records.

There are various programming approaches that you can use to accomplish this in SAS, but to make it easier I built a custom task called Data set->DATA step. (The task runs in SAS Enterprise Guide 4.3 and later.) This task reads the contents of the active data set, generates a DATA step program with the appropriate ATTRIB statements for each column, and then includes all of the data records as DATALINES.
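The generated program has this general shape (a hand-written sketch of the pattern, not actual task output):

data work.class;
   infile datalines dsd;
   /* ATTRIB locks in each column's name, length, format, and label */
   attrib Name length=$8 label='Name';
   attrib Sex  length=$1 label='Sex';
   attrib Age  length=8  format=best12.;
   input Name :$8. Sex :$1. Age;
datalines;
Alfred,M,14
Alice,F,13
;
run;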

This screen shows the task user interface, which gives you a preview of the DATA step setup. If you want just the data structure portion of the program, you can click Copy to clipboard and thus capture just that piece.


When you click Run, the task builds a program that contains the DATA step along with all of the DATALINES records. It doesn't actually run the program, since that would only recreate a copy of the data set that you already have, but it does produce a short summary log for output. And the contents of the Code tab in the results will contain the complete DATA step program.


(If you're wondering how I'm "splitting" the code view in this picture, check out this blog post about an editor trick.)

You can download the task from the list of custom task examples that accompany my recent book. The "Data set->DATA step" example is featured in "Chapter 12: Abracadabra: Turn Your Data into a SAS Program". You are not required to own the book in order to use the examples (but I promise that you can learn a lot from the book).

If you're interested in how this task does its work, here are the techniques used in the source code and described within the chapter:

  • Use ADO.NET to connect to and read SAS data. (ADO.NET is the method for working with data sources in .NET programs.)
  • Use SAS DICTIONARY tables to discover data set and column attributes for the source data. Then, use that information to influence task behavior (see the sketch just after this list).
  • Use the SASTextEditorCtl control from the SAS.Tasks.Toolkit library to preview the SAS program in the SAS color-coded Program Editor.
  • Encapsulate the meat of the task -- the business logic that reads data and creates a SAS program -- into a separate .NET class. This makes the task more maintainable and enables you to reuse the business logic in other contexts.
  • Use a special task API interface, named ISASTaskExecution, to implement the work of the task. This enables SAS Enterprise Guide to delegate all task processing to your task so that it can perform work that can't be done easily in a SAS program.
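For the DICTIONARY tables piece, the attributes the task discovers are the same ones you can query yourself in SAS (a minimal example against SASHELP.CLASS):

proc sql;
   select name, type, length, format, label
      from dictionary.columns
      where libname='SASHELP' and memname='CLASS';
quit;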

tags: data sets, data step, SAS custom tasks
August 15, 2012
 

The project that I'm currently working on requires several input data tables, and those tables must have a specific schema. That is, each input table must contain columns of a specific name, type, and length in order for the rest of the system to function correctly. The schema requirements aren't too fancy; there are no specs for keys, indexes, or constraints. It's simply "your data sets need columns that look like this".

Even though the schema is "published", the data tables themselves will be prepared by humans using Who Knows What Processes. Therefore, we must adopt a "trust but verify" approach before accepting the data and turning the crank on what could be a long-running analysis.

I could "eyeball" the differences by capturing the data column attributes from each table and comparing them with what I know I need, but that approach doesn't scale well.

I wrote a simple SAS macro that I use to create a report that shows whether the input data passes a simple schema "sniff test". It compares a "candidate" table against a table that has a known "good" column layout, and reports on the substantive differences.

DOWNLOAD: %check_reference_data from here.

The output of the macro is another data set that contains one record per column-pair, showing a list of potential problems with the input data, ranked by severity. The severity assignments are a bit arbitrary, but effective enough for my project:

  • "SEV 1" is assigned if a required column is missing, or is present but has a different type (numeric vs. character).
  • "SEV-2" is assigned if a required column has a greater-than-expected length, as it might lead to truncation of data values downstream.
  • "SEV-3" is assigned if the column has a different SAS format, as it could indicate possible misinterpretation (a DATE format vs. DATETIME, for example).
  • And "SEV-4" is more of an "FYI" when the input data contains additional columns that aren't required.

Sample uses:

%check_reference_data(SASHELP,CLASS,SASHELP,CLASSFIT,WORK.CHECK);
title "Does CLASSFIT have everything that CLASS has?  Yes, and more";
proc print data=work.check;
run;

%check_reference_data(SASHELP,CLASSFIT,SASHELP,CLASS,WORK.CHECK);
title "Does CLASS have everything that CLASSFIT has?  No, it does not.";
proc print data=work.check;
run;

If you examine the code behind the macro, you might notice that it creates a temporary view of the input data. Why? During work on my project, I noticed that some of the supplied input data differed from the schema only by one or two column names. Sometimes it's easier to "fix" that data on the way in than to send it back to the customer to regenerate. The temp view allows me to apply simple data set options like this:

%check_reference_data(REF,CLASS,CUST,CLASS(rename=(GENDER=SEX)),WORK.CHECK);
title "Fixed for customer-provided data that was SEX-less";
proc print data=work.check;
run;

And by the way, for this check to work I don't need to keep a complete copy of "good" data lying around. I can create an empty table with the desired schema "on the fly" like so, and then use it as the reference table in my macro:

/* create reference table attributes */
/* using SQL, but DATA step works too */
proc sql;
create table work.PATIENT (
  SEX                    VARCHAR(1) FORMAT=$1.,
  YOB                    NUMERIC(8) FORMAT=4.,
  ZIP_CODE               VARCHAR(5),
  PATIENT_ID             VARCHAR(15)
);
quit;

This macro provides a simple, "good-enough" approach for verifying the input data. It provides instant feedback as to whether the data can be used, and gives us confidence that we can proceed to the next step of the analysis.

tags: data sets, SAS programming, SAS tips, sql
May 10, 2012
 

In a previous post I showed how you can use Windows PowerShell (with the SAS Local Data Provider) to create a SAS data set viewer.  This approach doesn't require that you have SAS installed, and allows you to read or export the records within a SAS data set file.

In this post, I'll present two companion scripts that allow you to:

  • Gather the table-level metadata about a SAS data set, including the observation count, created and last-modified dates, data set label, and more.
  • Gather the column-level metadata within each SAS data set, including variable names, types, formats, lengths, and more.

If you make use of the SAS DICTIONARY tables (as seen in SASHELP.VMEMBER and SASHELP.VCOLUMN), these scripts will provide familiar information.  But like my previous example, these scripts do not require a SAS installation.
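For instance, the table-level details that the first script reports mirror what you would get from a DICTIONARY query inside SAS (a minimal example):

proc sql;
   select memname, nobs, crdate, modate, memlabel
      from dictionary.tables
      where libname='SASHELP' and memtype='DATA';
quit;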

Why is this useful? Of course, the best way to read SAS data sets is to use SAS.  And if you have SAS data sets, the probability is high that you have SAS installed somewhere, so why not use it?  It turns out that even among companies that use SAS, not every employee has access to a SAS environment.  (Tragic, right?)  And since SAS data sets are often treated as a corporate asset (or, at least, the information within the data sets is), these are subject to cataloging and auditing by staff who don't use SAS.  These scripts can enable a light-weight auditing process with a minimum of installation/licensing complications.

The three scripts are ReadSasDataset.ps1 (from the previous post), ReadSasDataTables.ps1, and ReadSasDataColumns.ps1. To use them, save each file to your local PC as a .PS1 file. You will also need to make sure that you can run Windows PowerShell scripts, and that you have the SAS OLE DB Local Data Provider installed (free to download).

The output of each of these scripts is in the form of PowerShell objects, which are most useful when piped into another PowerShell cmdlet such as Out-GridView (for visual display) or Export-CSV (for use as input to Excel or another data-driven process).

Examples (as run from a PowerShell console window)

To view the table information about all SAS data sets under the path w:\ (including subfolders):

.\ReadSasDataTables.ps1 w:\ | Out-GridView

Result (click on the image to see the full-size version):

To export the SAS table information to a CSV file:

.\ReadSasDataTables.ps1 w:\ | Export-CSV -NoType -Path c:\report\tables.csv

Result:
"FileName","Path","FileTime","FileSize","TableName","Label","Created","Modified","LogicalRecords","PhysicalRecords","RecordLength","Compressed","Indexed","Type","Encoding","WindowsCodepage"
"users.sas7bdat","w:\","5/8/2012 9:37:03 AM","466944","users","","5/8/2012 9:37:03 AM","5/8/2012 9:37:03 AM","412","412","952","NO","False","","20","65001"
"bloglist.sas7bdat","w:\","5/8/2012 9:37:03 AM","73728","bloglist","","5/8/2012 9:37:03 AM","5/8/2012 9:37:03 AM","28","28","1360","NO","False","","20","65001"
"posts.sas7bdat","w:\","5/8/2012 9:37:09 AM","103555072","posts","","5/8/2012 9:37:07 AM","5/8/2012 9:37:07 AM","41077","41077","2496","NO","False","","20","65001"
"postviews.sas7bdat","w:\","5/8/2012 9:37:09 AM","5120000","postviews","","5/8/2012 9:37:09 AM","5/8/2012 9:37:09 AM","4808","4808","1040","NO","False","","20","65001"
"comments.sas7bdat","w:\","5/8/2012 9:37:12 AM","26943488","comments","","5/8/2012 9:37:11 AM","5/8/2012 9:37:11 AM","7807","7807","3432","NO","False","","20","65001"
"published_posts.sas7bdat","w:\","5/8/2012 9:37:13 AM","11739136","published_posts","","5/8/2012 9:37:13 AM","5/8/2012 9:37:13 AM","4628","4628","2512","NO","False","","20","65001"
"blogsocial.sas7bdat","w:\","5/6/2012 1:19:06 PM","401408","blogsocial","","5/6/2012 1:19:06 PM","5/6/2012 1:19:06 PM","682","682","544","NO","False","","20","65001"

To view the column (variable) information for all SAS data sets within the folder w:\ (including subfolders):

.\ReadSasDataColumns.ps1 w:\ | Out-GridView

Result (click on the image to see the full-size version):

To export the SAS columns information to a CSV file:

.\ReadSasDataColumns.ps1 w:\ | Export-CSV -NoType -Path c:\report\columns.csv

Result (partial):
"File name","Column","Label","Pos","Type","Length","Format","Informat","Indexed","Path","File time","File size"
"users.sas7bdat","ID","ID","1","NUM","0","","","False","w:\","5/8/2012 9:37:03 AM","466944"
"users.sas7bdat","user_login","user_login","2","130","90","$180.","$180.","False","w:\","5/8/2012 9:37:03 AM","466944"
"users.sas7bdat","user_registered","user_registered","3","NUM","0","DATETIME19.","DATETIME19.","False","w:\","5/8/2012 9:37:03 AM","466944"
"users.sas7bdat","display_name","display_name","4","130","375","$750.","$750.","False","w:\","5/8/2012 9:37:03 AM","466944"
"bloglist.sas7bdat","blog_id","blog_id","1","NUM","0","","","False","w:\","5/8/2012 9:37:03 AM","73728"
"bloglist.sas7bdat","name","option_value","2","130","512","$1024.","$1024.","False","w:\","5/8/2012 9:37:03 AM","73728"
"bloglist.sas7bdat","path","path","3","130","150","$300.","$300.","False","w:\","5/8/2012 9:37:03 AM","73728"
"bloglist.sas7bdat","registered","registered","4","NUM","0","DATETIME19.","DATETIME19.","False","w:\","5/8/2012 9:37:03 AM","73728"
"bloglist.sas7bdat","last_updated","last_updated","5","NUM","0","DATETIME19.","DATETIME19.","False","w:\","5/8/2012 9:37:03 AM","73728"
"bloglist.sas7bdat","public","public","6","NUM","0","","","False","w:\","5/8/2012 9:37:03 AM","73728"

 
tags: data sets, ole db, PowerShell, PROC DATASETS
April 13, 2012
 

In about 30 lines of PowerShell script, we can build a SAS data set viewer that:

  • Does not require SAS on the PC
  • Provides very basic filtering capability
  • Also allows for easy export to CSV

All you need is the ability to run PowerShell scripts, and the SAS Local Data Provider for OLE DB.

I said that we could do this with about 30 lines of script. But I like to include some comments, error handling, and spacing for readability -- so the script has ballooned to about 120 lines.

Here's an example of how to run the script in PowerShell (put this all on one line):

ReadSasDataset.ps1 'C:\Data\westernrates.sas7bdat' | 
  Out-GridView -Title "Western Rates data"

Here's a screenshot of the result.

Suppose that you want to export the SAS data set to CSV, perhaps for use in Excel? Try this command (again, all on one line):

ReadSasDataset.ps1 'C:\Data\westernrates.sas7bdat' | 
  Export-CSV -Path C:\Data\WesternRates.csv -NoTypeInformation

Windows PowerShell has many built-in "cmdlets" (such as Export-CSV and the Out-GridView interface) that allow you to easily pipe output from one command to another, and shape the result into exactly what you need.

Download the full script: ReadSasDataset.ps1.

(It has a TXT file extension for security purposes. To use it, save the script as a PS1 file on your local machine.)

Here is the meat of the script. The script uses ADO to "connect" to the data set and read the records, one-by-one. For each field within each record, the script creates a new PowerShell psobject, which holds the "name-value pair" of column name and cell value. These are the actual output of the script. To do something useful with those objects, you pipe the results to another cmdlet.

Note: Those little backwards tickmarks that look like dirt specks on your screen are actually used as the PowerShell line continuation character. I used them here to make the script easier to read. Unlike the SAS language, PowerShell statements usually terminate with the end-of-line, not a semicolon.

# ADO constants used below (values from the ADODB type library; the full
# downloadable script defines these -- and the $filePath parameter -- near the top)
$adOpenDynamic    = 2
$adLockOptimistic = 3
$adCmdTableDirect = 512

$objConnection = New-Object -comobject ADODB.Connection
$objRecordset  = New-Object -comobject ADODB.Recordset

$objConnection.Open("Provider=SAS.LocalProvider;Data Source=`"$filePath`";")
$objRecordset.ActiveConnection = $objConnection

# tell the SAS provider to return FORMATTED SAS values
$objRecordset.Properties.Item("SAS Formats").Value = "_ALL_"

# open the data set
# IMPORTANT: passing in a "missing" value for the connection
# because the connection is already on the RecordSet object
$objRecordset.Open($filePath, [Type]::Missing, `
   $adOpenDynamic, `
   $adLockOptimistic, `
   $adCmdTableDirect) 
   
$objRecordset.MoveFirst()

# read all of the records within the SAS data file
do 
{
  # build up a new object with the field values
  $objectRecord = New-Object psobject
  
  for ($i=0; $i -lt $objRecordset.Fields.Count; $i++)
  {
     # add static properties for each record
     $objectRecord | add-member noteproperty `
     -name $objRecordset.Fields.Item($i).Name `
     -value  $objRecordset.Fields.Item($i).Value;     
  }
  # emit the object as output from this script
  $objectRecord
  
  # move on to the next record
  $objRecordset.MoveNext()
} 
until ($objRecordset.EOF -eq $True)

# close all of the connections
$objRecordset.Close()
$objConnection.Close()

In a future post, I'll provide more PowerShell examples that allow you to see all of the data set attributes for a collection of SAS data set files, similar to what you can see with PROC DATASETS.

tags: data sets, PowerShell
September 13, 2011
 
Kathleen Harkins, Carolyn Maass and Mary Anne Rutkowski, from Merck Sharp and Dohme, collaborated to write T.I.P.S: Techniques and information for programming in SAS® for NESUG 2011. These three women are highly experienced programmers: Harkins has more than 20 years of experience in the pharmaceutical and aerospace industries; Maass has [...]
November 16, 2010
 
Are you afraid of big commitments? Do you like to shop around thoroughly before making a purchase decision? I can't help you with most of Life's Big Choices, but I can help you in one way: I can show you how to learn more about your data set before you open it.

In SAS Enterprise Guide 4.2, we added a feature that allows you to view the details of your data sets (such as the data set labels, how many records they contain, and more) while you browse the contents of a SAS library. What's more, you can also see the column names and column types right from within the file dialog, providing you with that peace of mind that comes only with knowing that, yes, you've selected the correct data set.

But the feature is not turned on by default. Why not? Because gathering this information incurs a slight performance hit as you browse your libraries. For local file-based SAS data sets, you probably won't notice much of a slowdown.

Here's how you turn it on:

  1. Select Tools->Options. The Options window appears.
  2. On the General page, check the box that reads, "Show full details when viewing SAS libraries in File Dialog details view."

The next time that you select File->Open->Data and navigate to the Libraries view, select the Details option from the toolbar in the file dialog, as shown here:

Toolbar for selecting details

 

When you drill down into a SAS library, you'll be overwhelmed with a tremendous set of new details, as shown here:

What the list looks like with details

 

But wait -- there's more! Go back up to the toolbar and tick the Show detail pane option, and on the right-hand side you'll see the list of SAS variables that lurk inside a SAS data set, as you select it within the list.

And with the column list in view

 

Try it! Turn on the option and give it a whirl. It will help you to eliminate "buyer's remorse" (or at least "data-opener's remorse") from your daily life.