Chris Hemedinger

September 29, 2011
 

Filed under "Little changes that you probably didn't notice".

When Microsoft introduced Windows 7 to the desktop, one of the many usability features that they added was "Jump Lists". Jump Lists serve as a sort of shortcut to not just open an application, but "jump right in" to a particular feature or document that the application knows how to manage. It's all about convenience and reducing the number of clicks necessary to accomplish a task.

When developing SAS Enterprise Guide 4.3, we hooked into the "recent documents" aspect of the Jump List for two types of documents: project files (.EGP) and SAS programs (.SAS). Both of these are also easily accessible in SAS Enterprise Guide under File->Recent Projects and File->Recent Programs (new in 4.3).

If you use SAS Enterprise Guide often, you can drag the program icon to your Windows 7 taskbar and "pin" it in place for convenience.  Then you can right-click on that icon to see a list of recently accessed projects and/or programs.

How to hook into the Jump List

The remainder of this post is aimed toward other application developers, as it describes how to hook into the "recent documents" feature of Windows, which automatically puts your documents in your application's Jump List. It's technical and it features .NET C# code, so if that's not your interest, then you may be excused.

To hook into the "recent documents" list, you need to call some Windows APIs -- specifically, the SHAddToRecentDocs routine within Shell32 (yes, even on a 64-bit system). That means you have to use "platform invoke", or pinvoke, within .NET.

To use pinvoke, you first declare a .NET routine to wrap the call to the operating system. Then, you use that routine when appropriate from within your application. At the end of this post, I've included an example of a C# class that wraps the SHAddToRecentDocs routine, along with a helper routine (AddToRecentDocs) to make it easy to use.

In your application, when the end user successfully opens or saves a document (whatever that might be for your application), you can then call the new routine to "register" that document. For example:

public void SaveDocument(string docFullpath)
{
  // statements to save the document to disk go here
  // ...
  ShRecentDocs.AddToRecentDocs(docFullpath);
}

When you call this routine, the "recent document" items are added to the Windows registry at:
HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Explorer\RecentDocs
The names are organized by file extension and are stored as binary blobs, so you might be able to verify that new files show up by examining the registry, but you probably cannot just eyeball it for specific entries.

Here's the wrapper class:

using System;
using System.Runtime.InteropServices;

/// <summary>
/// pinvoke wrapper for adding a local file to the "Recent Documents" list
/// in the Windows shell.
/// Source: pinvoke.net
/// http://www.pinvoke.net/default.aspx/shell32/SHAddToRecentDocs.html
/// </summary>
sealed public class ShRecentDocs
{
    private ShRecentDocs() { }
 
    [System.Diagnostics.CodeAnalysis.SuppressMessage("Microsoft.Design", "CA1008:EnumsShouldHaveZeroValue")]
    public enum ShellAddToRecentDocsFlags
    {
        Pidl = 0x001,
        Path = 0x002,
    }
 
    [DllImport("shell32.dll", CharSet = CharSet.Ansi)]
    public static extern void
      SHAddToRecentDocs(ShellAddToRecentDocsFlags flag, string path);
 
    [DllImport("shell32.dll")]
    public static extern void
      SHAddToRecentDocs(ShellAddToRecentDocsFlags flag, IntPtr pidl);
 
    /// <summary>
    /// This is the simple function to call to add a document to the list
    /// </summary>
    public static void AddToRecentDocs(string path)
    {
        SHAddToRecentDocs(ShellAddToRecentDocsFlags.Path, path);
    }
}
tags: .net, SAS Enterprise Guide, usability, Windows 7
September 26, 2011
 

Today, I came across this article that discusses how software users rarely change the settings in the applications that they use.  Users assume that the software vendor sets the default values as they are for a reason, and who are they to set them otherwise?

I'm a software developer (and all-around-computer-geek), so I'm forever tweaking the settings in applications that I use to suit the way that I work.  As a long-time developer for SAS Enterprise Guide, I know that the "defaults" for certain settings are something that we debate with every release.  Our objective is to have default settings that create the best experience and results for the largest percentage of users.

But SAS Enterprise Guide users (and SAS users in general) are so varied in their backgrounds and job roles that there is no such thing as a "typical" user.  Therefore, I can promise that some of the default behaviors in the application are not optimized for your experience.  Which options need tweaking?  That's up to you.

Some settings are meant to provide the best experience for new users.  Well, being a "new user" is a temporary occupation, isn't it?  Once you graduate to "somewhat experienced user", it's your task to tweak your options to match your new familiarity with the application.

In the past, I've shared my favorite 5 options to tweak when using SAS Enterprise Guide.  I encourage you to take a few minutes to explore these and other options, and see how you can make the tools work better for you as a result.

tags: options, SAS Enterprise Guide, usability
September 22, 2011
 

SAS programming is taught in schools all over the world, including in high schools.  Occasionally, I receive questions via my blog such as this one:

Can somebody help me on this?
Write a short DATA _NULL_ step to determine the largest integer you can store on your computer in 3, 4, 5, 6, and 7 bytes.

This sounds like a homework assignment to me, not a typical "how do I" programming question from a person trying to get a job done.  I'm pretty sure that the professor who assigned it would not sanction a lazy web approach.

Besides that, given the brief problem statement above, how do we know what the professor wants?  Is the student supposed to write a program to calculate the largest integer that you can safely store within SAS?  Or is it okay to be clever and simply use built-in SAS functions to tell you the answer?  The first approach is an exercise in programming logic and math.  The second approach is a test of your resourcefulness -- can you find the answer without a "brute force" approach, perhaps by becoming familiar with SAS documentation?  Both approaches are valid, and will provide you with practice in skills that will help you with your ongoing SAS programming endeavors.

At the risk of rewarding lazy behavior, I will present one method to find the answer.  I'm sharing it only because it sheds light on the useful CONSTANT function, and other readers might find that helpful in their own work.

In this example, we'll use the EXACTINT constant. From the SAS documentation:

The exact integer is the largest integer k such that all integers less than or equal to k in absolute value have an exact representation in a SAS numeric variable of length nbytes. This information can be useful to know before you trim a SAS numeric variable from the default 8 bytes of storage to a lower number of bytes to save storage.

Here's the program:

data _null_;
  array len{8} BYTES_1-BYTES_8;
  do i=3 to 8;
   len{i} = constant('exactint',i);
  end;
  pi = constant('pi');
  put "Decimal:" (BYTES_3-BYTES_8) (=/comma32.0);
  put "Hexidecimal:" (BYTES_3-BYTES_8) (=/hex14.);
  put "Binary:" (BYTES_3-BYTES_8) (=/binary64.);
  put "and a slice of Pi:" (pi) (=/16.14);
run;

You'll have to run the program in SAS to see the results. As my high school history teacher used to say, "I'm not going to spoon-feed the answers to you." (Although I suppose that's what I just did...)

Bonus (even though you don't deserve it)

Here are a couple of resources that may be helpful in future assignments:

tags: CONSTANT function, homework, SAS functions, SAS programming
September 20, 2011
 

Like millions of other Americans, I recently was asked to make a decision of tremendous importance to my household -- a decision that would affect the welfare of everyone in my family. That decision, of course, was whether to continue to receive Netflix movies by mail, or opt for the less-expensive "streaming only" subscription.

Let me just say this up front: we love our Netflix subscription.  We subscribed way back in 2005 on the low-cost "one-disc-at-a-time" plan, and since then we've seen over 180 movies that we received on-loan, delivered via the US Postal Service.  Many of these were movies that we never would have seen otherwise: older films, independent films, and many other titles that would have been difficult to find at a local video rental shop.

But having arrived at this crossroads, it's a good time to try and measure just how much the DVD-by-mail option would cost us, and then we can decide what action to take.  And of course, I used SAS Enterprise Guide to analyze my Netflix history, and thus created additional insight about the movie-rental pattern of my household.

Getting my account history into SAS

One of the things that I like about Netflix is how they allow you to see your entire account history online.   At the time that I'm writing this, this URL will get you to your complete DVD shipping activity (although this could change, as Netflix is restructuring their DVD-by-mail business):

https://www.netflix.com/RentalActivity?all=true

In order for that URL to work, you must be already signed in to your Netflix account in your web browser.  While there are several ways to turn this web page into data, I found the easiest method within Microsoft Excel.  On the Data ribbon menu, select Get External Data->From Web.  On the New Web Query window, paste the URL in the Address field and click Go.  You'll see a preview of the web page where you can select the table of your account history to import as data.

When the content is transferred into the Excel spreadsheet, I saved the file (as NetflixHistory.xlsx), and closed Microsoft Excel.  The spreadsheet doesn't look like data that's ready to analyze yet (lots of extra rows and space as you can see in the example below), but that's okay.  I can fix all of that easily in SAS.

With the data now in an Excel spreadsheet, I fired up SAS Enterprise Guide and selected File->Import Data. After just a few clicks through the wizard, I've got the data in a work data set.

Cleaning the data and calculating the value

The data records for the account history are very simple, containing just four fields for each movie: DVD_Title, Rating (whether we liked it), Shipped (date when Netflix shipped the movie out to me), and Returned (date when Netflix received the movie back from me).  My goal for this project is to measure value, and there are no measures in this data...yet.  I also need to filter out the "garbage" rows -- those values that had some purpose in the HTML page, but don't add anything to my analysis.

I'm going to accomplish all of this within the SAS Enterprise Guide query builder, building it all into a single step.  First, I need to design a few filters to clean up the data, as shown here:

The first three filters will drop all of the rows that don't contain information about a DVD title or shipment.  The last two filters will drop any records that reflect multi-disc shipments, or the occasional replacement shipment from when I reported a damaged disc.  Those are rare events, and they don't contain any information that I need to include in my analysis.

Next, I want to calculate some measures.  The most obvious measure to calculate is "How many days did we have the movie" -- the difference between the Shipped Date and Received Date.  And while that number will be interesting, by itself it doesn't convey value or cost.  I want a number that I can express in dollar terms.  To come up with that number, I will employ the tried-and-true method used by data hackers all over the world: I will Make Something Up.

In this case, I'm going to create a formula that reflects my cost for each movie.  That formula is:

(Netflix Monthly Fee / Days In a Month) * Days We Had the Movie = Cost of the Movie

Using the query builder, I calculated new columns with these values.  I assumed the fee was $10/month (varied over time, but this constant is good enough) and that there are 30 days in a month (again, a "good enough" constant).  Here are the new columns in the query builder:
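
If you prefer to see it as code, here is a rough sketch of what that query builder step produces.  The input data set name (WORK.NETFLIXHISTORY), the output name, and the exact filter conditions are assumptions for illustration; the column names and the $10-per-30-days constants come from the description above.

proc sql;
  create table work.movie_costs as
  select DVD_Title,
         Shipped format=date9.,
         Returned format=date9.,
         /* days between shipment and return */
         (Returned - Shipped) as DaysOut,
         /* ($10 per month / 30 days per month) * days we had the movie */
         calculated DaysOut * (10/30) as CostPerMovie format=dollar8.2
  from work.netflixhistory
  where not missing(DVD_Title)
    and not missing(Shipped)
    and not missing(Returned);
quit;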

After applying these filters and calculations, I finally have a result set that looks interesting to analyze:

By sorting this data by CostPerMovie, I can see that the "cheapest movies" were those that we had out for only 3 days, which is the fastest possible turnaround (example: receive in the mailbox on Monday, watch Monday night, mail out on Tuesday, Netflix receives on Wednesday and ships the next DVD in our queue).  By my reckoning, those DVDs cost just $1 to watch.  The most expensive movie in my list came to $26.33, a Mad Men DVD that sat for 79 days while we obviously had other things to do besides watch movies.

Visualizing the results

To visualize the "Days Out" as a time series, I used the SGSCATTER procedure to generate a simple plot.  You can see that at the start of our Netflix subscription, we were enthusiastic about watching the movies immediately after we received them, and then returning them in order to release the next title from our queue.  These are where the DaysOut values are closer to zero.  But as time goes on and Life Gets Busy, there are more occurrences of "extended-period loans", with higher values for DaysOut.
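
For reference, a minimal PROC SGSCATTER step along these lines produces that kind of plot (the data set and column names are the assumptions from the earlier sketch):

proc sgscatter data=work.movie_costs;
  plot DaysOut*Shipped;
run;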

Because I've calculated the cost/movie with my sophisticated model, I can plot the cost over time by using the SERIES statement in PROC SGPLOT, with this result:
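
Here is a small sketch of that step, again using the assumed data set and column names:

title "Cost per movie over time";
proc sgplot data=work.movie_costs;
  series x=Shipped y=CostPerMovie;
  yaxis label="Cost per movie (USD)";
run;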

This plot makes it easy to see that I've had a few "high cost" DVDs.  But it's still difficult to determine an actual trend from this, because the plot is -- and this is a technical term -- "too jumpy".  To remedy that, I used another task in SAS Enterprise Guide -- one that I probably have no business using because I don't fully understand it.  I used the Prepare Time Series Data task (found under the Tasks->Time Series menu) to accomplish two things:

  • Calculate the moving average of the CostPerMovie over each 10-movie interval, in an effort to "smooth out" the variance among these values.
  • Interpolate the CostPerMovie value for all dates that are covered by these data, so that on any given day I can see the "going rate" of my CostPerMovie, even if that date is not a Shipped Date or Received Date.

This magic happens behind the scenes by using PROC EXPAND, part of SAS/ETS.  And although PROC EXPAND creates some nice plots by using ODS statistical graphics, I created my own series plot again by using PROC SGPLOT:
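
For the curious, here is roughly what such a PROC EXPAND step could look like.  This is only a sketch under assumptions: it requires SAS/ETS, the data set and column names carry over from the earlier sketches, and PROC EXPAND expects the input to be sorted by (and unique on) the ID variable.

proc sort data=work.movie_costs out=work.movie_costs_sorted nodupkey;
  by Shipped;
run;

proc expand data=work.movie_costs_sorted out=work.movie_costs_smoothed to=day;
  id Shipped;
  /* 10-observation moving average of the cost, interpolated to daily values */
  convert CostPerMovie = CostPerMovie_MA / transformout=(movave 10);
run;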

This plot confirms what I already know: our movies have become more expensive over the past 6 years of my subscription.  But more importantly, it tells me by how much: from an initial cost of $3-4, it's now up to nearly $6 per movie -- based solely on our pattern of use.

Important note: The data I collected and analyzed covers only the DVDs we've had shipped to us.  It does not include any movies or shows that we've watched by streaming the content over the Internet.  The "instant watch" feature is an important component of the Netflix model, and we do use this quite a bit.  I know that this accounts for much of the decrease in frequency for our DVD watching.  But by changing their pricing model, Netflix basically asked the question: how much is it worth to you to continue receiving movies by mail, independent of the streaming content?

And I answered that question: it's not worth $6 per DVD to me (as I reckon it, given my pattern of use).  Like millions of others, I've opted out of the DVD-by-mail service.  But we've kept the streaming service!  In a future post, I'll take a look at how we use the streaming content and what value we receive from it. [UPDATE: Here it is, the analysis of my streaming account.]

tags: moving average, Netflix, proc expand, SAS Enterprise Guide, SGPLOT, sgscatter, time series data
September 15, 2011
 

As I mentioned in my introductory post about Windows PowerShell, you can use PowerShell commands as a simple and concise method to collect data from your Windows systems -- information that is ripe for analysis within SAS.

In this example, I'll show a technique for using a SAS program to call PowerShell and funnel the results directly into your SAS session for analysis. This example uses the Get-Process cmdlet (pronounced /command-let/) to collect information about running processes on a given Windows machine.

I built this example in pieces, making sure that each part worked before putting it all together:

  • I ran the powershell command from a command prompt console, and redirected the output to a text file. For example:
    powershell -Command "Get-Process -ComputerName L73391" 
      > c:\temp\outProcesses.txt
  • I worked with the text file in SAS and SAS Enterprise Guide to develop a DATA step that could read/interpret it
  • I combined the command and DATA step into a program using FILENAME PIPE syntax
  • I tested the program on different machines, just to make sure that the behavior of the command and the text file didn't change across machines.

In order to run this example, PowerShell must be available on the machine where SAS runs, and you need administrative rights on the computer that you're querying (note the comment at the top of the program). Here's the complete program:

 /* name of a machine on my network where I have admin rights */
%let machine = L73391; 
filename process pipe
  "powershell -Command ""Get-Process -ComputerName &machine.""";
data processes;
	infile process firstobs=4;
	length
		Handles 8
		WorkingSet 8
		CPU 8
		ID $ 8
		ProcessName $ 255;
	format
		WorkingSet comma12.
		Handles comma6.;
	input @1  Handles
		@25 WorkingSet
		@42 CPU 42-50
		@51 ID
		@59 ProcessName;
run;
 
title "Processes on &machine";
/* with an ODS destination active, this will create a report */
proc sql;
	select * from processes order by WorkingSet desc;
quit;
 
proc means data=processes;
	var Handles WorkingSet;
run;

This produces a simple report of the active processes on the given machine. Here is a partial example of the output, in descending order of "Working Set" (which is one measurement of the amount of system memory in use by a process).

tags: filename pipe, PowerShell, sas administration, SAS tips
September 14, 2011
 

Rick Wicklin and I are engaged in an arms race of birthday-related blog posts.  To recap:

Now I have no choice but to respond again.  This isn't my fault. I didn't start this.

Today, I'm going to take the data that Rick supplied and attempt to answer the deep scientific question, "What is our (zodiac) sign?"

You might as well ask the ever-reliable desktop scientist, the Magic 8-Ball: "Reply hazy, try again."  As you can see from the PROC FREQ output below, there doesn't appear to be a clear dominant horoscope that might be influencing our collective fate.

Attributes of Virgo (meticulous and reliable) and Taurus (warm-hearted and loving) are definitely reflected in our corporate culture, but I'm not sure that these signs are concentrated enough within our population to affect that.  Perhaps the most useful thing that came out of this exercise is my user-written SAS format that equates birthdates to signs of the zodiac.  Here's the complete program, which you ought to be able to run as-is from SAS 9.2 or later, or in SAS Enterprise Guide.

filename bdays url 
    "http://blogs.sas.com/content/iml/files/2011/09/SASBirthdays.csv"
  /* behind a corporate firewall? don't forget the PROXY= option here */
  ;
 
/* SAS format for zodiac signs in a given year */
proc format lib=work;
	value sign
	'21Mar2000'd - '19Apr2000'd = 'Aries'
	'20Apr2000'd - '20May2000'd = 'Taurus'
	'21May2000'd - '20Jun2000'd = 'Gemini'
	'21Jun2000'd - '22Jul2000'd = 'Cancer'
	'23Jul2000'd - '22Aug2000'd = 'Leo'
	'23Aug2000'd - '22Sep2000'd = 'Virgo'
	'23Sep2000'd - '22Oct2000'd = 'Libra'
	'23Oct2000'd - '21Nov2000'd = 'Scorpio'
	'22Nov2000'd - '21Dec2000'd = 'Sagittarius'
	/* split Capricorn to make two valid ranges */
	/* that don't span the calendar boundary */
	'22Dec2000'd - '31Dec2000'd = 'Capricorn'
	'01Jan2000'd - '19Jan2000'd = 'Capricorn'
	'20Jan2000'd - '18Feb2000'd = 'Aquarius'
	'19Feb2000'd - '20Mar2000'd = 'Pisces'
	other = 'Unknown';
run;
 
data bdays;
	infile bdays dsd firstobs=2;
	input mon day;
	length birthdate 8 birthsign 8;
	format birthdate date5.;
	format birthsign sign.;
	label birthsign="Zodiac sign";
	/* make sure we pick a leap year, so 29Feb is valid */
	birthdate = mdy(mon,day,2000);
	birthsign = birthdate;
run;
 
ods graphics on / height=400 width=800;
title "What's our sign?";
ods noproctitle;
proc freq data=bdays order=data;
	tables birthsign /plots=freqplot(scale=percent);
run;
 
/* clear the filename */
filename bdays;

I'm sure that the question has occurred to you: what about my Facebook friends?  How are they distributed among the stars?  Of those that report their birthdays, here is how they fall:

As you can see, there really isn't a dominant sign among them.  However, Aquarius might be a bit underrepresented.  That's a shame, because I could probably use more honest and loyal people among my friends.

tags: facebook, magic 8-ball, proc format, proc freq, zodiac
September 12, 2011
 

Windows PowerShell is one of my favorite tools to use for gathering information from my PC and network, and also for automating processes.  If you are familiar with UNIX shells (such as Korn shell or its variants), you will probably be very comfortable with Windows PowerShell.  Just like its UNIX predecessors, Windows PowerShell allows you to run commands and combinations of commands from an interactive console window.  You can also write PowerShell scripts (saved as PS1 files), which allows you to combine the commands and programming logic to run more sophisticated operations.

I have used Windows PowerShell to automate some of my SAS-related processes, such as batch processing with SAS Enterprise Guide.  I've also used it within my development work to gather metrics about files, computers on the network, and running processes -- all of which are interesting activities for a system administrator.  Because Windows PowerShell allows you to gather different types of information and easily save it in CSV files, it's a convenient way to generate data sources for further analysis using SAS.

In the near future, I intend to share with you a few examples that combine SAS with Windows PowerShell.  But before I do that, I need to share some basic information for getting started with PowerShell, because there are a few "gotchas" you might experience as a new user.

Where to find Windows PowerShell

If you are running Windows 7 or Windows Server 2008, Windows PowerShell is built in.  If you're on an earlier version such as Windows Vista or Windows XP, you might not have it yet, so you would need to download it from Microsoft's Scripting Center.

Once downloaded, you should be able to find various shortcuts in Start->All Programs->Accessories->Windows PowerShell.  On a 64-bit system you will find 4 shortcuts, as shown here:

In my experience, I almost always use the x86 versions of the PowerShell engine.  These allow me to work with most aspects of the system, plus my favorite 32-bit applications such as SAS Enterprise Guide or Microsoft Office.  I can still drive other 64-bit applications as long as they are launched out-of-process.

The Windows PowerShell (x86) shortcut invokes a command prompt environment, reminiscent of the familiar DOS shell (but a little fancier).  The Windows PowerShell ISE (x86) shortcut opens the Integrated Scripting Environment, which presents a program window, command window, and an output/log window (gee, why does that seem so familiar?).

Enable Windows PowerShell scripts to run

Here's the most baffling part about getting started with PowerShell: by default, you cannot run PowerShell scripts on your system.  It's a capability that comes as disabled out-of-the-box.  You can run script commands from the console, but you won't be able to execute scripts that are saved as PS1 files.  If you try, you will see an error message with text similar to this:

File C:\Test\TestScript.ps1 cannot be loaded because the 
  execution of scripts is disabled on this system.
  Please see "get-help about_signing" for more details.

At line:1 char:23
+ .\Test\TestScript.ps1 <<<<
    + CategoryInfo          : NotSpecified: (:) [], PSSecurityException
    + FullyQualifiedErrorId : RuntimeException

This default policy setting is presumably for your own safety, though I must admit I don't understand it.  Fortunately, you can easily change the policy by using the Set-ExecutionPolicy command:
Set-ExecutionPolicy RemoteSigned

Run this command from the PowerShell console, select 'Y' to confirm, and you'll now be able to run local PS1 files on your PC.  ("RemoteSigned" indicates that local scripts will run, but scripts that you download from the internet will run only if they are signed by a trusted party.)  You can read more about setting these policies for running scripts in the Windows PowerShell Owner's Manual.

A quick PowerShell example

Earlier in this post, I promised that Windows PowerShell makes it easy to collect system information and save it as CSV.  Once we have it in CSV format, it's easy to import and analyze in SAS.  Here is a simple example that you can run from the PowerShell console:

Get-Process | Export-CSV -Path ./processes.csv -NoTypeInformation

The Get-Process command generates a detailed list of the running processes on your machine. The output is piped to the Export-CSV command, which drops the result into a file named processes.csv in the current directory.  That file is then easy to import into SAS using SAS Enterprise Guide (or by writing your own SAS program).
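
For example, a minimal PROC IMPORT step like this one will read it in.  The file path is an assumption -- point it at wherever you saved processes.csv -- and the column names in the resulting data set come straight from the Get-Process output.

proc import datafile="C:\temp\processes.csv"
            out=work.processes
            dbms=csv
            replace;
  getnames=yes;
run;

/* quick peek at the result */
proc print data=work.processes(obs=10);
run;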
tags: automation, PowerShell, sas administration
September 10, 2011
 

Earlier today, Rick posted interesting information about which time of year the most babies are born, at least in the USA.

I don't have data nearly as extensive as what's available at the NCHS, but I do have a sample of birthday records to compare against Rick's findings.  My sample comes from my Facebook account, where many of my friends share the information about their birthdays so that at least once a year, they receive friendly cybergreetings from all the people that they've connected with.

As you might know, I wrote an application that can turn my Facebook friend data into SAS data sets.  At the moment, I'm connected to 355 people on Facebook (that number drops by the minute as my friends read this blog post).  Of the 355 friends, only 236 share their birthday information (month and day).  Even fewer (only 106) share their birth year as part of that.  The policies for sharing this information are among the privacy settings that you can tweak on your Facebook account.

As part of the results of the SAS program that I've generated, I've included a report of "Known birthdays in calendar order".  By using these data records, I can check the small sample of my friends' birthdays against the slightly larger set of the entire country.

I've included an "anonymized" table of just the September birthdays here.  Of the 236 birthday records that I have, 19 of them fall in September.

For my data, September is not the "birthiest" month; April is.  I can see that with this simple SGPLOT output of birthday frequency by month.

But does this mean that April (with 28 birthdays) was truly the busiest period for birthdays among my friends?  Month boundaries are a bit arbitrary ("30 days has September, April, June...").  In Rick's post, he pointed out that weeks 37, 38, and 39 combined represent the peak number of births.  What if I generate a report of frequency of births per week number?  (I used the WEEK function to calculate the week number from each date.)   According to the chart below, week 14 (which creeps into April) has the highest number, with 10 birthdays.

Using a quick-and-dirty DATA step and the LAG function, I built another column that represents "3-week running total", just to see where the highest 3-week period falls.  In my case, the highest-volume 3-week period appears to end with week 40.  That actually falls pretty close to the peak time of year reflected in the national statistics.
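
Here is a sketch of that approach.  The data set and variable names (a WORK.BIRTHDAYS table with a numeric BIRTHDATE column) are assumptions; the WEEK and LAG functions are the pieces that matter.

data work.bday_weeks;
  set work.birthdays;             /* assumed: one row per friend, BIRTHDATE is a SAS date */
  week_num = week(birthdate);     /* week-of-year number */
run;

proc freq data=work.bday_weeks noprint;
  tables week_num / out=work.week_counts(keep=week_num count);
run;

data work.week_totals;
  set work.week_counts;
  /* 3-week running total: this week plus the two weeks before it */
  total_3wk = sum(count, lag1(count), lag2(count));
run;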

What insight have I gained about my friends here?  None, really.  That's because when it comes to the information that they share on Facebook, many of my friends are less than honest...as illustrated in the following story.

Earlier this year I noticed (on Facebook) that my friend John had a birthday coming up, so I wrote a "Happy Birthday" message on his Facebook wall.  (I probably said something really clever...who knows?)  When I next saw him in person (I do occasionally interact with people that way), John said to me: "Chris, January 1 is my cyber birthday, the birthday I use online to separate those who know me well from the posers."  Well, I guess we know where I fall among those groups.

tags: birthdays, facebook, lag, SGPLOT
September 7, 2011
 

I use Google Reader to keep up with SAS-related conversations on the blogosphere.   I thought it would be nice to share the lists of blogs that I follow as "shared bundles".  If you also use Google Reader, it will be very easy for you to add these bundles to items that you follow.  (Note: in order to use Google Reader you need a Google account, but my guess is that most people have at least one of those these days.)

If you have or know of other SAS-related blogs, post back in the comments so I can add them to my list of items to watch.

The following links will lead you to my two shared bundles:

tags: Google reader, SAS blog, SAS tips
September 6, 2011
 

We ship many sample data sets with SAS products. By using well-understood sample data sets, we can teach concepts or show off product features without distracting the audience/students with data collection or prep.

At least, that's the way it's supposed to work. But occasionally the sample data can cause a distraction on its own, especially when its origin is called into question. Jiangtang Hu tells the colorful story of the Fisher's Iris data, which can be found in SAS 9.2 as SASHELP.IRIS.

According to Mr. Hu, the SASHELP.IRIS data set contains some errors: deviations from the original Iris data published by R.A. Fisher. Three of the errors are well-known, propagated by many scientists over the years in several different repositories. A fourth error exists only in the version shipped with SAS 9.2.


I became familiar with the IRIS data set a couple of years ago when we introduced the Scatter Plot Matrix task into SAS Enterprise Guide 4.3. The IRIS data, with its one categorical column ("Species") and several measurement columns ("PetalWidth", "PetalLength", etc.), is a great way to show off the capabilities of PROC SGSCATTER, which is the SAS procedure that the new task uses to create its results (pictured here). SASHELP.IRIS was the primary data set used by our testers to verify the behavior of the Scatter Plot Matrix task.
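
If you want to try it yourself, a small PROC SGSCATTER step like this one (illustrative only -- not the exact code that the task generates) produces a comparable matrix from SASHELP.IRIS:

proc sgscatter data=sashelp.iris;
  matrix SepalLength SepalWidth PetalLength PetalWidth / group=Species;
run;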

Mr. Hu's blog post aroused my curiosity. Using our internal defects tracking system, I went back in time to September 2007 and saw the activity that resulted in the inclusion of the IRIS data in SASHELP. The version that we included was meant to be a copy of the data that we had already been shipping with SAS/STAT examples for many years. That version already had the three common errors known to the community; I cannot yet explain how the fourth error might have been introduced (although I might continue the research for an episode of CSI: SASHELP Data).

I also learned that the errors did not go unnoticed by our own staff. SAS Press author Warren Kuhfeld noticed the deviations as he prepared content for his book about Statistical Graphics in SAS. Warren applied his high professional standards to the problem and took steps to make sure that with SAS 9.3, the SASHELP.IRIS data set was corrected. The data set now matches the UCI version of the Iris data (consistent with what has been supplied with SAS examples in the past), but with the observations sorted by Species (so observation order might differ, but the values are the same).

tags: kuhfeld, ODS Graphics, sample data, sashelp.iris, sgscatter