sas programming

6月 152012
 

A colleague was recently working with a web service that supplies some date-time values using the Microsoft Windows internal representation. He called the web service to retrieve those values (along with other data) from SAS, and he needed convert these values to SAS date-time values.

The Microsoft definition for a date-time value is this:

The number of ticks since 12:00 midnight, January 1, 0001 A.D. (C.E.) in the Gregorian Calendar.

Right. Now, what's a tick?

A tick is 100 nanoseconds. There are 10 million ticks in a second.

A SAS date-time value, as you might already know, is this:

The number of seconds since 12:00 midnight on January 1, 1960.

This is beginning to smell like an algebra word problem. To solve, you start with the ticks value of '01JAN1960:00:00:00'dt, subtract that from the given time value, divide by 10 million, and -- kapow! -- you've got a SAS date-time value. (Note: doesn't account for a few Leap Day nuances for way-out future dates.)

So you need to know the number of ticks for the SAS baseline date-time? Of course, that's 618199776000000000.

Oh, you don't have that memorized? No worries -- we can get these values by using my old friend, Windows PowerShell. In a PowerShell console, I used the Get-Date command to get a date-time object for the value in question, and then queried the Ticks property. I can also use PowerShell to remind me how many ticks are in a second (a different sort of tick check!):

PS C:\> $base = Get-Date "01/01/1960 12:00:00 AM"
PS C:\> $base.Ticks
618199776000000000
PS C:\> $onesec = Get-Date "01/01/1960 12:00:01 AM"
PS C:\> $onesec.Ticks - $base.Ticks
10000000

Here's a SAS program that shows this approach:

data _null_;
  length start 8 date 8 diff 8 sasdate 8;
 
  /* Ticks value for 01JAN1960:0:0:0 = */
  start=618199776000000000; 
  infile datalines dsd;
  input date;
 
  diff=(date-start)/10000000;
  sasdate=intnx('second','01jan1960:00:00:00'dt,diff);
  put sasdate datetime.;
datalines;
634710667200000000
634710662300000000
634543877200034000
run;

Output:

26APR12:19:52:00
26APR12:19:43:50
16OCT11:18:48:40

An aside:
Friends, if you ever get the chance to implement a web service, do not use vendor-specific date or date-time representations. There are ISO standards for these things that will help your consumers! And SAS can interpret the values that comply with these standards, with no fancy math necessary.
tags: date-time, PowerShell, SAS programming, web services
6月 042012
 

I recently read a blog post in which a SAS user had to rename a bunch of variables named A1, A2,..., A10, such as are contained in the following data set:

/* generate data with variables A1-A10 */
data A;
array A[10] A1-A10 (1);
do i = 1 to 10; A{i} = i; end;
output;
run;

He needed to rename the variables to B1, B2, ..., B10 in order to merge some data. He did it like this:

/* rename Ai = Bi for i=1 to 10 */
data B;
set A (rename=(A1=B1 A2=B2 A3=B3 A4=B4 A5=B5 
               A6=B6 A7=B7 A8=B8 A9=B9 A10=B10));
run;

Many SAS users know that you can use numbered range lists to specify variables on a VAR statement or on a MODEL statement in a SAS procedure, but did you know that you can also specify a numbered range list in the RENAME= option?

/* rename A1-A10 to B1-B10 */
data C;
set A (rename=(A1-A10=B1-B10));
run;

Remember this trick when you need to rename dozens of variables that have numerical suffixes and a common prefix.

tags: SAS Programming, Tips and Techniques
5月 042012
 

A reader asked:

I want to create a vector as follows. Suppose there are two given vectors x=[A B C] and f=[1 2 3]. Here f indicates the frequency vector. I hope to generate a vector c=[A B B C C C]. I am trying to use the REPEAT function in the SAS/IML, language but there is always something wrong. Can you help me?

This is probably a good time to remind everyone about the SAS/IML Community (formerly known as a Discussion Forum). You can post your SAS/IML questions there 24 hours a day. That is always a better plan than making a personal appeal to me, because I receive dozens of questions like this every month, and there is no way that I can personally reply. There are lots of experienced SAS/IML experts out there, so please use the SAS/IML Community to tap into that knowledge.

That said, I think the answer to this reader's question makes an interesting example of statistical programming with SAS/IML software. It is trivial to solve this in the DATA step (see the end of this article), but how might you solve it in the SAS/IML language? If you'd like to try to solve this problem yourself, stop reading here. Spoilers ahead!

Create a new vector that duplicates frequencies

The goal is to write a function that duplicates or "expands" data that have a frequency variable. The important function to use for this task is the CUSUM function, which computes the cumulative frequencies. Let's look at a simple example and apply the CUSUM function to the frequency vector:

proc iml;
values={A,B,C,E};
freq = {2,1,3,4};
cumfreq = cusum(freq);
print values freq cumfreq;

As shown in the output, the cumfreq variable contains the indices for the expanded data. The expanded data will be a vector that contains 10 elements. The first data value (A) repeats twice (the freq value), so it repeats until element 2 (the cumfreq value) in the expanded vector. The second category fills element 3. The next category repeats 3 times, so it occupies up through element 6 in the expanded vector. The last category repeats until element 10. The following DO loop specifies each data value and the indices of the expanded vector that it should occupy:

print (values[1])[label="value"] (1:cumFreq[1])[label="Indices"];
do i = 2 to nrow(values);
   bIdx = 1 + cumFreq[i-1]; /* begin index */
   eIdx = cumFreq[i];       /* end index */
   value = values[i];
   print value (bIdx:eIdx)[label="Indices"];
end;

The output shows that we have all the information we need to allocate a vector of length 10 and fill it with the data values, where the ith value is repeated freq[i] times. The key, it turns out, is to use the CUSUM function to find the indices that correspond to the each data value.

A module to compute the expanded data

In SAS procedures that support a FREQ statement, the frequency values must be positive integers. If the frequency value is missing or is a nonpositive value, the corresponding data value is excluded from the analysis. It is easy to add that same feature to a module that takes a vector of values and a vector of frequencies and returns a vector that contains the data in expanded form. This is implemented in the following SAS/IML module, which allocates the result vector with the first data value in order to avoid handling the first element outside of the DO loop:

start expandFreq(_x, _freq);
   /* Optional: handle nonpositive and fractional frequencies */
   idx = loc(_freq > 0); /* trick: in SAS this also handles missing alues */
   if ncol(idx)=0 then return (.);
   x = _x[idx];
   freq = round( _freq[idx] );
 
   /* all frequencies are now positive integers */
   cumfreq = cusum(freq);
 
   /* Initialize result with x[1] to get correct char/num type */
   N = nrow(x);
   expand = j(cumfreq[N], 1, x[1]); /* useful trick */
 
   do i = 2 to N;
      bIdx = 1 + cumFreq[i-1]; /* begin index */
      eIdx = cumFreq[i];       /* end index */
      expand[bIdx:eIdx] = x[i];/* you could use the REPEAT function here */
   end;
   return ( expand );
finish;
 
/* test the module */
values={A,B,C,D,E,F};
freq = {2,1,3,0,4,.}; /* include nonpositive and missing frequencies */
y = expandFreq(values, freq);
print values freq y;

Notice that you don't actually need to use the REPEAT function because SAS/IML is happy to assign a scalar value into a vector. The scalar is automatically repeated as often as needed in order to fill the vector.

A DATA step solution

As indicated at the beginning of this post, the DATA step solution is quite simple: merely use the OUTPUT statement in a loop, as shown in the following example:

data Orig;
input x $ Freq;
datalines;
A 2
B 1
C 3
D 0
E 4
F .
;
run;
 
/* expand original data by frequency variable */
data Expand;
keep x;
set Orig;
if Freq<1 then delete;
do i = 1 to int(Freq);
   output;
end;
run;
proc print data=Expand; run;

The output data set contains the same data as the y vector in the SAS/IML program.

tags: Data Analysis, SAS Programming, Statistical Programming
4月 182012
 

SAS software provides many run-time functions that you can call from your SAS/IML or DATA step programs. The SAS/IML language has several hundred built-in statistical functions, and Base SAS software contains hundreds more. However, it is common for statistical programmers to extend the run-time library to include special user-defined functions.

In a previous blog post I discussed two different ways to apply a log transformation when your data might contain missing values and negative values. I'll use the log transformation example to show how to define and call user-defined functions in SAS/IML software and in Base SAS software.

A "safe" log transformation in the SAS/IML language

In the SAS/IML language, it is easy to write user-defined functions (called modules) that extend the functionality of the language. If you need a function that safely takes the natural logarithm and handles missing and negative values, you can easily use the ideas from my previous blog post to create the following SAS/IML function:

proc iml;
/* if Y>0, return the natural log of Y
   otherwise return a missing value  */
start SafeLog(Y);
   logY = j(nrow(Y),ncol(Y),.); /* allocate missing */
   idx = loc(Y > 0);            /* find indices where Y > 0 */
   if ncol(idx) > 0 then logY[idx] = log(Y[idx]);
   return(logY);
finish;
 
Y = {-3,1,2,.,5,10,100}; 
LogY = SafeLog(Y);
print Y LogY;

The program is explained in my previous post, but essentially it allocates a vector of missing values and then computes the logarithm for the positive data values. The START and FINISH statements are used to define the SafeLog function, which you can then call on a vector or matrix of values.

In this example, the function is defined only for the current PROC IML session. However, you can store the function and load it later if you want to reuse it.

Defining a "safe" log transformation by using PROC FCMP

You can also extend the Base SAS library of run-time functions. The FCMP procedure enables you to define your own functions that can be called from the DATA step and from other SAS procedures. (The MCMC procedure has an example of calling a user-defined function from a SAS/STAT procedure.) If you have never used the FCMP procedure before, I recommend Peter Eberhardt's 2009 paper on defining functions in PROC FCMP. For a more comprehensive treatment, see Jason Secosky's 2007 paper.

Technically, you don't need to do anything special in the DATA step if you want a SAS missing value to represent the logarithm of a negative number: the DATA step does this automatically. However, the DATA step also generates some scary-looking notes in the SAS LOG:

NOTE: Invalid argument to function LOG at line 72 column 5.
RULE:      ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+
74         -3 1 2 . 5 10 100
x=-3 y=. _ERROR_=1 _N_=1
NOTE: Missing values were generated as a result of performing an operation on missing values.
NOTE: Mathematical operations could not be performed at the following places. The results of
      the operations have been set to missing values.

I prefer my programs to run with a clean, healthy-looking SAS LOG, so I will use PROC FCMP to define a SafeLog function that has the same behavior (and name!) as my SAS/IML function:

proc fcmp outlib=work.funcs.MathFuncs;
function SafeLog(y);
   if y>0 then return( log(y) );
   return( . );
endsub;
quit;

The function returns a missing value for nonpositive and missing values. The definition of the function is stored in a data set named WORK.FUNCS, which will vanish when you exit SAS. However, you can create the definition in a permanent location if you want to call the function in a later SAS session.

In order to call the function from the DATA step, use the CMPLIB= option, as shown in the following example:

options cmplib=work.funcs;  /* define location of SafeLog function */
data A;
input x @@;
y = SafeLog(x); /* test the function */
datalines;
-3 1 2 . 5 10 100 
;
run;

The result is not shown, but it is identical to the output from the SAS/IML program.

You might not have need for the SafeLog function, but it is very useful to know how to define user-defined functions in SAS/IML software and in Base SAS software. SAS/IML modules and PROC FCMP functions make it easy to extend the built-in functionality of SAS software.

tags: Data Analysis, SAS Programming, Statistical Programming, Tips and Techniques
4月 102012
 

Last week I blogged about how to construct a smoother for a time series for the temperature in Albany, NY from 1995 to March, 2012. I smoothed the data by "folding" the time series into a single "year" that contains repeated measurements for each day of the year.

Experts in time series analysis would use different techniques to analyze these data, so today I'm giving my colleague, Udo Sglavo, a chance to comment on how he might analyze these data. I am not knowledgeable about the Singular Spectrum Analysis method that he presents, but it seems like a powerful mixture of principal component analysis and orthogonal decomposition analysis in the time domain. Enjoy this guest post!

-- Rick


Udo Sglavo

Rick’s blog about smoothers for periodic data caught my attention because I recently read a SAS Global Forum paper, "An Introduction to Singular Spectrum Analysis with SAS/ETS Software" by M. Leonard, et al. Singular spectrum analysis is a relatively new approach to modeling long time series data for which patterns (such as trends and cycles) are difficult to visualize and analyze. The general idea of singular spectrum analysis is to apply nonparametric techniques for decomposing time series data into principal components. Since the paper also used temperature data as an example, I thought it would be fun to apply singular spectrum analysis to the Albany temperature data.

It is good practice to visualize the data first, so I plotted the complete time series and the seasonal cycles using monthly temperature averages. You can plot the data by using the TIMESERIES procedure in SAS/ETS software:

proc timeseries data=tempdata out=_null_ plot=(series cycles);
title "Average Temperature per Month";
id date interval=month accumulate=average;
var temperature;
run;

As expected the temperature data shows a strong seasonal behavior with high average temperatures in summer and low average temperatures in winter.

As a next step, I applied a singular spectrum analysis to decompose the data into two main components:

proc timeseries data=tempdata out=_null_ plot=SSA;
title "SSA Analysis: Average Temperature per Month";
id date interval=month accumulate=average;
SSA / LENGTH=12 GROUPS=(1)(2 3);
var temperature;
run;

The first component (labeled 1) seems to reflect the long-term trend, whereas the second component (labeled 2) reflects the cyclic behavior of the temperatures. When overlaying individual components with the original series, this relationship becomes even more apparent. For example, the following plot shows that the first component looks like a moving average:

Other times series analysis techniques, such as seasonal decomposition, could also be used to analyze these data.

I could have decomposed the original series into more components to identify additional patterns in the data, as is shown in the paper "An Introduction to Singular Spectrum Analysis with SAS/ETS Software". See the documentation for the TIMESERIES procedure for further information about the singular spectrum analysis, how to identify the number of patterns in the data, and its implementation in SAS/ETS software.

In conclusion, my analysis suggests that even after removing seasonal effects of temperature from the data, the long-run average monthly temperature trend is not constant. The first SSA component goes up and down, which shows that additional factors affect the weather in Albany. The most recent Albany winter was one of the warmest since 1995, but the data also show other winters (such as 2002 and 2006) during which Albany experienced higher than average winter temperatures.

tags: Data Analysis, SAS Programming
4月 062012
 

In yesterday's post, I discussed a "quick and dirty" method to smooth periodic data. However, after I smoothed the data I remarked that the smoother itself was not exactly periodic. At the end points of the periodic interval, the smoother did not have equal slopes and the method does not guarantee that the predicted values are the same. The lack of periodicity occurs because of "edge effects" in the loess algorithm.

In this article I show how to correct these deficiencies to construct a truly periodic smoother. You can download the program used to construct the periodic loess smoother.

Creating a periodic smoother

The problem, as I discussed in my previous article, is that the loess algorithm behaves differently at the ends of the data range than in the middle. The loess algorithm uses the k observations nearest to x to predict a value at x. In the middle of the data, about k/2 points on either side of x are used to form a prediction at x. However, for observations near the minimum of the data, the algorithm uses about k points to the right of x. For observations near the maximum of the data, the loess algorithm uses about k points to the left of x. This asymmetry leads to the loess curve being aperiodic, even when the data are periodic.

A solution is to translate a copy of the data to the left and to the right before fitting the loess curve. By extending the data, an observation near the minimum of the data still has k/2 points to its right and k/2 points to its left. Furthermore, the points to the right are exactly the k/2 observations with the largest x values.

If you know ahead of time that you only need k/2 points to the left of the original data, you can just translate k/2 points. However, in the following DATA step I translate the entire set of data to the left and to the right. This simplifies the code and is sometimes necessary at the modeling stage of an analysis.

/* extend data to each side */
data Periodic;
set TempData(in=before) TempData TempData(in=after);
if before then 
   proportion = proportion - 1; /* (-1,0] */
if after then 
   proportion = proportion + 1; /* (1,2] */
run;

I want to use 0.167 as the loess smoothing because that was the value used in my previous analysis.. But I need to be careful: I now have three times as much data, so I need to choose a smoothing parameter that is 1/3 smaller. In the following SAS statements, I create a data set to score the predicted values and I call PROC LOESS with the smoothing parameter 0.167 / 3 = 0.0557:

data Score;
do proportion = 0 to 1 by 1/365;
   output;
end;
 
/* 3 times the data, so use 1/3 the smoothing parameter! */
proc loess data=Periodic plots(only maxpoints=none)=(FitPlot CriterionPlot);
model Temperature = Proportion/ smooth=0.0557 interp=cubic;
score data=Score;
ods output ScoreResults=Fit;
run;

I can now plot the original data and overlay a truly periodic loess curve:

data Combine;
merge TempData Fit(rename=(Proportion=Prop));
run;
 
proc sgplot data=Combine;
scatter x=Proportion y=Temperature / transparency=0.8;
scatter x=Prop y=Temp / markerattrs=(color='gray' symbol=CircleFilled) legendlabel="Winter 2011-12";
series x=Prop y=P_Temperature/ lineattrs=(thickness=4px) legendlabel="Periodic smoother"; /* truly periodic */
yaxis grid; 
title "Temperature in Albany, NY (1995-2012)";
run;

Notice that although the predicted values have not changed very much, the slopes of the loess curve at the ends of the data match up. The curve is a periodic smoother for these data. This method (extending the data in both directions) works for smoothers other than the loess smoother. As long as the smoother uses local interpolation, including spline interpolation, this technique should work.

The SAS/IML language has a built-in routine for fitting cubic splines to periodic data. The documentation gives examples of how to use it.

Do you have a favorite alternative method for smoothing periodic data? Leave a comment.

tags: Data Analysis, SAS Programming
4月 052012
 

Over at the SAS and R blog, Ken Kleinman discussed using polar coordinates to plot time series data for multiple years. The time series plot was reproduced in SAS by my colleague Robert Allison.

The idea of plotting periodic data on a circle is not new. In fact it goes back at least as far as Florence Nightingale who used polar charts to plot the seasonal occurrence of deaths of soldiers during the Crimean War. Her diagrams are today called "Nightingale Roses" or "Coxcombs."

The polar charts created by Kleinman and Allison enable you to see general characteristics of the data and could be useful for comparing the seasonal temperatures of different cities. A city such as Honolulu, Hawaii, that has a small variation in winter-summer temperatures will have a polar diagram for which the data cloud is nearly concentric. Cities with more extreme variation—such as the Albany, NY, data used by Kleinman and Allison—will have a data cloud that is off center. Comparing cities by using polar charts has many of the same advantages and disadvantages as using radar charts.

However, if you want to model the data and display a fitted curve that shows seasonality, then a rectangular coordinate system is better for displaying the data. For me, trying to follow a sinusoidal curve as it winds around a circle requires too much head-tilting!

Fitting periodic data: The quick-and-dirty way

You can visualize periodic time-series data by "folding" the data onto a scatter plot. The easiest way to do this is to plot the day of the year for each data point. (The day of the year is called the "Julian day" and is easily computed by applying the SAS JULDAYw format.) That produces a scatter plot for which the horizontal axis is in the interval [1, 365], or [1, 366] for leap years. An alternative approach is to transform each date into the interval (0,1] by dividing the Julian day by the number of days in the year (either 365 or 366). The following SAS code performs a "quick and dirty" fit to the temperature data for Albany, NY. The code does the following:

  • A DATA step reads the temperatures for Albany, NY, from its Internet URL. This uses the FILENAME statement to access data directly from a URL.
  • The Julian day is computed by using the JULDAY3. format.
  • The proportion of the year is computed by dividing the Julian day by 365 or 366.
  • The winter of 2011-2012 is appended to the data to make it easier to accent those dates while using transparency for the bulk of the data.
  • The SGPLOT procedure is used to create a scatter plot of the data and to overlay a loess curve.

/* Read the data directly from the Internet. This DATA step adapted from 
   Robert Allison's analysis: http://sww.sas.com/~realliso/democd55/albany_ny_circular.sas */
filename webfile 
   url "http://academic.udayton.edu/kissock/http/Weather/gsod95-current/NYALBANY.txt" 
   /* behind a corporate firewall? don't forget the PROXY= option here */
   ;
data TempData;
infile webfile;
input month day year Temperature;
format date date9.;
date=MDY(month, day, year);
dayofyear=0; dayofyear=put(date,julday3.);
/* incorporate leap years into calculations */
Proportion = dayofyear / 
             put(mdy(12, 31, year), julday3.);
label Proportion = "Day of Year (Normalized)";
CurrentWinter = (date >='01dec2011'd & date<='15mar2012'd);
if Temperature^=-99 then output;
run;
 
/* Technique to overlay solid markers on transparent markers */
data TempData;
set TempData              /* full data (transparent markers) */
    TempData(where=(CurrentWinter=1) /* special data (solid) */
             rename=(Proportion=Prop Temperature=Temp));
run;
 
proc sgplot data=TempData;
scatter x=Proportion y=Temperature / transparency=0.8;
scatter x=Prop y=Temp / markerattrs=(color='gray' symbol=CircleFilled) legendlabel="Winter 2011-12";
loess x=Proportion y=Temperature / 
      smooth=0.167 nomarkers lineattrs=(thickness=4px) legendlabel="Loess smoother";
yaxis grid; 
title "Temperature in Albany, NY (1995-2012)";
run;

A few aspects of the plot are interesting:

  • Only about 27% of this past winter's temperatures are below the average temperature, as determined by the loess smoother. This indicates that the Albany winter was warmer than usual—a result that was not apparent in the polar graph.
  • The smoother enables you to read the average temperatures for each season.

The observant reader might wonder about the value of the smoothing parameter in the LOESS statement. The smoothing value is 61/365 = 0.167, which was chosen so that the mean temperature of a date in the center of the plot is predicted by using a weighted fit of temperatures for 30 days prior to the date and 30 days after the date. If you ask the LOESS procedure to compute the smoothing value for these data according to the AICC or GCV criterion, both criteria tend to oversmooth these data.

Creating a periodic smoother

The loess curve for these data is very close to being periodic....but it isn't quite. The value at Proportion=0 and Proportion=1 are almost the same, but the slopes are not. For other periodic data that I've examined, the fitted values have not been equal.

Why does this happen? Because of what are sometimes called "edge effects." The loess algorithm is different near the extremes of the data than it is in the middle. In the middle of that data, about k/2 points on either side of x are used to predict the value at x. However, for observations near Proportion=0, the algorithm uses about k points to the right of x, and for observations near Proportion=1, the loess algorithm uses about k points to the left of x. This asymmetry leads to the loess curve being aperiodic, even when the data are periodic.

Tomorrow I will show how to create a really, truly, honestly periodic smoother for these data.

tags: Data Analysis, SAS Programming, Statistical Graphics
4月 042012
 

Over at the SAS Discussion Forums, someone asked how to use SAS to fit a Poisson distribution to data. The questioner asked how to fit the distribution but also how to overlay the fitted density on the data and to create a quantile-quantile (Q-Q) plot.

The questioner mentioned that the UNIVARIATE procedure does not fit the Poisson distribution. That is correct: the UNIVARIATE procedure fits continuous distributions, whereas the Poisson distribution is a discrete distribution. Nevertheless, you can fit Poisson data and visualize the results by combining several SAS procedures. This article shows one way to accomplish this. The method also works for other discrete distributions such as the negative binomial and the geometric distribution.

Do I receive emails at a constant rate?

For data I will use the number of emails that I received one day for each of 19 half-hour periods from 8:00 am to 5:30 pm. If I receive emails at a constant rate during the day, the number of emails in each 30-minute period follows a Poisson distribution. The following DATA step defines the data; PROC FREQ tabulates and plots the sample distribution:

/* number of emails received in each half-hour period
   8:00am - 5:30pm on a weekday. */
data MyData;
input N @@;
/* 8a 9a  10a 11a 12p 1p  2p  3p  4p  5p */
datalines;
   7 7 13 9 8 8 9 9 5 6 6 9 5 10 4 5 3 8 4
;
run;
/* Tabulate counts and plot data */
proc freq data=MyData;
tables N / out=FreqOut plots=FreqPlot(scale=percent);
run;

The mean of the data is about 7. A Poisson(7) distribution looks approximately normal—which these data do not. On the other hand, there are less than 20 observations in the data, so let's proceed with the fit. (I actually looked at several days of email before I found a day that I could model as Poisson, so these data are NOT a random sample!)

Fit the data

The first step is to fit the Poisson parameter to the data. You can do this in PROC GENMOD by by using the DIST= option to specify a Poisson distribution. Notice that I do not specify any explanatory variables, which means that I am fitting the mean of the data.

/* 1. Estimate the rate parameter with PROC GENMOD: 
     http://support.sas.com/kb/24/166.html */
proc genmod data=MyData;
   model N = / dist=poisson;
   output out=PoissonFit p=lambda;
run;

At this point you should look at the goodness-of-fit and parameter estimates tables that PROC GENMOD creates to see how well the model fits the data. I will skip these steps.

Compute the fitted density

The P= option on the OUTPUT statement outputs the mean, which is also the parameter estimate for the fitted Poisson distribution. The mean is about 7.1. The following statements set a macro variable to that value and create a data set (PMF) that contains the Poisson(7.1) density for various x values. In a subsequent step, I'll overlay this fitted density on the empirical density.

/* 2. Compute Poisson density for estimated parameter value */
/* 2.1 Create macro variable with parameter estimate */ 
data _null_;
set PoissonFit;
call symputx("Lambda", Lambda);
stop;
run;
 
/* 2.2 Use PDF function for range of x values */
data PMF;
do t = 0 to 13; /* 0 to max(x) */
   Y = pdf("Poisson", t, &Lambda);
   output;
end;
run;

Overlay the empirical and fitted densities

I want to overlay the discrete density on a bar chart of the data. One way to visualize the discrete density is as a scatter plot of (x, pdf(x)) values that represent the fitted density at x=0, 1,...,13. Unfortunately, you cannot use the VBAR and the SCATTER statements in the same SGPLOT call to overlay a bar chart and a scatter plot. However, in SAS 9.3 you can use the VBARPARM statement together with the SCATTER statement. (Thanks to "PGStats" for this suggestion.) The VBARPARM statement requires that you compute the heights of the bars yourself, but the heights are easily constructed from the PROC FREQ output that was created earlier:

/* 3. Use bar chart to plot data. To overlay a bar chart and 
      scatter plot, use the VBARPARM stmt instead of VBAR. */
data Discrete;
merge FreqOut PMF;
Prop = Percent / 100; /* convert to same scale as PDF */
run;
 
/* 3.2 Overlay VBARPARM and scatter plot of (x, pdf(x)) */
proc sgplot data=Discrete; /* VBARPARM is SAS 9.3 stmt */
   vbarparm category=N response=Prop / legendlabel='Sample';
   scatter x=T y=Y / legendlabel='PMF'
      markerattrs=GraphDataDefault(symbol=CIRCLEFILLED size=10);
   title "Emails per 30-Minute Period and Poisson Distribution";
run;

Create a discrete Q-Q plot

On the Discussion Forum, the questioner asked for a quantile-quantile plot. I don't know whether I've ever seen a Q-Q plot for a discrete distribution before; usually they are shown for continuous distributions. However, you can create a discrete Q-Q plot by following exactly the same steps that I described in my previous article on how to compute a Q-Q plot:

/* 4. Create a Q-Q plot */
/* 4.1 Compute theoretical quantiles */
proc sort data=MyData; by N; run;    /* 1 */
data QQ;
set MyData nobs=nobs;
v = (_N_ - 0.375) / (nobs + 0.25);   /* 2 */
q = quantile("Poisson", v, &Lambda); /* 3 */
run;
 
proc sgplot data=QQ noautolegend;    /* 4 */
scatter x=q y=N;
lineparm x=0 y=0 slope=1; /* SAS 9.3 statement */
xaxis label="Poisson Quantiles" grid; 
yaxis label="Observed Data" grid;
title "Poisson Q-Q Plot of Emails";
run;

I've created a discrete Q-Q plot, but is it useful? A drawback appears to be that the discrete Q-Q plot suffers from overplotting, whereas a continuous Q-Q plot does not. A continuous CDF function is one-to-one so the quantiles of the ranks of the data are unique. In contrast, the CDF function for a discrete distribution is a step function, which leads to duplicated quantiles and overplotting.

For example, in the discrete Poisson Q-Q plot for my email, there are 19 observations, but only 13 points are visible in the Q-Q plot due to overplotting. If I analyze 10 days of my email traffic, I could get 190 observations, but the Q-Q plot might show only a fraction of those points. (In simulated data, there were only 25 unique values in 190 observations drawn from a Poisson(7) distribution.)

The fact that I don't often see discrete Q-Q plots bothered me, so I did a little research. I found a reference to discrete Q-Q plots on p. 126 of Computational Statistics Handbook with MATLAB where it says:

Quantile plots...are primarily used for continuous data. We would like to have a similar technique for graphically comparing the shapes of discrete distributions. Hoaglin and Tukey [1985] developed several plots to accomplish this [including] the Poissonness plot.

That sounds interesting! A future blog post will present an alternative way to visualize the fit of a Poisson model.

tags: Data Analysis, SAS Programming, Statistical Programming
4月 022012
 

Locating missing values is important in statistical data analysis. I've previously written about how to count the number of missing values for each variable in a data set. In Base SAS, I showed how to use the MEANS or FREQ procedures to count missing values. In the SAS/IML language, I demonstrated the COUNTN and COUNTMISS functions that were introduced in SAS/IML 9.22.

But did you know that you can also use the COUNTN and COUNTMISS functions to determine the number of missing values in each observation? This can be useful in an analysis (such as regression) in which you need to delete observations that contain one or more missing value. Or it can be useful to determine observations that contain missing values for a large number of variables.

To begin, let's count missing values in the SAS DATA step by using the CMISS function.

Count missing values in the DATA step

For the SAS DATA step, there is a SAS Knowledge Base article that shows how to use the CMISS function to count missing values in observations. The following example uses an array of variables and the CMISS function to count the numbers of missing values in each observation:

/* define 6 x 3 data set; rows 2, 4, and 5 contain missing values */
data Missing;
input A B C;
datalines;
2 1 1
4 . .
1 3 1
. 6 1
. 1 .
3 4 2
;
 
data A;
set Missing;
array vars(3) A--C; /* contiguous vars */
numMissing = cmiss(of vars[*]);
run;
 
proc print; run;

It is the ARRAY statement that makes the CMISS function convenient. If the variables are contiguous in the data set (as in this example), you can use the double-dash notation (A -- C) to specify the variables. If your variables share a common prefix, you can use the colon wildcard character (:) to specify the variables. For other useful techniques to specify an array of variable names, see the SUGI 30 paper, "Techniques for Effectively Selecting Groups of Variables" by Stuart Pollack.

Count missing values in each row in the SAS/IML language

In one of my first blog posts, I showed how to use the SAS/IML language to remove observations with missing values. However, that post was written prior to the release of SAS/IML 9.22, so now there is an easier way that uses the COUNTMISS function. The COUNTMISS function has an optional second parameter that determines whether the function returns the total number of missing values, the number in each column, or the number in each row. The following statements define a matrix with missing values and count the number of missing values in each row:

proc iml;
use Missing;
read all var _NUM_ into x;
close Missing;
 
rowMiss = countmiss(x, "ROW"); /* returns 0,1,2 or 3 for each row */
print rowmiss;

The output shows that the returned value is a vector with the same number of rows as x. The vector contains the number of missing values in each row of x.

It is now easy to use the LOC function (another function that I've written about often) to find the rows that contain missing values:

/* which rows have one or more missing values? */
jdx = loc(rowMiss>0);
print jdx;

Similarly, it is easy to extract the subset of rows that contain no missing values:

idx = loc(rowMiss=0);
x = x[idx,]; /* delete rows that contain a missing value */

The x data matrix is the listwise deletion of the original data. You can now use the x matrix in statistical data analysis in PROC IML.

tags: Data Analysis, Getting Started, SAS Programming, Statistical Programming
3月 232012
 

Earlier this week I described a common programming pattern in the SAS macro language. The pattern sets up a loop for processing each distinct value of a classification variable. The program uses the PROC SQL SELECT INTO feature to populate SAS macro variables. The effect: you can roll your own BY processing in SAS.

I posted the example and received many useful comments, including news about new features in SAS 9.3 that allow us to improve this pattern. Instead of two SELECT statements (one to gather a count, and a second to populate a range of variables), we can now achieve the same with just one SELECT statement (and thus, just one pass through the data):

/* SAS 9.3 approach */
/* Create macro vars with values and a total count of distinct values */
proc sql noprint;
  select distinct TYPE into :varVal1- from SASHELP.CARS;
  %let varCount = &SQLOBS.;
quit;

With the new SAS 9.3 syntax, we can now specify an open-ended range for macro variable names. SAS will allocate as many macro variables as are needed to store each value. The values will be TRIMMED of whitespace by default, but you can control this by specifying NOTRIM before the FROM keyword.

Since we don't need to know the count ahead of time, we can then safely use &SQLOBS after the SELECT to determine the upper bound of our index.

You can also specify leading zeros in the macro variable range, which can help with sorting the ordinals later. Here's a revised example:

proc sql noprint;
  /* using leading zero */
  select distinct MAKE into :varVal01- from SASHELP.CARS;
  %let varCount = &SQLOBS.;
quit;

The accompanying image shows the resulting macro variables in the SAS Macro Variable Viewer, neatly collated in numerical order. (Remember that if you do add the leading zeros, you may need to use the Zw.d format when building up the macro variable names within a loop.)

tags: macro programming, sas 9.3, SAS programming, sql