
July 8, 2011
One of the joys of statistics is that you can often use different methods to estimate the same quantity. Last week I described how to compute a parametric density estimate for univariate data and how to use the parameter estimates to compute the area under the probability density function (PDF). This article describes how to compute a kernel density estimate (KDE) and how to integrate that curve numerically.

Some practicing statisticians (especially those over a certain age!) learned parametric modeling techniques in school, but later incorporated nonparametric techniques into their arsenal of statistical tools. A very common nonparametric technique is kernel density estimation, in which the PDF at a point x is estimated by a weighted average of the sample data in a neighborhood of x. Kernel density estimates are widely used, especially when the data appear to be multi-modal.
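For intuition, here is a minimal sketch of a Gaussian KDE outside of SAS (Python; the function name, bandwidth value, and sample values are mine, chosen only for illustration):

```python
import math

def kde_at(x, data, h):
    """Gaussian kernel density estimate at the point x.

    Each data point contributes a Gaussian bump of width h;
    the estimate is the average of the bumps evaluated at x.
    """
    n = len(data)
    total = sum(math.exp(-0.5 * ((x - xi) / h) ** 2) / math.sqrt(2 * math.pi)
                for xi in data)
    return total / (n * h)

# A bimodal sample: the KDE is higher near the two clusters than between them
sample = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
```

Evaluating kde_at over a grid of x values traces out the smooth density curve; the bandwidth h plays a role analogous to the C= option in PROC UNIVARIATE's KERNEL option.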

Kernel Density Estimation in SAS

As before, consider the cholesterol of 2,873 women in the Framingham Heart Study as recorded in the SAS data set SASHELP.Heart, which is distributed with SAS/GRAPH software in SAS 9.2. You can use the UNIVARIATE procedure to construct a kernel density estimate for the cholesterol variable.

The kernel density estimator has a parameter (called the bandwidth) that determines the size of the neighborhood used to compute the estimate. Small values of the bandwidth result in wavy, wiggly KDEs, whereas large values result in smooth KDEs. The UNIVARIATE procedure supports various methods that select the bandwidth automatically. For this example, I use the Sheather-Jones plug-in (SJPI) bandwidth.

The following statements use the OUTKERNEL= option to output the kernel density estimate at a sequence of regularly spaced values, and the VSCALE= option to set the vertical scale of the histogram to the density scale:

data heart;
set sashelp.heart; where sex="Female";
run;

ods graphics on;
proc univariate data=heart;
var cholesterol;
histogram / vscale=proportion
          kernel(C=SJPI) outkernel=density;
run;


The nonparametric kernel density estimate exhibits features that are not captured by parametric models. For example, the KDE for these data shows several bumps and asymmetries that are not captured by the assumption that the data are normally distributed.

The DENSITY data set contains six variables. The _VALUE_ variable represents the X locations for the density curve and the _DENSITY_ variable represents the Y locations. Therefore the area under the density estimate curve can be computed by numerically integrating those two variables.

Overcoming a Bug in the OUTKERNEL= Option

Unfortunately, there is a bug in the OUTKERNEL= option in SAS 9.2 that results in the DENSITY data set containing two (identical) copies of the KDE curve! (The bug is fixed in SAS 9.3.) Here's how to remove the second copy: find the maximum value of the X variable and delete all observations that come after it. The maximum value of X ought to be the last observation in the DENSITY data set. The following statements read the density curve into the SAS/IML language and eliminate the second copy of the curve:

proc iml;
/** read in kernel density estimate **/
use density;
read all var {_value_} into x;
read all var {_density_} into y;
close density;

/** work around bug in OUTKERNEL data set in SAS 9.2 **/
/** remove second copy of data **/
max = max(x);
idx = loc(x=max);
if ncol(idx)>1 then do;
  x = x[1:idx[1]];  y = y[1:idx[1]];
end;

The previous statements remove the second copy of the data if it exists, but do nothing if the second copy does not exist. Therefore it is safe to run the statements in all versions of SAS.
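The workaround logic is language-independent: locate the first occurrence of the maximum X value and truncate both vectors there. A sketch of the same find-the-max-and-truncate idea in Python (function and variable names are mine):

```python
def remove_second_copy(x, y):
    """If the (x, y) curve appears twice, keep only the first copy.

    The first occurrence of the maximum x value marks the end of the
    first copy; everything after it is discarded. If there is no
    duplicate, the data are returned unchanged.
    """
    cut = x.index(max(x)) + 1   # index just past the first occurrence of max(x)
    return x[:cut], y[:cut]

# Two identical copies of a curve, back to back
x = [0, 1, 2, 3, 0, 1, 2, 3]
y = [0.1, 0.4, 0.4, 0.1, 0.1, 0.4, 0.4, 0.1]
x2, y2 = remove_second_copy(x, y)
```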

The Area under a Kernel Density Estimate Curve

The kernel density estimate is not a simple function. The points on the curve are available only as a sequence of (X, Y) pairs in the DENSITY data set. Consequently, to integrate the KDE you need to use a numerical integration technique. For this example, you can use the TrapIntegral module from my blog post on the trapezoidal rule of integration.

The following statements define the TrapIntegral module and verify that the KDE curve is correctly defined by integrating the curve over the entire range of X values. The area under a density estimate should be 1. If the area is not close to 1, then something is wrong:

start TrapIntegral(x,y);
   N = nrow(x);
   dx    =   x[2:N] - x[1:N-1];
   meanY = ( y[2:N] + y[1:N-1] )/2;
   return( dx` * meanY );
finish;

/** compute the integral over all x **/
Area = TrapIntegral(x,y);
print Area;
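If you want to sanity-check the TrapIntegral module outside SAS, the same rule is a few lines of plain Python (a sketch; the function name is mine):

```python
def trap_integral(x, y):
    """Trapezoidal rule: area under the piecewise-linear curve through (x, y).

    Mirrors the SAS/IML TrapIntegral module: for each interval, multiply
    the interval width by the average of the y values at its endpoints.
    """
    return sum((x[i + 1] - x[i]) * (y[i + 1] + y[i]) / 2
               for i in range(len(x) - 1))

# Check on a known integral: the area under y = x on [0, 1] is 0.5
xs = [i / 100 for i in range(101)]
ys = xs
```

Because the trapezoidal rule is exact for piecewise-linear integrands, the check returns 0.5 up to floating-point rounding.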

Ahh! Success! The area under the KDE is 1 to within an acceptable tolerance. Now to the real task: compute the area under the KDE curve for certain medically interesting values of cholesterol. In particular, the following statements estimate the probability that the cholesterol of a random woman in the population is less than 200 mg/dL (desirable), or greater than 240 mg/dL (high):

/** compute the integral for x < 200 **/
idx = loc(x<200);
lt200 = TrapIntegral(x[idx],y[idx]);

/** compute the integral for x >= 240 **/
idx = loc(x>=240);
ge240 = TrapIntegral(x[idx],y[idx]);
print lt200 ge240;

If you believe that the KDE for these data is a good estimate of the distribution of cholesterol in women in the general population, the computations show that about 29% of women have healthy cholesterol levels, whereas 36% have high levels of cholesterol. These are approximately the same values that were computed by using empirical probability estimates, as shown in my original post on this topic. They differ by about 2% from the parametric model, which assumes normally distributed data.

July 7, 2011
If you create a scatter plot of highly correlated data, you will see little more than a thin cloud of points. Small-scale relationships in the data might be masked by the correlation. For example, Luke Miller recently posted a scatter plot that compares the body temperature of snails when they extend or withdraw their foot. This scatter plot appears in Miller's research article in the Biological Bulletin.

For Miller's data, the Pearson correlation coefficient is 0.99 and the data fall close to (but above) the identity line. This makes sense, because the temperature of a cold-blooded snail will be primarily dependent on the ambient temperature, and only secondarily related to whether the snail's foot is extended or withdrawn.

The scatter plot is reproduced below. (Ignore the red arrows for now.) As I looked at the graph, I thought that I could discern some additional structure in the data. It looks like the difference between the snails' body temperatures is more pronounced when the ambient temperature is high. Consequently, I thought it might be interesting to plot the difference between the two variables.

Changing the Coordinate System

You could create a histogram of the differences between the variables, but that would not show how the difference might depend on the ambient temperature. Instead, you can change coordinate systems. The variable that represents the "difference between the temperatures" increases in the direction shown by the upper-left pointing arrow. The "ambient temperature" variable increases in the direction shown by the arrow that points to the upper right.

The coordinate system shown by the red arrows is a rotation by 45 degrees of the standard coordinate system. The matrix that rotates a coordinate system through a counterclockwise angle t is given by R = {cos(t) sin(t) // -sin(t) cos(t)}. For t = π/4, the matrix becomes a scalar multiple of R = {1 1 // -1 1}. Therefore, the rotation sends the original coordinates (x, y) to the new coordinates (x+y, y-x), up to a common scalar multiple.

The quantity x+y might be hard to explain to a biological audience, but if you divide by two, you get (x+y)/2 which you can interpret as the average between the snails' temperature when the foot is extended and when it is withdrawn. You can use this value as a proxy for the ambient temperature. The quantity y-x is, of course, the difference between the temperature when the foot is withdrawn and when it is extended.
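In code, the rotated coordinates are just sums and differences. The following sketch (Python; the temperature values are invented, not Miller's data) shows the (average, difference) transform and checks it against the 45-degree rotation matrix:

```python
import math

def to_avg_diff(x, y):
    """Rotate (x, y) into (average, difference) coordinates.

    (x + y)/2 is a proxy for the ambient temperature; y - x is the
    temperature difference between foot-withdrawn and foot-extended.
    """
    return (x + y) / 2.0, y - x

def rotate45(x, y):
    """The same point via the counterclockwise 45-degree rotation matrix."""
    c = math.cos(math.pi / 4)
    s = math.sin(math.pi / 4)
    return c * x + s * y, -s * x + c * y
```

rotate45 returns ((x+y)/√2, (y−x)/√2), which is proportional to the (average, difference) pair, confirming that the new plot is just a rotation (and rescaling) of the original.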

Plotting the difference against the average gives the following graph, which shows much more detail than the original:

This graph reveals three facts that were not apparent from the original presentation of the data:

  1. The position of the foot makes more of a difference when the ambient temperature is high than when it is low.
  2. There are many observations for which the position of the foot does not matter (the difference is zero).
  3. There are three observations for which the difference is negative. Is this a data entry error? Is there a biological explanation for these anomalous values?
In summary, if you are graphing highly correlated data, you can change coordinates in order to better show the difference between the variables. This is nothing more than "tilting your head" to look at the data in a new way, but the results can be dramatic.

Statistical readers will notice that this is a special case of principal component analysis: the "average direction" is the first principal component; the "difference direction" is the second principal component.

July 6, 2011
In a previous article, I discussed random jittering as a technique to reduce overplotting in scatter plots. The example used data that are rounded to the nearest unit, although the idea applies equally well to ordinal data in general.

The act of jittering (adding random noise to data) is a statistical irony: statisticians spend most of their day trying to "remove" noise from data, but jittering puts noise back in!

Personally, I rarely jitter data. I prefer to visualize the data as they are, but I acknowledge that there are situations in which jittering gives a better "feel" for the data. To help you decide on whether or not to jitter, here are some pros and cons to jittering.

Arguments in Favor of Jittering
  • Jittering reduces overplotting in ordinal data or data that are rounded.
  • Jittering helps you to better visualize the density of the data and the relationship between variables.
  • Jittering can help you to find clusters in the data. (Use a small scale parameter for this case.)

Arguments Against Jittering
  • Jittering adds random components to variables, which means that there is not a unique way to jitter.
  • The size of the random component is not easy to automate, but requires domain-specific knowledge. For example, are data recorded to the nearest unit or the nearest half-unit?
  • The distribution of the random component is not always clear. In the iris data, I jittered by using random variables from a uniform distribution. But suppose a variable records the Richter scale intensity of earthquakes (rounded to the nearest 0.1). Should you use a uniform distribution to jitter these data? Probably not, because the Richter scale is a logarithmic scale, and because earthquakes with lower intensities occur more frequently than earthquakes with higher intensities.
  • If the X and Y variables are related (for example, highly correlated), jittering each variable independently might result in a graph in which the visual impact of the relationship is less apparent. Think about the extreme case where X and Y are exactly linearly related: adding independent noise to each variable results in a graph in which the linear relationship is not as strong.
Can you think of any arguments that I did not mention? Do you think the arguments in favor of jittering outweigh the arguments against it? Are there other visualization techniques that you prefer instead of jittering? Weigh in on this issue by posting a comment.

July 5, 2011
Jittering. To a statistician, it is more than what happens when you drink too much coffee.

Jittering is the act of adding random noise to data in order to prevent overplotting in statistical graphs. Overplotting can occur when a continuous measurement is rounded to some convenient unit. This has the effect of changing a continuous variable into a discrete ordinal variable. For example, age is measured in years and body weight is measured in pounds or kilograms. If you construct a scatter plot of weight versus age for a sufficiently large sample of people, there might be many people recorded as, say, 29 years and 70 kg, and therefore many markers plotted at the point (29, 70).

To alleviate overplotting, you can add small random noise to the data. The size of the noise is often chosen to be the width of the measurement unit. For example, to the value 70 kg you might add the quantity u, where u is a uniform random variable in the interval [-0.5, 0.5]. You can justify jittering by assuming that the true weight of a 70 kg person is equally likely to be anywhere in the interval [69.5, 70.5].

The context of the data is important when deciding how to jitter. For example, ages are typically rounded down: a 29-year-old person might be celebrating her 29th birthday today, or might be turning 30 tomorrow, but she is still recorded as 29 years old. Therefore you might jitter an age by adding the quantity v, where v is a uniform random variable in the interval [0,1]. (We ignore the statistically significant case of women who remain 29 for many years!)
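The two rules just described, symmetric noise for weights recorded to the nearest unit and one-sided noise for rounded-down ages, can be sketched as follows (Python rather than SAS; the function names are mine):

```python
import random

def jitter_weight(kg):
    """Weight is recorded to the nearest kg: add uniform noise in [-0.5, 0.5)."""
    return kg + (random.random() - 0.5)

def jitter_age(years):
    """Age is rounded down, so add uniform noise in [0, 1)."""
    return years + random.random()
```

By construction, rounding a jittered weight or truncating a jittered age recovers the recorded value, so the jittered data are consistent with the original measurements.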

There are other reasons that markers are overplotted in scatter plots, including having many thousands of data points. Jittering does not prevent this kind of overplotting, but you can use transparency to help alleviate overplotting in large data sets.

Plotting Data That Are Rounded to the Nearest Unit

The SASHELP.Iris data set, which is distributed as part of SAS 9.2, illustrates the problem of overplotting and was used by Chambers, Cleveland, Kleiner, and Tukey (1983, p. 107, Graphical Methods for Data Analysis) to discuss jittering. The following graph shows the widths and lengths of the petals for 50 flowers of the species Iris setosa. Also included are a regression line and the 95% confidence band for the mean of the width variable. This graph is created automatically by the REG procedure.

Someone who looks at the graph might ask two questions:

  1. There are 50 observations. Why are only 22 visible in the scatter plot?
  2. Why does the regression line appear so low and flat? A casual observer might expect the "tall" points at (16, 6) and (17, 5) to "pull up" the regression line.
Both of these questions are resolved by pointing out that there are repeated measurements in the data, and therefore the graph suffers from overplotting. Yes, there are 50 observations, but only 22 unique pairs of values. Furthermore, most of the duplicate points have values with PetalWidth = 2 mm. These values are below the regression line, which is why the line looks too low.

These questions are less likely to be asked if you jitter the data.

Jittering Data in SAS

There are many techniques and approaches to jittering data. The simplest is to add random uniform noise to each variable, as shown in the following DATA step:

data iris(drop=s);
set sashelp.iris(where=(Species="Setosa"));
/** add random uniform noise in [-0.5, 0.5] **/
s = 1; /** scale factor **/
jPetalWidth = PetalWidth + s*(ranuni(1)-0.5);
jPetalLength = PetalLength + s*(ranuni(1)-0.5);
label jPetalWidth = "Petal Width (mm)"
      jPetalLength= "Petal Length (mm)";
run;

proc sgplot data=iris;
title "Scatter Plot of Jittered Data";
scatter x=jPetalLength y=jPetalWidth;
run;

I set the "scale" of this problem to 1 because for these data I can justify adding a random variable in the interval [-0.5, 0.5]. However, you can also prevent overplotting by jittering at a smaller scale. (For example, to visualize clusters in the data, you might want to set s to a value such as 0.2.) You can also use separate scales for the X and Y variables, if necessary.

Jittering creates a fictitious set of petal lengths and widths that, when rounded, agree with the original data. The graph shows that all 50 observations are now visible. (There is a slight overplotting of two points near (15,2).) However, the scatter plot no longer shows the recorded data.

Analyze the Recorded Data; Display the Jittered Data

Jittering is primarily a data visualization technique. If you want to display the result of a statistical analysis (such as the regression analysis shown earlier), you should run the analysis on the original data but overlay the analysis on the jittered data. This type of graph is not created automatically by SAS statistical procedures, but you can create it with the SGPLOT procedure: use a statistical procedure to create an output data set, and use the SGPLOT procedure to overlay the results and the jittered data. For example, the following statements compute the regression analysis on the original data, but display the results on the jittered data:

proc sort data=iris; by PetalLength; run;

proc reg data=iris; 
   model PetalWidth = PetalLength; /** orig vars **/
   output out=RegOut P=Pred lclm=LCLM uclm=UCLM;
run; quit;

proc sgplot data=RegOut;
   title "Scatter Plot of Jittered Data";
   title2 "Regression of Non-Jittered Data";
   band  x=PetalLength lower=LCLM upper=UCLM / 
      legendlabel="95% Confidence Limits";
   series x=PetalLength y=Pred / legendlabel="Fit";
   scatter x=jPetalLength y=jPetalWidth;
run;

The regression line is exactly the same as in the first graph, but it no longer looks "too low" because it is displayed on top of jittered data.

One-Dimensional Jittering

In addition to jittering scatter plots, it is common to jitter one-dimensional dot plots. For dot plots, a "systematic jittering" is sometimes used. (See Theus and Urbanek (2009), Interactive Graphics for Data Analysis, pp. 31–32. Also Chambers et al. (1983), p. 20.) In systematic jittering, no random noise is used. Instead, repeated values are offset so that all observations are visible. This approach only works for small data sets, but it can be an effective display for teaching statistics because the dot plot can be overlaid on a box plot. Kleinman and Horton show an example of systematic jittering on their SAS and R blog.

Other Approaches to Jittering

For the iris data, I use 1 for the scale factor because that is the rounding unit. Chambers, Cleveland, Kleiner, and Tukey (1983) present a second option for jittering data: they choose a scale that depends on the extent of the data. If R=max(x)–min(x) is the range of x, they use 4% or 10% of R as the scale factor. In their words, "we want the fraction to be big enough to alleviate the overlap but not so big as to seriously corrupt the data."

Another approach that is sometimes used is to let d be the smallest distance between unique values of x, and set the scale to be d/5.
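Both scale heuristics are easy to compute from the data. A sketch (Python; the function names and sample values are mine, and the 4% fraction and d/5 rule are as quoted above):

```python
def range_scale(x, fraction=0.04):
    """Chambers et al.: jitter scale as a fraction (4%-10%) of the data range."""
    return fraction * (max(x) - min(x))

def gap_scale(x):
    """Alternative rule: one fifth of the smallest gap between unique values."""
    u = sorted(set(x))
    d = min(b - a for a, b in zip(u, u[1:]))
    return d / 5.0

data = [10, 10, 12, 15, 15, 20]
```

For these sample values, the range rule gives 0.04 × (20 − 10) = 0.4, and the gap rule gives 2/5 = 0.4, but in general the two heuristics can disagree substantially.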

Michael Friendly has a SAS %JITTER macro for which you can specify the units appropriate for each variable. Is jittering repeated values a good idea? In my next blog post, I discuss some arguments for and against jittering.

July 1, 2011
The area under a density estimate curve gives information about the probability that an event occurs. The simplest density estimate is a histogram, and last week I described a few ways to compute empirical estimates of probabilities from histograms and from the data themselves, including how to construct the empirical cumulative distribution function (ECDF).

This article describes how to estimate the density by fitting a parametric curve (for example, a normal curve) to the observed data.

Fitting a Parametric Density Estimate to Data

You can use the UNIVARIATE procedure to fit the parameters in a model to data. The UNIVARIATE procedure supports a large number of well-known distributions; it not only fits the model for you (typically by maximum likelihood estimation), but can also graph the results by using ODS statistical graphics.

I will use the same example that I used in my previous article: the cholesterol of 2,873 women in the Framingham Heart Study as recorded in the SAS data set SASHELP.Heart, which is distributed with SAS/GRAPH software in SAS 9.2. The cholesterol data are approximately normal, so I will fit a normal curve to the data to illustrate the technique. You can use the UNIVARIATE procedure to create three relevant items:

  • a histogram with a normal density curve fitted to the data
  • a graph of the parametric and empirical CDF (not shown)
  • a table of parameter estimates
The parameter estimates are necessary in order to compute the area under the parametric PDF, so the following statements use the ODS OUTPUT statement to write the ParameterEstimates table to a SAS data set named PE:

data FHeart;
set sashelp.heart; where sex="Female";
run;

ods graphics on;
proc univariate data=FHeart;
   ods output ParameterEstimates=PE;
   var cholesterol;
   histogram / vscale=proportion normal;
   cdfplot / normal;
run;

The UNIVARIATE procedure creates the parameter estimates table, as follows:

The Area under a Parametric Density Estimate Curve

You can use the parameter estimates to estimate the probability that a random woman has cholesterol less than 200 mg/dL (desirable) and greater than 240 mg/dL (high). Simply read in the parameter estimates into the SAS DATA step or the SAS/IML language, and use the CDF function in Base SAS to integrate the fitted curve. The following SAS/IML statements read the parameter estimates and compute two areas: the area under the fitted curve that is less than 200, and the area that is greater than 240:

proc iml;
use PE;
read all var {Estimate};
close PE;

mu = Estimate[1];
sigma = Estimate[2];
lt200 = cdf("Normal", 200, mu, sigma);
ge240 = 1 - cdf("Normal", 240, mu, sigma);
print lt200 ge240;

The value of LT200 represents an estimate of the probability that a random woman in the population has cholesterol less than 200. The value of GE240 is an estimate of the probability that a random woman has cholesterol greater than 240. These estimates assume that the distribution of cholesterol in women in the population is normally distributed, which is probably not the best model, based on the skewness of the histogram and the goodness-of-fit test that PROC UNIVARIATE displays (not shown here).

Notice that to compute GE240, you use the CDF function to compute the area less than 240 and then subtract that value from the total area under the curve, which is 1.
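Outside SAS, the call CDF("Normal", x, mu, sigma) corresponds to the standard normal CDF Φ((x − μ)/σ), which can be written in terms of the error function. In this Python sketch, the parameter values are placeholders for illustration, not the PROC UNIVARIATE estimates:

```python
import math

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2), computed via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Hypothetical parameter values, for illustration only
mu, sigma = 225.0, 45.0
lt200 = normal_cdf(200, mu, sigma)       # area to the left of 200
ge240 = 1 - normal_cdf(240, mu, sigma)   # area to the right of 240
```

Note the same pattern as in the SAS/IML program: the right-tail area is 1 minus the CDF evaluated at the cutoff.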

This parametric method estimates that 27% of women have cholesterol less than 200, and 40% have cholesterol greater than 240. Although the normal model might not be suitable for these data, these parametric estimates are not too different from the empirical estimates that were computed in the previous article.

June 21, 2011
A reader commented to me that he wants to use the HISTOGRAM statement of the SGPLOT procedure to overlay two histograms on a single plot. He could do it, but unfortunately SAS was choosing a large bin width for one of the variables and a small bin width for the other variable. "The figure looks odd because the bin widths vary so much," he wrote, "so I would like to [set] the width."

He asked me whether it is possible to control the bin width of a histogram from the HISTOGRAM statement. The answer is "not in SAS 9.2, but stay tuned for SAS 9.3!"

So what can you do in SAS 9.2? You can control the histogram bin width by using the Graph Template Language (GTL).

Defining a Template That Overlays Two Histograms

To illustrate this approach, I'll overlay histograms of the SEPALLENGTH and PETALLENGTH variables in the SASHELP.IRIS data set. Most of the statements for the following template are explained in the getting started example in the GTL documentation:

proc template;
define statgraph dualhist;
begingraph;
   entrytitle "Petal and Sepal Lengths"; /** optional title **/
   layout overlay / xaxisopts=(label="Length");
      /** first plot: a histogram **/
      histogram PetalLength / name="PetalLength" binwidth=5;
      /** second plot: a semi-transparent histogram **/
      histogram SepalLength / name="SepalLength"
          binwidth=5 datatransparency=0.7
          fillattrs=(color=GraphData2:color);
      /** optional: add legend by specifying names **/
      discretelegend "PetalLength" "SepalLength";
   endlayout;
endgraph;
end;
run;

For this particular template:

  • The LAYOUT OVERLAY statement specifies that the graph consists of two plots, one on top of the other, and a legend.
  • The first HISTOGRAM statement specifies that the first plot is a histogram of the PETALLENGTH variable. The BINWIDTH= option specifies that the histogram should use a bin width of 5.
  • The second HISTOGRAM statement specifies that the second plot is a histogram of the SEPALLENGTH variable. Again, the histogram should use a bin width of 5. Furthermore, the second histogram should have semi-transparent bars that are filled with a different color. Which color? The second color in a pre-defined list of colors.
  • The DISCRETELEGEND statement adds a legend that associates the colors to the variables.
The RUN statement compiles the template and stores it under the name DUALHIST. The template is stored in the default template folder, but no graph is produced at this time.

In order to create (or "render") the graph, you need to call the SGRENDER procedure. You must provide PROC SGRENDER with the name of the data set and the name of the template, as follows:

proc sgrender data=sashelp.iris template=dualhist;
run;


As shown in the image, the second histogram (pink color) is overlaid on the first. Because the second histogram is semi-transparent, you can see the first histogram underneath. Furthermore, where the two histograms intersect, the color is purple, which is an additive mixture of the blue and pink colors.

You can learn more about the Graph Template Language if you decide to write your own templates. I also recommend the book Statistical Graphics in SAS: An Introduction to the Graph Template Language and the Statistical Graphics Procedures by my colleague, Warren Kuhfeld.

June 17, 2011
Each Sunday, my local paper has a starburst image on the front page that proclaims "Up to $169 in Coupons!" (The value changes from week to week.) One day I looked at the image and thought, "Does the paper hire someone to count the coupons? Is this claim a good estimate of the actual value of all coupons in the paper? What is the exact value of the coupons in today's paper?"

Furthermore, I wondered which coupons the image refers to. I assumed that it refers to the contents of special inserts such as SmartSource™ Magazine and P&G BrandSaver. I call these inserts coupon circulars.

There are other interesting statistical questions that you can ask about Sunday coupons, such as "What is the distribution of the values of coupons? Are $0.50 coupons the most prevalent or are larger values the norm?"

Data Collection

To research these questions, I collected data about the coupons in coupon circulars for 12 Sundays in early 2011. I recorded three kinds of coupons:

  1. General coupons that can be redeemed at any store and that have a certain value (such as $0.50 or $1.00) when you buy one or more specified items.
  2. Coupons that reward you with a free item when you buy a certain number of like items at the regular price. These "buy one, get one" (BOGO) coupons do not have a fixed value, but depend on the store price of the item. The coupon is valid up to some specified maximum price.
  3. The coupon circulars also include a small number of store-specific coupons such as for Red Lobster, Boston Market, or Baskin Robbins. Because these coupons are part of the coupon circulars, I counted them as well.
I did not count coupons that were part of a store-specific ad, such as for Target or Walgreens.

You can download the data and a SAS program that analyzes the data.

The Distribution of Coupon Values

As I recorded the values of coupons each week, it soon became clear that $1.00 coupons appear most frequently. The following graph shows the distribution of denominations of coupons in my Sunday newspaper:

A few statistics are apparent:

  • $1.00 coupons are the most prevalent, followed by $0.50 and $2.00 coupons.
  • There are 24 unique denominations of coupons in the data, but the six most common denominations account for more than 80% of the coupons.
  • Denominations increase by $0.05 between $0.25 and $0.75, although it is relatively rare to find a $0.70 coupon! Between $0.75 and $1.50, denominations increase by $0.25. Between $1.50 and $4.00, denominations increase by $0.50. Higher denominations are a multiple of $1.00.

Claim versus Reality: How Much Can You Really Save?

The following table sums up the total values of coupons in three categories: general coupons, BOGO coupons, and the store coupons that are intermingled with the other two categories.

A few facts are apparent:

  • Week 9 is an outlier in that the claimed value ($208) is much, much less than the actual value of the coupons. Did the paper make a typographical error? Did they intend to print "Up to $280 in coupons" or "Up to $308 in coupons"? Another alternative: there were two coupon circulars that week, one with $204 worth of coupons and the other with $115. Did someone fail to tabulate the coupons in the second circular?
  • Week 8 is the only week in which the total value of the coupons did not meet or exceed the advertised claim.
  • For two weeks (Weeks 3 and 10), the "claimed value" is about the same as the values of coupons in the General category.
  • For most weeks (Weeks 1, 2, 4, 6, 7, and 12), the "claimed value" is exceeded by the sum of the values of coupons in the General and BOGO categories. That is, you do not need to redeem store-specific coupons in order to meet or exceed the claimed value.
The following graph shows the actual maximum possible savings plotted against the claimed values. The line on the graph is the identity line. For eleven out of twelve weeks, the possible savings exceeded the claimed value. For two weeks, the actual values of the coupons far exceeded the claimed value.

Predicting the Actual Coupons Values from the Advertised Claim

The previous graph indicates a linear relationship between the actual and the claimed values of the coupons. It is interesting to try to use linear regression to predict the total value of the coupons based on the claimed (starburst) values. However, the graph also shows two observations that have high-leverage values for the regression (Weeks 1 and 10) and one value that is an outlier (Week 9). Consequently, you might decide to use the ROBUSTREG procedure to fit a robust regression line to the data. (This is carried out in the downloadable program for least trimmed squares (LTS) regression.)

The LTS regression model (shown in the following graph) states that the expected value of the Sunday coupons is about $11 more than the advertised claim: predicted = $10.87 + claim. In other words, if the newspaper advertises "up to $181 in coupons," the model predicts $192 in coupons.
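Applying the quoted model is simple arithmetic; as a sketch (Python; the intercept is the $10.87 from the LTS fit above, and the slope of 1 is implied by "predicted = $10.87 + claim"):

```python
def predicted_value(claim, intercept=10.87, slope=1.0):
    """Predicted actual coupon value from the advertised (starburst) claim."""
    return intercept + slope * claim

# "Up to $181 in coupons" -> a predicted actual value of about $192
```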

In summary:
  • The value of coupons in my local Sunday newspaper over a 12-week period varied from $43 in one week to over $400 another week.
  • The most frequent coupon denomination is $1.00.
  • Although the newspaper claims a specific value for the coupons each Sunday, it is not clear how that number is obtained.
  • The claimed value usually differs from the actual value by less than $20, but the claimed value twice underestimated the actual value by $70 or more.
  • According to a robust regression model, the expected value of the actual coupon value is $11 more than the claimed value.
I'm sure there is much more that can be said about these data. Feel free to continue the analysis and let me know what you find!

June 3, 2011
In a previous blog post, I presented a short SAS/IML function module that implements the trapezoidal rule. The trapezoidal rule is a numerical integration scheme that gives the integral of a piecewise linear function that passes through a given set of points.

This article demonstrates an application of the trapezoidal rule: computing the area under a receiver operating characteristic (ROC) curve.

ROC Curves
Many statisticians and SAS programmers who are familiar with logistic regression have seen receiver operating characteristic (ROC) curves. The ROC curve indicates how well you can discriminate between two groups by using a continuous variable. If the area under an ROC curve is close to 1, the model discriminates well; if the area is close to 0.5, the model is no better than random guessing.

Let Y be the binary response variable that indicates the two groups. Let X be a continuous explanatory variable. In medical applications, for example, Y might indicate the presence of a disease and X might indicate the level of a certain chemical or hormone. For this blog post, I will use a more whimsical example. Let X indicate the number of shoes that a person has, and let Y indicate whether the person is female.

The following data show the results of a nonscientific survey of 15 friends and family members. Each person was asked to state approximately how many pairs of shoes (5, 10, ..., 30+) he or she owns. For each category, the data show the number of females in that category and the total number of people in that category:

data shoes;
input Shoes Females N;
datalines;
 5 0 1
10 1 3
15 1 2
20 3 4
25 3 3
30 2 2
;
run;

An easy way to generate an ROC curve for these data is to use the LOGISTIC procedure. You can get the actual values on the ROC curve by using the OUTROC= option on the MODEL statement:

ods graphics on;
proc logistic data=shoes plots(only)=roc;
   ods select ROCcurve;
   model Females / N = Shoes / outroc=roc;
run;

Notice that the graph has a subtitle that indicates the area under the ROC curve. If you want to check the result yourself, the points on the ROC curve are contained in the ROC data set:

proc print data=roc;
   var _1mSpec_ _Sensit_;
run;

I used these points in my previous blog post. You can refer to that post to verify that, indeed, the area under the ROC curve is 0.88, as computed by the SAS/IML implementation of the trapezoidal rule.

By the way, the area under the ROC curve is closely related to another statistic: the Gini coefficient. Murphy Choi writes about computing the Gini coefficient in SAS in a recent issue of VIEWS News, a newsletter published by members of the international SAS programming community. The Gini coefficient is related to the area under the ROC curve (AUC) by the formula G = 2 * AUC - 1, so you can extend the program in my previous post to compute the Gini coefficient by using the following SAS/IML statement:

Gini = 2 * TrapIntegral(x,y) - 1;
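For context, here is a sketch of a SAS/IML module that is consistent with the trapezoidal rule (the module in the previous post may differ in its details): the area is the sum of the trapezoid areas, each being the subinterval width times the average of the two heights.

```sas
proc iml;
/** Trapezoidal rule: area under the piecewise linear function
    through the points (x[i], y[i]). A sketch, not necessarily
    identical to the module in the earlier post. **/
start TrapIntegral(x, y);
   xx = colvec(x);  yy = colvec(y);
   N = nrow(xx);
   dx    = xx[2:N] - xx[1:N-1];        /** subinterval widths **/
   meanY = (yy[2:N] + yy[1:N-1]) / 2;  /** average heights **/
   return( dx` * meanY );              /** sum of trapezoid areas **/
finish;

/** check: area under y = x on [0,1] is 1/2 **/
x = do(0, 1, 0.25);
y = x;
area = TrapIntegral(x, y);
print area;   /** 0.5 **/
```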

How well does the number of shoes predict gender in my small sample? The answer is "moderately well." The logistic model for these data predicts that a person who owns fewer than 15 pairs of shoes is more likely to be male than female. A person with more than 20 pairs of shoes is likely to be female. The area under the ROC curve (0.88) is fairly close to 1, which indicates that the model discriminates between males and females fairly well.

The logistic model is summarized by the following plot (created automatically by PROC LOGISTIC), which shows the predicted probability of being female, given the number of pairs of shoes owned. Notice the wide confidence limits that result from the small sample size.

May 20, 2011
Many people know that the SGPLOT procedure in SAS 9.2 can create a large number of interesting graphs. Some people also know how to create a panel of graphs (all of the same type) by using the SGPANEL procedure. But did you know that you can also create a panel of graphs of different types in SAS by writing a template that describes how to lay out each plot within the panel? Even better, there is a gallery of pre-written templates, so for many situations you don't have to write (or even understand) the Graph Template Language (GTL). You can simply copy a pre-written template!

This blog post shows how to create a scatter plot with marginal histograms in SAS by copying a pre-written template.

Galleries of Statistical Graphics

When I want to create a plot that is somewhat complicated, the first thing I do is to look in the SAS/GRAPH Graphics Gallery. In particular, I use the galleries for the ODS Statistical Graphics (SG) procedures:

The graph that I want to produce is in the PROC SGRENDER gallery (Sample 35172), which links to a SAS Knowledge Base article on how to use the Graph Template Language (GTL) to produce a distribution plot.

How to Create a Scatter Plot with Marginal Histograms

Using the SAS Knowledge Base article as a guide, the following steps create a scatter plot with marginal histograms (download the program):

  1. Click on the Full Code tab to display the SAS program that generates the graph.
  2. Copy the first call to the TEMPLATE procedure into your SAS session:

    proc template;
      define statgraph scatterhist;

  3. Modify the example code, if necessary, to fit your needs.
  4. Run the PROC TEMPLATE code to create the template.
  5. Copy the PROC SGRENDER code at the end of the program and modify it to run on your data. For example, the following statements call PROC SGRENDER to produce a scatter plot with marginal histograms on the Height and Weight variables in the SasHelp.Class data set:

    /** create panel of plots using the ScatterHist template **/
    ods graphics;
    proc sgrender data=SasHelp.Class template=scatterhist;  
      dynamic YVAR="Weight" XVAR="Height" 
              TITLE="Height-Weight Relationship";
    run;

The SGRENDER procedure uses the ScatterHist template to lay out the scatter plot and histograms, as shown below (click to enlarge):

In this example, I modified the GTL template in a minor way, but I also could have used the template as it is. You can learn more about the Graph Template Language if you decide to modify templates or to write your own templates. I also recommend the book Statistical Graphics in SAS: An Introduction to the Graph Template Language and the Statistical Graphics Procedures by my colleague, Warren Kuhfeld.
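If you are curious what such a template looks like, the following is a stripped-down sketch in the spirit of the ScatterHist sample (not the sample code itself): a 2 x 2 lattice in which the top and right cells hold the marginal histograms and the corner cell is left empty.

```sas
proc template;
define statgraph scatterhist;
   dynamic XVAR YVAR TITLE;
   begingraph;
      entrytitle TITLE;
      layout lattice / rows=2 columns=2
                       rowweights=(0.2 0.8) columnweights=(0.8 0.2)
                       rowdatarange=union columndatarange=union;
         layout overlay;                  /** cell (1,1): top histogram **/
            histogram XVAR;
         endlayout;
         layout overlay; endlayout;       /** cell (1,2): empty corner **/
         layout overlay;                  /** cell (2,1): scatter plot **/
            scatterplot x=XVAR y=YVAR;
         endlayout;
         layout overlay;                  /** cell (2,2): rotated histogram **/
            histogram YVAR / orient=horizontal;
         endlayout;
      endlayout;
   endgraph;
end;
run;
```

The DATARANGE=UNION options align the axes of the marginal histograms with the corresponding axes of the scatter plot.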

May 16, 2011
A fundamental operation in data analysis is finding data that satisfy some criterion. How many people are older than 85? What are the phone numbers of the voters who are registered Democrats? These questions are examples of locating data with certain properties or characteristics.

The SAS DATA step has a variety of statements (such as the WHERE and IF statements) that enable statistical programmers to locate observations and subset data. The SAS/IML language has similar language features, but because data is often stored in SAS/IML matrices, the SAS/IML language also has a function that is not available in the DATA step: the LOC function.

The LOC Function

If your data are in a SAS/IML matrix, x, the LOC function enables you to find elements of x for which a given criterion is true. The LOC function returns the LOCations (indices) of the relevant elements. (In the R language, the which function implements similar functionality.) For example, the following statements define a numeric vector, x, and use the LOC function to find the indices for which the numbers are greater than 3:

proc iml;
x = {1 4 3 5 2 7 3 5};
/** which elements are > 3? **/
k = loc( x>3 ); 
print k;

Notice the following:

  • The argument to the LOC function is an expression that resolves to a vector of 0s and 1s. (Some languages call this a logical vector.) In practice, the argument to the LOC function is almost always an expression.
  • The result of the LOC function is always a row vector. The number of columns is the number of elements of x that satisfy the given criterion.
  • The LOC function returns indices of x, not values of x. To obtain the values, use x[k]. (Indices and subscripts are related; for vectors, they are the same.)
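Continuing the example, you can use the indices in k to extract the values themselves:

```sas
v = x[k];   /** values of x greater than 3 **/
print v;    /** 4 5 7 5 **/
```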

How Many Elements Satisfy the Criterion?

You can exploit the fact that the LOC function outputs a row vector. To count the number of elements that satisfy the criterion, simply use the NCOL function, as follows:

n = ncol(k); /** how many? **/
print n;

What If No Elements Satisfy the Criterion?

The expression ncol(j) always tells you the number of elements that satisfy the criterion, even when no elements satisfy it. The following statements ask for the elements larger than 100 and handle both possible results:

j = loc( x>100 );
if ncol(j) > 0 then do;
   print "At least one element found";
   /** handle this case **/
end;
else do;
   print "No elements found";
   /** handle alternate case **/
end;

In the preceding example, x does not contain any elements that are greater than 100. Therefore the matrix j is an empty matrix, which means that j has zero rows and zero columns. It is a good programming practice to check the results of the LOC function to see if any elements satisfied the criterion. For more details, see Chapter 3 of Statistical Programming with SAS/IML Software.

Using the LOC Function to Subset a Vector

The LOC function finds the indices of elements that satisfy some criterion. These indices can be used to subset the data. For example, the following statements read information about vehicles in the SasHelp.Cars data set. The READ statement creates a vector that contains the make of each vehicle ("Acura," "Audi," "BMW,"...) and creates a second vector that contains the engine size (in liters) for each vehicle. The LOC function is used to find the indices for the vehicles made by Acura. These indices are then used to subset the EngineSize vector in order to produce a vector, s, that contains only the engine volumes for the Acura vehicles:

use SasHelp.Cars;
read all var {Make EngineSize};
close SasHelp.Cars;

/** find observations that 
    satisfy a criterion **/
idx = loc( Make="Acura" );
s = EngineSize[idx];
print s[label="EngineSize (Acura)"];

EngineSize (Acura)

LOC = Efficient SAS Programming

I have called the LOC function the most useful function that most DATA step programmers have never heard of. Despite its relative obscurity, it is essential that SAS/IML programmers master the LOC function. By using the LOC function, you can write efficient vectorized programs, rather than inefficient programs that loop over data.
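For example, the following statements count the elements of a vector that exceed a cutoff, first with an explicit loop and then with the vectorized LOC approach:

```sas
proc iml;
x = {1 4 3 5 2 7 3 5};

/** looping over elements works, but is slow for large vectors **/
count = 0;
do i = 1 to ncol(x);
   if x[i] > 3 then count = count + 1;
end;

/** the vectorized equivalent: one statement, no loop **/
count2 = ncol(loc(x > 3));
print count count2;   /** both are 4 **/
```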