
June 17, 2011
Each Sunday, my local paper has a starburst image on the front page that proclaims "Up to $169 in Coupons!" (The value changes from week to week.) One day I looked at the image and thought, "Does the paper hire someone to count the coupons? Is this claim a good estimate of the actual value of all coupons in the paper? What is the exact value of the coupons in today's paper?"

Furthermore, I wondered which coupons the image refers to. I assumed that it refers to the contents of special inserts such as SmartSource™ Magazine and P&G BrandSaver. I call these inserts coupon circulars.

There are other interesting statistical questions that you can ask about Sunday coupons, such as "What is the distribution of the values of coupons? Are $0.50 coupons the most prevalent or are larger values the norm?"

Data Collection

To research these questions, I collected data about the coupons in coupon circulars for 12 Sundays in early 2011. I recorded three kinds of coupons:

  1. General coupons that can be redeemed at any store and that have a certain value (such as $0.50 or $1.00) when you buy one or more specified items.
  2. Coupons that reward you with a free item when you buy a certain number of like items at the regular price. These "buy one, get one" (BOGO) coupons do not have a fixed value, but depend on the store price of the item. The coupon is valid up to some specified maximum price.
  3. The coupon circulars also include a small number of store-specific coupons such as for Red Lobster, Boston Market, or Baskin Robbins. Because these coupons are part of the coupon circulars, I counted them as well.
I did not count coupons that were part of a store-specific ad, such as for Target or Walgreens.

You can download the data and a SAS program that analyzes the data.

The Distribution of Coupon Values

As I recorded the values of coupons each week, it soon became clear that $1.00 coupons appear most frequently. The following graph (click to enlarge) shows the distribution of denominations of coupons in my Sunday newspaper:

A few statistics are apparent:

  • $1.00 coupons are the most prevalent, followed by $0.50 and $2.00 coupons.
  • There are 24 unique denominations of coupons in the data, but the six most common denominations account for more than 80% of the coupons.
  • Denominations increase by $0.05 between $0.25 and $0.75, although it is relatively rare to find a $0.70 coupon! Between $0.75 and $1.50, denominations increase by $0.25. Between $1.50 and $4.00, denominations increase by $0.50. Higher denominations are a multiple of $1.00.

Claim versus Reality: How Much Can You Really Save?

The following table sums up the total values of coupons in three categories: general coupons, BOGO coupons, and the store coupons that are intermingled with the other two categories.

A few facts are apparent:

  • Week 9 is an outlier in that the claimed value ($208) is much, much less than the actual value of the coupons. Did the paper make a typographical error? Did they intend to print "Up to $280 in coupons" or "Up to $308 in coupons"? Another possibility: there were two coupon circulars that week, one with $204 worth of coupons and the other with $115. Did someone fail to tabulate the coupons in the second circular?
  • Week 8 is the only week in which the total value of the coupons did not meet or exceed the advertised claim.
  • For two weeks (Weeks 3 and 10), the "claimed value" is about the same as the value of coupons in the General category.
  • For most weeks (Weeks 1, 2, 4, 6, 7, and 12), the "claimed value" is exceeded by the sum of the values of coupons in the General and BOGO categories. That is, you do not need to redeem store-specific coupons in order to meet or exceed the claimed value.
The following graph shows the actual maximum possible savings plotted against the claimed values. The line on the graph is the identity line. For eleven out of twelve weeks, the possible savings exceeded the claimed value. For two weeks, the actual values of the coupons far exceeded the claimed value.

Predicting the Actual Coupons Values from the Advertised Claim

The previous graph indicates a linear relationship between the actual and the claimed values of the coupons. It is interesting to try to use linear regression to predict the total value of the coupons based on the claimed (starburst) values. However, the graph also shows two observations that have high-leverage values for the regression (Weeks 1 and 10) and one value that is an outlier (Week 9). Consequently, you might decide to use the ROBUSTREG procedure to fit a robust regression line to the data. (This is carried out in the downloadable program for least trimmed squares (LTS) regression.)

The LTS regression model (shown in the following graph) states that the expected value of the Sunday coupons is about $11 more than the advertised claim: predicted = $10.87 + claim. In other words, if the newspaper advertises "up to $181 in coupons," the model predicts $192 in coupons.

In summary:
  • The value of coupons in my local Sunday newspaper over a 12-week period varied from $43 in one week to over $400 another week.
  • The most frequent coupon denomination is $1.00.
  • Although the newspaper claims a specific value for the coupons each Sunday, it is not clear how that number is obtained.
  • The claimed value usually differs from the actual value by less than $20, but the claimed value twice underestimated the actual value by $70 or more.
  • According to a robust regression model, the expected value of the actual coupon value is $11 more than the claimed value.
I'm sure there is much more that can be said about these data. Feel free to continue the analysis and let me know what you find!
June 3, 2011
In a previous blog post, I presented a short SAS/IML function module that implements the trapezoidal rule. The trapezoidal rule is a numerical integration scheme that gives the integral of a piecewise linear function that passes through a given set of points.

This article demonstrates an application of the trapezoidal rule: computing the area under a receiver operating characteristic (ROC) curve.
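In case the earlier post isn't handy, the trapezoidal rule is only a few lines in any language. Here is a minimal sketch in Python; it is an illustration of the same idea, not the SAS/IML module itself:

```python
def trap_integral(x, y):
    """Integrate the piecewise linear function through the points
    (x[i], y[i]) by summing the area of each trapezoid."""
    area = 0.0
    for i in range(1, len(x)):
        area += 0.5 * (x[i] - x[i - 1]) * (y[i] + y[i - 1])
    return area

# The trapezoidal rule is exact for linear functions:
# the area under y = x on [0, 1] is exactly 1/2.
print(trap_integral([0.0, 0.5, 1.0], [0.0, 0.5, 1.0]))  # 0.5
```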

ROC Curves
Many statisticians and SAS programmers who are familiar with logistic regression have seen receiver operating characteristic (ROC) curves. The ROC curve indicates how well you can discriminate between two groups by using a continuous variable. If the area under an ROC curve is close to 1, the model discriminates well; if the area is close to 0.5, the model is not any better than randomly guessing.

Let Y be the binary response variable that indicates the two groups. Let X be a continuous explanatory variable. In medical applications, for example, Y might indicate the presence of a disease and X might indicate the level of a certain chemical or hormone. For this blog post, I will use a more whimsical example. Let X indicate the number of shoes that a person has, and let Y indicate whether the person is female.

The following data indicates the results of a nonscientific survey of 15 friends and family members. Each person was asked to state approximately how many pairs of shoes (5, 10, ..., 30+) he or she owns. For each category, the data show the number of females in that category and the total number of people in that category:

data shoes;
input Shoes Females N;
datalines;
 5 0 1
10 1 3
15 1 2
20 3 4
25 3 3
30 2 2
;
run;

An easy way to generate an ROC curve for these data is to use the LOGISTIC procedure. You can get the actual values on the ROC curve by using the OUTROC= option on the MODEL statement:

ods graphics on;
proc logistic data=shoes plots(only)=roc;
   ods select ROCcurve;
   model Females / N = Shoes / outroc=roc;
run;

Notice that the graph has a subtitle that indicates the area under the ROC curve. If you want to check the result yourself, the points on the ROC curve are contained in the ROC data set:

proc print data=roc;
   var _1mSpec_ _Sensit_;
run;

I used these points in my previous blog post. You can refer to that post to verify that, indeed, the area under the ROC curve is 0.88, as computed by the SAS/IML implementation of the trapezoidal rule.

By the way, the area under the ROC curve is closely related to another statistic: the Gini coefficient. Murphy Choi writes about computing the Gini coefficient in SAS in a recent issue of VIEWS News, a newsletter published by members of the international SAS programming community. The Gini coefficient is related to the area under the ROC curve (AUC) by the formula G = 2 * AUC – 1, so you can extend the program in my previous post to compute the Gini coefficient by using the following SAS/IML statement:

Gini = 2 * TrapIntegral(x,y) - 1;
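The same relationship is easy to check outside SAS. Here is a minimal Python sketch (the AUC value 0.88 is the one computed for the shoe data):

```python
def gini_from_auc(auc):
    """Convert the area under an ROC curve (AUC) to a Gini coefficient."""
    return 2 * auc - 1

# For the shoe data, AUC = 0.88, so the Gini coefficient is about 0.76.
print(round(gini_from_auc(0.88), 2))  # 0.76
```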

How well does the number of shoes predict gender in my small sample? The answer is "moderately well." The logistic model for these data predicts that a person who owns fewer than 15 pairs of shoes is more likely to be male than female. A person with more than 20 pairs of shoes is likely to be female. The area under the ROC curve (0.88) is fairly close to 1, which indicates that the model discriminates between males and females fairly well.

The logistic model is summarized by the following plot (created automatically by PROC LOGISTIC), which shows the predicted probability of being female, given the number of pairs of shoes owned. Notice the wide confidence limits that result from the small sample size.

5月 202011
Many people know that the SGPLOT procedure in SAS 9.2 can create a large number of interesting graphs. Some people also know how to create a panel of graphs (all of the same type) by using the SGPANEL procedure. But did you know that you can also create a panel of graphs of different types in SAS by writing a template that describes how to lay out each plot within the panel? Even better, there is a gallery of pre-written templates, so for many situations you don't have to write (or even understand) the Graph Template Language (GTL). You can simply copy a pre-written template!

This blog post shows how to create a scatter plot with marginal histograms in SAS by copying a pre-written template.

Galleries of Statistical Graphics

When I want to create a plot that is somewhat complicated, the first thing I do is look in the SAS/GRAPH Graphics Gallery. In particular, I use the galleries for the ODS Statistical Graphics (SG) procedures.

The graph that I want to produce is in the PROC SGRENDER gallery (Sample 35172), which links to a SAS Knowledge Base article on how to use the Graph Template Language (GTL) to produce a distribution plot.

How to Create a Scatter Plot with Marginal Histograms

Using the SAS Knowledge Base article as a guide, the following steps create a scatter plot with marginal histograms (download the program):

  1. Click on the Full Code tab to display the SAS program that generates the graph.
  2. Copy the first call to the TEMPLATE procedure into your SAS session:

    proc template;
      define statgraph scatterhist;

  3. Modify the example code, if necessary, to fit your needs. (I made only minor changes.)
  4. Run the PROC TEMPLATE code to create the template.
  5. Copy the PROC SGRENDER code at the end of the program and modify it to run on your data. For example, the following statements call PROC SGRENDER to produce a scatter plot with marginal histograms of the Height and Weight variables in the SasHelp.Class data set:

    /** create panel of plots using the ScatterHist template **/
    ods graphics on;
    proc sgrender data=SasHelp.Class template=scatterhist;  
      dynamic YVAR="Weight" XVAR="Height" 
              TITLE="Height-Weight Relationship";
    run;

The SGRENDER procedure uses the ScatterHist template to lay out the scatter plot and histograms, as shown below (click to enlarge):

In this example, I modified the GTL template in a minor way, but I also could have used the template as it is. You can learn more about the Graph Template Language if you decide to modify templates or to write your own templates. I also recommend the book Statistical Graphics in SAS: An Introduction to the Graph Template Language and the Statistical Graphics Procedures by my colleague, Warren Kuhfeld.

May 16, 2011
A fundamental operation in data analysis is finding data that satisfy some criterion. How many people are older than 85? What are the phone numbers of the voters who are registered Democrats? These questions are examples of locating data with certain properties or characteristics.

The SAS DATA step has a variety of statements (such as the WHERE and IF statements) that enable statistical programmers to locate observations and subset data. The SAS/IML language has similar language features, but because data is often stored in SAS/IML matrices, the SAS/IML language also has a function that is not available in the DATA step: the LOC function.

The LOC Function

If your data are in a SAS/IML matrix, x, the LOC function enables you to find the elements of x for which a given criterion is true. The LOC function returns the LOCations (indices) of the relevant elements. (In the R language, the which function implements similar functionality.) For example, the following statements define a numeric vector, x, and use the LOC function to find the indices for which the numbers are greater than 3:

proc iml;
x = {1 4 3 5 2 7 3 5};
/** which elements are > 3? **/
k = loc( x>3 ); 
print k;

Notice the following:

  • The argument to the LOC function is an expression that resolves to a vector of 0s and 1s. (Some languages call this a logical vector.) In practice, the argument to the LOC function is almost always an expression.
  • The result of the LOC function is always a row vector. The number of columns is the number of elements of x that satisfy the given criterion.
  • The LOC function returns indices of x, not values of x. To obtain the values, use x[k]. (Indices and subscripts are related; for vectors, they are the same.)
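The same idiom is easy to mimic in other languages. Here is a pure-Python sketch of a LOC-like function; note that Python indices are zero-based, whereas SAS/IML indices are one-based:

```python
x = [1, 4, 3, 5, 2, 7, 3, 5]

# A stand-in for the LOC function: return the indices of the
# elements that satisfy the criterion.
def loc(values, criterion):
    return [i for i, v in enumerate(values) if criterion(v)]

k = loc(x, lambda v: v > 3)
print(k)                  # [1, 3, 5, 7]  (SAS/IML reports 2 4 6 8)
print([x[i] for i in k])  # the values themselves: [4, 5, 7, 5]
```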

How Many Elements Satisfy the Criterion?

You can exploit the fact that the LOC function outputs a row vector. To count the number of elements that satisfy the criterion, simply use the NCOL function, as follows:

n = ncol(k); /** how many? **/
print n;

What If No Elements Satisfy the Criterion?

The expression ncol(j) always tells you the number of elements that satisfy the criterion, even when no elements satisfy it. The following statements ask for the elements larger than 100 and handle both possible results:

j = loc( x>100 );
if ncol(j) > 0 then do;
   print "At least one element found";
   /** handle this case **/
end;
else do;
   print "No elements found";
   /** handle alternate case **/
end;

In the preceding example, x does not contain any elements that are greater than 100. Therefore the matrix j is an empty matrix, which means that j has zero rows and zero columns. It is a good programming practice to check the results of the LOC function to see if any elements satisfied the criterion. For more details, see Chapter 3 of Statistical Programming with SAS/IML Software.

Using the LOC Function to Subset a Vector

The LOC function finds the indices of elements that satisfy some criterion. These indices can be used to subset the data. For example, the following statements read information about vehicles in the SasHelp.Cars data set. The READ statement creates a vector that contains the make of each vehicle ("Acura," "Audi," "BMW,"...) and creates a second vector that contains the engine size (in liters) for each vehicle. The LOC function is used to find the indices for the vehicles made by Acura. These indices are then used to subset the EngineSize vector in order to produce a vector, s, that contains only the engine volumes for the Acura vehicles:

use SasHelp.Cars;
read all var {Make EngineSize};
close SasHelp.Cars;

/** find observations that 
    satisfy a criterion **/
idx = loc( Make="Acura" );
s = EngineSize[idx];
print s[label="EngineSize (Acura)"];

LOC = Efficient SAS Programming

I have called the LOC function the most useful function that most DATA step programmers have never heard of. Despite its relative obscurity, it is essential that SAS/IML programmers master the LOC function. By using the LOC function, you can write efficient vectorized programs, rather than inefficient programs that loop over data.

May 13, 2011
For years I've been making presentations about SAS/IML software at conferences. Since 2008, I've always mentioned to SAS customers that they can call R from within SAS/IML software. (This feature was introduced in SAS/IML Studio 3.2 and was added to the IML procedure in SAS/IML 9.22.) I also included a chapter on calling R in my book, Statistical Programming with SAS/IML Software.

However, I've never blogged about it until today. Why? Frankly, I don't usually have a reason to call R from SAS/IML software. Both R and SAS/IML are high-level languages with a rich run-time library of functions. They both enable you to extend the language by writing and sharing user-defined functions. They both enable you to use matrix computations to compactly represent and solve statistical problems and to analyze data. I use SAS/IML software for all of my day-to-day computational needs.

However, sometimes I hear about a technique that someone has implemented in R and I want to try that technique too. If I'm going to use the technique many times, I'll consider implementing it myself in the SAS/IML language. However, if I'm only going to use it once (or I'm not sure how often I'll use it), I'll save myself time and call R.

Earlier this week, I showed some plots of airline routes that a colleague created in SAS, which are based on similar plots (created in R) that appeared on the Flowing Data blog. In my blog post I said:

I don't have the time right now to implement a great circle algorithm in SAS/IML. ... I could also generate the great arcs by using the same R package that Flowing Data uses.

Basically, I was being lazy. However, after Flowing Data posted the R code to plot great arcs, I no longer had an excuse not to use great arcs. This is a situation in which calling R will save me time: I want to compute arcs of great circles, I don't know of a comparable function already written in SAS, and I'll rarely use this functionality in the future. So, I wrote a short SAS/IML program that calls an R function to compute the arcs.

Read on if you want to learn about how to call R from SAS. If you just want to see the final result, here it is (click to enlarge):

How to Call R from a SAS/IML Program

To call R, install R 2.11 or earlier (see the footnote at the end of this article) on the same computer that runs SAS, and install the R package that contains the function you want. The function that I want to call is the gcIntermediate function in the geosphere package, so I installed the package as described on the Flowing Data blog.

In general, a SAS/IML program that calls R contains four steps:

  1. Transfer data from a SAS data set or a SAS/IML matrix into a similar data structure in R.
  2. Call R by using the SUBMIT statement with the R option.
  3. Transfer results from R into a SAS data set or a SAS/IML matrix.
  4. Use the results in SAS.
I'll discuss each step in turn.

Step 1: Transfer Data from SAS to R

For ease of presentation, assume that there is a data set called DeltaFlights that contains airline routes for Delta airlines. (The general case of multiple airlines is handled similarly.) The data contains the following variables:

  • Origin_Long and Origin_Lat contain the longitude and latitude of the origin airport.
  • Dest_Long and Dest_Lat contain the longitude and latitude of the destination airport.
I can use the ExportDataSetToR subroutine to create an R data frame from a SAS data set, or I can use the ExportMatrixToR subroutine to transfer data from a SAS/IML matrix into an R matrix.

Because I like to work in the SAS/IML environment, I'll choose the second option. The following statements read the data into SAS/IML vectors or matrices:

/** requires SAS/IML 9.22 or SAS/IML Studio 3.2 **/ 
libname MyLib "C:\Users\...\MyData";
proc iml;
use MyLib.DeltaFlights; /** 376 routes **/
  read all var {origin_long origin_lat} into mOrig;
  read all var {dest_long dest_lat} into mDest;
close MyLib.DeltaFlights;

The matrices mOrig and mDest contain 376 rows and two columns. The following statements transfer data from the two matrices into R matrices of the same dimensions:

/** copy SAS/IML matrices to R **/
run ExportMatrixToR(mOrig, "orig");
run ExportMatrixToR(mDest, "dest");

The result is two R matrices named orig and dest. Each row of orig and each row of dest contains the longitude and latitude of an airport. (The statements also start R if it is not already running.)

Step 2: Call R to Generate the Great Arcs

The call to R is straightforward:

/** get points on great arc between airports **/
submit / R;
   dist <- gcIntermediate(orig, dest)
endsubmit;

The resulting R object, named dist, is a list of 376 matrices. Each matrix has 50 rows and two columns. The ith matrix represents the longitude and latitude of 50 points along a great arc that connects the ith row of orig and the ith row of dest.

Step 3: Transfer the Results

You can use the ImportMatrixFromR subroutine to copy the data from dist into a SAS/IML matrix named distance:

/** get arcs back from R **/
run ImportMatrixFromR(distance, "dist");

The distance matrix has 50 rows and 2 x 376 columns. The first two columns correspond to dist[[1]], the next two to dist[[2]], and so forth.

Step 4: Use the Results in SAS

These results are intended to be overlaid on a map. To visualize the flight paths in SAS/IML Studio, I can use the following IMLPlus statements, which are similar to the mapping examples in my 2008 SAS Global Forum Paper:

/** create map in SAS/IML Studio **/
a = shape(distance[1,], 0, 2); /** airports **/
declare ScatterPlot p;
p = ScatterPlot.Create("flights", a[,1], a[,2]);
do i = 1 to ncol(distance)/2;
   p.DrawLine(distance[,2*i-1], distance[,2*i]);
end;

/** draw map in background (see SAS Global Forum
    Paper) then zoom in on US **/

However, this particular computation was for my colleague, Robert Allison, who uses SAS/GRAPH software to visualize the flight paths. Therefore, I wrote the arcs to a SAS data set and let him use PROC GMAP with the ANNOTATE= option to create the image seen earlier in this article.

1. As I've said elsewhere, R changed its directory structure between versions 2.11 and 2.12. Consequently, SAS 9.22 (which shipped before R 2.12 was released) looks for certain DLLs in directories that no longer exist. The workaround is to use R 2.11 with SAS 9.22. This workaround is not required for SAS 9.3.
May 11, 2011
When Charlie H. posted an interesting article titled Top 10 most powerful functions for PROC SQL, there was one item on his list that was unfamiliar: the COALESCE function.

Ever since I posted my first response, "SAS/IML Operations Compared with PROC SQL: Part I," I had been dreading writing an article on the COALESCE function. Why? Because SAS/IML does not have a COALESCE function, so I assumed I'd have to write a SAS/IML module that mimics the functionality. Although writing a module is easy, I wasn't sure what the COALESCE function did, so I would have to do research.

However, a recent Tip of the Day by Mike Zdeb reminded me of a programming tip that I always recommend to others: you can call Base SAS functions from the SAS/IML language. In particular, Zdeb showed me that the DATA step supports the COALESCE function. My problem was solved!

The COALESCE Function

The COALESCE function returns the first nonmissing value in a list of variables. Why this is useful is best illustrated with an example. Suppose that you have a data set that contains contact information for students. Each row represents a student. Variables include a column for the mother's cell-phone number, the father's cell-phone number, the home phone number, and various work numbers. Some of these phone numbers might be missing. The school administrator wants you to write a program that creates a new column that contains one of the phone numbers, based on the following algorithm:

  1. If the mother's cell-phone number is not missing, use it.
  2. Otherwise, if the father's cell-phone number is not missing, use it.
  3. Otherwise, if the home phone number is not missing, use it.
  4. ...(and so forth with the other phone numbers)...
The COALESCE function enables you to process these data with a single function call instead of writing multiple IF-THEN/ELSE statements.
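The algorithm itself is nothing more than a cascade of "first nonmissing value wins" checks. A minimal Python sketch (the phone numbers are made up):

```python
def coalesce(*values):
    """Return the first non-missing value, or None if all are missing."""
    for v in values:
        if v is not None:
            return v
    return None

# Hypothetical contact record: mother's cell is missing, father's is not.
print(coalesce(None, "919-555-0142", "919-555-0100"))  # 919-555-0142
print(coalesce(None, None, None))                      # None
```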

The following program provides a small example of the COALESCE function. The data set B has six observations and two variables. The PROC SQL statements use the COALESCE function to report the value of x1 if that value is not missing. Otherwise, they report the value of x2 if x2 is nonmissing. If both values are missing, then the COALESCE function returns a missing value.

data B;
input x1 x2;
datalines;
 . 11  
 .  . 
 3 13
 4  . 
 . 15
 6 16
;
run;

proc sql; 
title 'Coalesce() -- Combine column values';
select Monotonic() as obs, 
     coalesce(x1, x2) 
from B;
quit;

Because the COALESCE function is part of Base SAS, you can call it from PROC IML:

proc iml;
use B;
read all var {x1 x2};
close B;
y = coalesce(x1, x2);

The vector y contains the same values computed by PROC SQL. Notice that it is not necessary to write a loop: the COALESCE function automatically applies to each row of the argument vectors. (If you have character data, use the COALESCEC function instead.)

The lesson to learn is this: although about 300 functions and subroutines appear in the SAS/IML User's Guide, you can also use the 500 or so functions that appear in Base SAS software, as well as all of the formats, informats, and global statements. This is one reason that the SAS/IML language is a natural progression for a SAS programmer who wants to learn matrix programming: a SAS programmer can leverage existing DATA step knowledge when writing an analysis in PROC IML.

May 10, 2011
Last week the Flowing Data blog published an excellent visualization of the flight patterns of major US airlines.

On Friday, I sent the link to Robert Allison, my partner in the 2009 ASA Data Expo, which explored airline data. Robert had written a SAS program for the Expo that plots line segments for the flight routes between US cities (see Figure 8 of our joint work), and on Monday he modified his SAS program to mimic the Flowing Data graphs.

Here's one example of his efforts (click to enlarge):

He used line segments to display the flight patterns because I didn't have time to implement a great circle algorithm in SAS/IML. Sorry, Robert! (Because I can call R packages from SAS/IML software, I could also generate the great arcs by using the same R package that Flowing Data uses.)

Visit Robert's visualization gallery to see the other plots.

High-flying graphics, indeed!

May 4, 2011
More than a month ago I wrote a first article in response to an interesting article by Charlie H. titled Top 10 most powerful functions for PROC SQL. In that article I described SAS/IML equivalents to the MONOTONIC, COUNT, N, FREQ, and NMISS functions in PROC SQL.

In this article, I discuss the UNIQUE function in PROC SQL and in PROC IML.

The UNIQUE function in SQL finds the unique values of a variable. For categorical variables, it gives the levels of the variable. Charlie H. wrote a program that uses PROC SQL to count the number of levels for the Origin and Type variables in the SASHELP.CARS data set:

proc sql;
title 'Unique() -- Find levels of categorical variables';
select count(unique(Origin)) as L_origin, 
       count(unique(Type)) as L_type
from sashelp.cars;
quit;

The SAS/IML language also has a UNIQUE function. The UNIQUE function always returns a row vector that contains the unique sorted values of its argument, as shown in the following statements:

proc iml;
use sashelp.cars;
read all var {Origin Type};
close sashelp.cars;

uOrigin = unique(Origin);
uType   = unique(Type);
print uOrigin, uType;

Because you know that the output is a row vector, you can count the number of columns by using the NCOL function, as shown below:

L_Origin = ncol(uOrigin); /** = 3 **/
L_Type   = ncol(uType);   /** = 6 **/

I think that the UNIQUE function is one of the most important functions in SAS/IML because it enables you to compute statistics for each level of a categorical variable. Furthermore, the UNIQUE/LOC technique (which is described on p. 69 of my book, Statistical Programming with SAS/IML Software) is a highly useful technique that should be a part of every statistical programmer's toolbox.
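To illustrate the idea for readers who don't have the book, here is the UNIQUE/LOC pattern sketched in plain Python with made-up data (the real technique, of course, uses the UNIQUE and LOC functions in SAS/IML):

```python
origin = ["Asia", "Europe", "USA", "Asia", "USA", "Asia"]
mpg    = [ 30.0,   25.0,    22.0,  35.0,   20.0,  28.0 ]

# UNIQUE step: the sorted levels of the categorical variable.
levels = sorted(set(origin))

# LOC step: for each level, find the indices of its observations
# and compute a per-level statistic (here, the mean).
means = {}
for level in levels:
    idx = [i for i, g in enumerate(origin) if g == level]
    means[level] = sum(mpg[i] for i in idx) / len(idx)

print(means)  # {'Asia': 31.0, 'Europe': 25.0, 'USA': 21.0}
```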

April 29, 2011
In last week's article on how to create a funnel plot in SAS, I wrote the following comment:
I have not adjusted the control limits for multiple comparisons. I am doing nine comparisons of individual means to the overall mean, but the limits are based on the assumption that I'm making a single comparison.
This article discusses how to adjust the control limits (called decision limits in the GLM procedure) to account for multiple comparisons. Because the adjustments are more complicated when the group sizes are not constant, this article treats the simpler case in which each group has the same number of observations. For details on multiple comparisons, see Multiple Comparisons and Multiple Tests Using SAS (the second edition is scheduled for Summer 2011).

Example Data and ANOM Chart

In the funnel plot article, I used data for the temperatures of 52 cars. Each car was one of nine colors, and I was interested in whether the mean temperature of a group (say, black cars) was different from the overall mean temperature of the cars. The number of cars in each color group varied. However, in order to simplify the analysis, today's analysis uses only the temperatures of the first four cars of each color.

You can download the new data and all of the SAS statements used in this article. The following statements create the data and run the SAS/QC ANOM procedure to generate the ANOM chart:

ods graphics on;
proc anom data=CarTemp2;
   xchart Temperature*Color;
   label Temperature = 'Mean Temperature (F)';
   label Color = 'Car Color';
run;

Even for this smaller set of data, it is apparent that black cars are warmer than average and silver and white cars are cooler than average. You can create a similar plot by using the LSMEANS statement in the SAS/STAT GLM procedure.

Computing Decision Limits: An Overview

The formulas for computing decision limits are available in the documentation of the XCHART statement in the ANOM procedure. The decision limits have three components:

  1. The central value, y, with which you want to compare each individual group mean. Often this is the grand mean of the data.
  2. A variance term, v, which involves the root mean square error, the number of groups, and the size of each group. This term quantifies the accuracy of the comparisons.
  3. A multiplier, h, which depends on the significance level, α, and accounts for the multiple comparisons.
The upper and lower decision limits are then formed as y ± h * v. The following sections compute each component of the decision limits.
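Once the three components are known, assembling the limits is a one-liner. A Python sketch with placeholder values (the numbers for y, h, and v here are illustrative, not the values computed in the following sections):

```python
def decision_limits(y, h, v):
    """ANOM decision limits: the central value plus or minus the
    multiplier times the variance term, y +/- h*v."""
    return (y - h * v, y + h * v)

# Illustrative values only: y = grand mean, h = ANOM multiplier, v = variance term.
ldl, udl = decision_limits(123.6, 3.0, 4.0)
print(ldl, udl)
```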

Computing the Central Value

The central value is the easiest component to compute. When the group sizes are constant, the central value is merely the overall mean:

proc iml;
use CarTemp2; 
read all var {Color Temperature}; 
close CarTemp2;

/** 1. overall mean **/
y = mean(Temperature); 

This value is 123.6, as shown in the ANOM chart. The ANOM chart compares each individual group mean to this central value.

Computing the Variance Term

The second component in the computation of decision limits is the variance term. This term measures the accuracy you have when comparing group means to the overall mean. (More variance means less accuracy.) The formula involves the mean square error, which in this case is just the average of the sample variances of the nine groups. For convenience, the following statements define a SAS/IML module that computes the average variance:

/** 2. variance term **/
start MSEofGroups(g, x);
   u = unique(g); /** g is group var **/
   nGroups = ncol(u);
   v = j(1, nGroups);
   do i = 1 to nGroups;
      v[i] = var( x[loc(g=u[i])] );
   end;
   return( sum(v)/nGroups );
finish;

The module is then used to compute the variance term:

MSE = MSEofGroups(Color, Temperature);
nGroups = 9; /** or determine from data **/
size = repeat(4, nGroups); /** {4,4,...,4} **/
v = sqrt(MSE) * sqrt((nGroups-1)/sum(size));
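If you want to check the variance-term arithmetic outside of SAS, here is a minimal Python sketch of the same computation. The data values are hypothetical (three colors, four cars each), not the data from the article:

```python
import statistics
from math import sqrt

def mse_of_groups(groups, values):
    """Average of the within-group sample variances
    (equals the mean square error when all groups have the same size)."""
    levels = sorted(set(groups))
    variances = [statistics.variance([x for g, x in zip(groups, values) if g == lev])
                 for lev in levels]
    return sum(variances) / len(variances)

# Hypothetical temperatures (F) for three car colors, four cars each
groups = ["black"]*4 + ["silver"]*4 + ["white"]*4
temps  = [135, 132, 138, 133, 120, 118, 122, 121, 114, 116, 113, 115]

mse = mse_of_groups(groups, temps)
n_groups = 3
v = sqrt(mse) * sqrt((n_groups - 1) / len(temps))  # variance term
```

Note that `statistics.variance` computes the sample variance (with an n-1 denominator), which matches the VAR function in SAS/IML.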

Computing the ANOM Multiplier

The final component in forming the ANOM decision limits is the multiplier, h. In elementary statistics, the value 2 (or more precisely, the 0.975 quantile of a t distribution) might be used as a multiplier, but that value isn't big enough when multiple comparisons are being made. The PROC ANOM documentation states that in a comparison of several group means with the overall mean, the proper value of h is the 1 – α quantile of a certain distribution. However, the documentation does not specify how to compute this quantile.

In SAS software you can compute the quantile by using the PROBMC function. I had never heard of the PROBMC function until I started working on this article, but it is similar to the QUANTILE function in that it enables you to obtain quantiles from one of several distributions that are used in multiple comparison computations. (You can also use the PROBMC function to obtain probabilities.)

The following statements compute h for α = 0.05 and for the case of nine groups, each with four observations:

/** 3. multiplier for ANOM **/
alpha = 0.05;
pAnom = 1 - alpha;

/** degrees of freedom for 
    pooled estimate of variance **/
df = sum(size)-nGroups; 
h = probmc("ANOM", ., pAnom, df, nGroups); 

The main idea is that h is the 1 – α quantile of the "ANOM distribution." Although the "ANOM distribution" is not as well-known as the t distribution, the idea is the same. The distribution involves parameters for the degrees of freedom and the number of groups. In the general case (when the group sizes are not constant), the sizes of the groups are also parameters for the distribution (not shown here).

Computing the Decision Limits

All three pieces are computed, so it is easy to put them together to compute the upper and lower decision limits:

/** compute decision limits **/
upperAnom = y + h * v;
lowerAnom = y - h * v;
print lowerAnom upperAnom;

Notice that these values are identical to the values graphed by the ANOM procedure.

Comparing the ANOM Multiplier with Better-Known Multipliers

The computation is finished, but it is interesting to compare the ANOM multiplier with more familiar multipliers from the t distribution.

A classic way to handle multiple comparisons is to use the Bonferroni adjustment. In this method, you divide α by the number of comparisons (9) but continue to use quantiles of the t distribution. By dividing α by the number of groups, you find quantiles that are further in the tail of the t distribution and therefore are larger than the unadjusted values. You can show that the Bonferroni multiplier is a conservative multiplier that will always be larger than the ANOM multiplier.
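To make the adjustment concrete, here is the arithmetic in Python; the value of alpha and the number of comparisons match the example in this article:

```python
alpha = 0.05
n_comparisons = 9   # one comparison per group

p_unadjusted = 1 - alpha/2                 # quantile level for a single comparison: 0.975
p_bonferroni = 1 - alpha/2/n_comparisons   # adjusted level, further in the tail
```

The Bonferroni level (about 0.9972) is much closer to 1 than the unadjusted level, which is why the corresponding t quantile is larger.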

The following statements compute decision limit multipliers based on an unadjusted t quantile (such as is used for a classical confidence interval for a mean) and on a Bonferroni adjusted quantile. These are printed, along with the multiplier h that was computed previously.

/** compare with unadjusted and Bonferroni multipliers **/
q = quantile("T", 1-alpha/2, df); 
qBonf = quantile("T", 1-alpha/2/nGroups, df); 
print q qBonf h;

For these data, the Bonferroni multiplier is only about 1% larger than h. You can see that the Bonferroni and ANOM multipliers are about 50% larger than the multiplier based on the unadjusted quantile, which means that the decision limits based on these quantiles will be wider. This is good, because the unadjusted limits are too narrow for multiple comparisons.

April 27, 2011
The log transformation is one of the most useful transformations in data analysis. It is used as a transformation to normality and as a variance stabilizing transformation. A log transformation is often used as part of exploratory data analysis in order to visualize (and later model) data that ranges over several orders of magnitude. Common examples include data on income, revenue, populations of cities, sizes of things, weights of things, and so forth.

In many cases, the variable of interest is positive and the log transformation is immediately applicable. However, some quantities (for example, profit) might contain a few negative values. How do you handle negative values if you want to log-transform the data?

Solution 1: Translate, then Transform

A common technique for handling negative values is to add a constant value to the data prior to applying the log transform. The transformation is therefore log(Y+a), where a is the constant. Some people like to choose a so that min(Y+a) is a very small positive number (like 0.001). Others choose a so that min(Y+a) = 1. In either case, a = b - min(Y), where b is the desired minimum of the translated data (either a small positive number or 1).

In the SAS/IML language, this transformation is easily programmed in a single statement. The following example uses b=1 and calls the LOG10 function, but you can call LOG, the natural logarithm function, if you prefer.

proc iml;
Y = {-3,1,2,.,5,10,100}; /** contains negative and missing values **/
LY = log10(Y + 1 - min(Y)); /** translate, then transform **/
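The same translate-then-transform idea is easy to sketch in Python, where `None` plays the role of the SAS missing value:

```python
from math import log10

Y = [-3, 1, 2, None, 5, 10, 100]            # None = missing value
a = 1 - min(y for y in Y if y is not None)  # choose a so that min(Y + a) = 1
LY = [log10(y + a) if y is not None else None for y in Y]
# the smallest data value maps to log10(1) = 0
```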

Solution 2: Use Missing Values

A criticism of the previous method is that some practicing statisticians don't like to add an arbitrary constant to the data. They argue that a better way to handle negative values is to use missing values for the logarithm of a nonpositive number.

This is the point at which some programmers decide to resort to loops and IF statements. For example, some programmers write the following inefficient SAS/IML code:

n = nrow(Y);
LogY = j(n,1); /** allocate result vector **/
do i = 1 to n; /** loop is inefficient **/
   if Y[i]>0 then LogY[i] = log10(Y[i]);
   else LogY[i] = .;
end;
The preceding approach is fine for the DATA step, but the DO loop is completely unnecessary in PROC IML. It is more efficient to use the LOC function to assign LogY, as shown in the following statements.

/** more efficient statements **/
LogY = j(nrow(Y),1,.); /** allocate missing **/
idx = loc(Y>0); /** find indices where Y>0 **/
if ncol(idx)>0 then 
   LogY[idx] = log10(Y[idx]);

print Y LY LogY;

The preceding statements initially define LogY to be a vector of missing values. The LOC function finds the indices of Y for which Y is positive. If at least one such index is found, those positive values are transformed and overwrite the missing values. A missing value remains in LogY for any element for which Y is nonpositive or missing.
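For comparison, the missing-value approach can also be sketched in Python, again with `None` standing in for a SAS missing value:

```python
from math import log10

Y = [-3, 1, 2, None, 5, 10, 100]
LogY = [log10(y) if (y is not None and y > 0) else None for y in Y]
# nonpositive and missing inputs stay missing;
# positive values keep their original scale (1 -> 0, 10 -> 1, 100 -> 2)
```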

You can see why some practitioners prefer the second method over the first: the logarithms of the data are unchanged by the second method, which makes it easy to mentally convert the transformed data back to the original scale (see the transformed values for 1, 10, and 100). The translation method makes the mental conversion harder.

You can use the previous technique for other functions that have restricted domains. For example, the same technique applies to the SQRT function and to inverse trigonometric functions such as ARSIN and ARCOS.