
November 4, 2013

Mosaic plots (Hartigan and Kleiner, 1981; Friendly, 1994, JASA) are used for exploratory data analysis of categorical data. Mosaic plots have been available for decades in SAS products such as JMP, SAS/INSIGHT, and SAS/IML Studio. However, not all SAS customers have access to these specialized products, so I am pleased that mosaic plots have recently been added to two Base SAS procedures:

Both of these features were added in SAS 9.3m2, which is the 12.1 release of the analytics products. This article describes how to create a mosaic plot by using PROC FREQ. My next blog post will describe how to create mosaic plots by using the GTL.

Use mosaic plots to visualize frequency tables

You can use mosaic plots to visualize the cells of a frequency table, also called a contingency table or a crosstabulation table. A mosaic plot consists of regions (called "tiles") whose areas are proportional to the frequencies of the table cells. The widths of the tiles are proportional to the frequencies of the column variable levels. The heights of tiles are proportional to the frequencies of the row levels within the column levels.

The FREQ procedure supports two-variable mosaic plots, which are the most important case. The Graph Template Language (GTL) supports mosaic plots with up to three categorical variables. JMP and SAS/IML Studio enable you to create mosaic plots with even more variables.

As I showed in a previous blog post, you can use user-defined formats to specify the order of levels of a categorical variable. The Sashelp.Heart data set contains data for 5,209 patients in a medical study of heart disease. You can download the program that specifies the order of levels for certain categorical variables. The following statements use the ordered categories to create a mosaic plot. The plot shows the relationship between categories of blood pressure and body weight for the patients:

ods graphics on;
proc freq data=heart;
tables BP_Cat*Weight_Cat / norow chisq plots=MOSAIC; /* alias for MOSAICPLOT */
run;

The mosaic plot is a graphical depiction of the frequency table. The mosaic plot shows the distribution of the weight categories by dividing the X axis into three intervals. The length of each interval is proportional to the percentage of patients who are underweight (3.5%), normal weight (28%), and overweight (68%), respectively. Within each weight category, the patients are further subdivided. The first column of tiles shows the proportion of patients who have optimal (29%), normal (54%), or high (18%) blood pressure, given that they are underweight. The middle column shows similar information for the patients of normal weight. The last column shows the conditional distribution of blood pressure categories, given that the patients are overweight.

The chi-square test (not shown) tests the hypothesis that there is no association between the weight of patients and their blood pressure. The chi-square test rejects that hypothesis, and the mosaic plot shows why. If there were no association between the variables, the green, red, and blue tiles would be essentially the same height regardless of the weight category of the patients. They are not. Rather, the height of the blue tiles increases from left to right, which shows that high blood pressure is more prevalent in overweight patients. Similarly, the height of the green tiles decreases from left to right, which shows that optimal blood pressure occurs more often in normal and underweight patients than in overweight patients.

The colors in this mosaic plot indicate the levels of the second variable. This enables you to quickly assess how categories of that variable depend on categories of the first variable. There are other ways to color the mosaic plot tiles, and you can use the GTL to specify an alternate set of colors. I describe that approach in my next blog post.

tags: 12.1, Data Analysis, GTL, Statistical Graphics
October 28, 2013

If you've ever tried to use PROC FREQ to create a frequency table of two character variables, you know that by default the categories for each variable are displayed in alphabetical order. A different order is sometimes more useful. For example, consider the following two-way table for the smoking status and weight status of 5,167 patients in a medical study, as recorded in the Sashelp.Heart data set, which is distributed with SAS software:

proc freq data=sashelp.heart;
tables Smoking_Status*Weight_Status / norow nocol nopct;
run;

The alphabetical order of the categories is not the best choice for these data. A better table would order the weight categories as "Underweight," "Normal," and "Overweight." Similarly, the smoking categories should be ordered from "Non-smoker" to "Very Heavy (> 25)."

In most introductory SAS courses—and at almost every SAS conference—you will hear recommendations about how to encode ordinal variables when you create a SAS data set. Two common ways to specify the order of categories are:

  • Create (or sort) the data in the order that you want the frequency table to appear. Use the ORDER=DATA option on the PROC FREQ statement to instruct the procedure that it should order categories as they appear in the data set.
  • Encode the data as a numerical variable with values 1, 2, 3, ..., and use a SAS format to provide the text that will be displayed for each category. The FREQ procedure will order the (numerical) variables by 1, 2, 3, ..., but will display the formatted values as the headers for the frequency tables.

The first approach is simple, but it has limitations. Often the data set is handed to you; you don't get to create it. You can use PROC SORT to create a sorted copy of the data, but you still need to provide variables with values such as 1, 2, 3, ..., in order to sort the data. Furthermore, if some joint categories are empty (for example, there are no underweight non-smokers in the data), then the ORDER=DATA option might not accomplish what you want. (However, a workaround is to create a weight variable and use the ZEROS option in the WEIGHT statement to include empty levels.)
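Here is a sketch of that workaround. The ZeroLevels data set is hypothetical: it would contain one observation for each category combination that should appear in the table, even if no patients fall into it.

data HeartWt;
set sashelp.heart(in=real) ZeroLevels;  /* ZeroLevels supplies the empty levels */
wt = real;                   /* weight is 1 for real obs, 0 for placeholders */
run;
 
proc freq data=HeartWt order=data;
weight wt / zeros;           /* ZEROS displays levels whose total weight is zero */
tables Smoking_Status*Weight_Status / norow nocol nopct;
run;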

The second approach requires that you add sorting variables to the data set and apply a format to those variables. First, define the format:

proc format;
value WtFmt 1 = 'Underweight'
            2 = 'Normal'
            3 = 'Overweight';
value SmFmt 1 = 'Non-smoker'
            2 = 'Light (1-5)'
            3 = 'Moderate (6-15)'
            4 = 'Heavy (16-25)'
            5 = 'Very Heavy (> 25)';
run;

Next, create new sorting variables and apply the formats to "recreate" the original variables:

data Heart / view=Heart;
format Smoking_Cat SmFmt. Weight_Cat WtFmt.;
set sashelp.heart;
select (Weight_Status);
   when ('Underweight') Weight_Cat=1;
   when ('Normal')      Weight_Cat=2;
   when ('Overweight')  Weight_Cat=3;
   when (' ')           Weight_Cat=.;
end;
select (Smoking_Status);
   when ('Non-smoker')        Smoking_Cat=1;
   when ('Light (1-5)')       Smoking_Cat=2;
   when ('Moderate (6-15)')   Smoking_Cat=3;
   when ('Heavy (16-25)')     Smoking_Cat=4;
   when ('Very Heavy (> 25)') Smoking_Cat=5;
   when (' ')                 Smoking_Cat=.;
end;
run;

I have created a data set view rather than a data set in order to save storage space, which might be important for large data sets. You can now call PROC FREQ, as follows:

proc freq data=heart;
tables Smoking_Cat*Weight_Cat / norow nocol nopct;
run;

By using this technique, you can specify any order for the categories of a contingency table. The technique extends to other analyses as well.
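For example, the same view and formats carry the ordering over to other procedures. A sketch, using the Heart view defined above (Cholesterol is one of the numeric variables in Sashelp.Heart):

proc means data=heart mean std;
class Smoking_Cat;   /* levels sort by the values 1-5; labels come from SmFmt. */
var Cholesterol;
run;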

For more on using PROC FORMAT for data analysis, see the following articles:

tags: Data Analysis, Getting Started
October 23, 2013

A challenge for statistical programmers is getting data into the right form for analysis. For graphing or analyzing data, sometimes the "wide format" (each subject is represented by one row and many variables) is required, but other times the "long format" (observations for each subject span multiple rows) is more useful. I've previously written about how to reshape data from wide to long format.

It can be a challenge to get data in the right shape. For years I have wrestled with reformatting output data sets when computing descriptive statistics for multiple variables. One of the powers of SAS software is that you can specify many variables in a procedure (such as MEANS or UNIVARIATE) and the procedure computes the requested statistics for each variable. However, to use that output in a subsequent graph or analysis sometimes requires reshaping the output data set.

Since it is inconvenient to have to reshape data sets, this article describes three ways to produce descriptive statistics (in particular, percentiles) in a tabular form in which variables form columns (or rows) and the statistics are arranged in the other dimension. This is usually the easiest shape to work with.

For standard percentiles, use PROC MEANS

As a canonical example, consider the task of computing multiple percentiles for several variables when the underlying data are in a wide format. By default, both PROC MEANS and PROC UNIVARIATE create the output data set in a less-than-optimal shape.

For commonly used percentiles (such as the 5th, 25th, 50th, 75th, and 95th percentiles), you can use PROC MEANS and the STACKODSOUTPUT option, which was introduced in SAS 9.3, to create an output data set that contains percentiles for the multiple variables in a more convenient format. Compare the following two output data sets. The first output is usually harder to work with than the second:

/* default output data set. Yuck! */
proc means data=sashelp.cars noprint; 
var mpg_city mpg_highway;
output out=MeansWidePctls P5= P25= P75= P95= / autoname;
run;
 
proc print data=MeansWidePctls noobs; run;
/* Use the STACKODSOUTPUT option to get output in a more natural shape */
proc means data=sashelp.cars StackODSOutput P5 P25 P75 P95; 
var mpg_city mpg_highway;
ods output summary=MeansLongPctls;
run;
 
proc print data=MeansLongPctls noobs;run;

If I want to use the percentiles in a subsequent analysis, such as plotting the percentiles on a graph, I much prefer to work with the second set of output.

For nonstandard percentiles, use PROC STDIZE

For many years I have used PROC UNIVARIATE to compute custom percentiles. One application of custom percentiles is to compute a 95% bootstrap confidence interval, which requires computing the 2.5th and 97.5th percentiles. However, when you are analyzing multiple variables PROC UNIVARIATE creates an output data set that is a single row, as shown in the following example:

/* default output data set. Yuck! */
proc univariate data=sashelp.cars noprint;
var mpg_city mpg_highway;
output out=UniWidePctls pctlpre=CityP_ HwyP_ pctlpts=2.5,15,65,97.5; 
run; 
 
proc print data=UniWidePctls noobs; run;

There are many tricks, papers, and macros published that reshape or restructure the output from PROC UNIVARIATE. In my book, Simulating Data with SAS, I show one technique for reshaping the output. (If you have a favorite method, link to it in the comment section.)

And so it was a surprise to find out that the easiest way to get customized percentiles in a tabular form is to not use PROC UNIVARIATE! It turns out that the STDIZE procedure also computes custom percentiles, and the output is written in a tabular form. I knew that PROC STDIZE standardizes variables, but I didn't know that it computes custom percentiles.

The PROC STDIZE syntax is easy and very similar to the PROC UNIVARIATE syntax: use the PCTLPTS= option to specify the custom percentiles. Unlike PROC UNIVARIATE, you do not need the PCTLPRE= option, because the names of the variables are used for the output variables, and the percentiles are just additional rows that are added to the output data set. A typical example follows:

proc stdize data=sashelp.cars PctlMtd=ORD_STAT outstat=StdLongPctls
           pctlpts=2.5,15,65,97.5;
var mpg_city mpg_highway;
run;
 
proc print data=StdLongPctls noobs;
where _type_ =: 'P';
run;

I explicitly specified the PCTLMTD= option so that the algorithm uses the traditional "sort the data" algorithm for computing percentiles, rather than a newer one-pass algorithm. Although the one-pass algorithm is very fast and well-suited for computing the median, it is not recommended for computing extreme percentiles such as the 2.5th and 97.5th.
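For comparison, here is a sketch that requests the one-pass algorithm for a central percentile such as the median, for which it is well suited. (The output data set name is my own choice.)

proc stdize data=sashelp.cars PctlMtd=ONEPASS outstat=StdOnePass
            pctlpts=50;          /* the median: safe for the one-pass method */
var mpg_city mpg_highway;
run;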

Or use the SAS/IML language

Of course, no blog post from me is complete without showing how to compute the same quantity by using the SAS/IML language. The QNTL subroutine computes customized quantiles. (Recall that quantiles and percentiles are essentially the same: The 0.05 quantile is the 5th percentile, the 0.25 quantile is the 25th percentile, and so forth.) By using the QNTL subroutine, the quantiles automatically are packed into a matrix where each column corresponds to a variable and each row corresponds to a quantile, as follows:

proc iml;
varNames = {"mpg_city" "mpg_highway"};
use sashelp.cars;
read all var varNames into X;
close sashelp.cars;
 
Prob = {2.5, 15, 65, 97.5} / 100;   /* prob in (0,1) */
call qntl(Pctls, X, Prob);
print Prob Pctls[c=varNames];

So there you have it. Three ways to compute percentiles (quantiles) so that the results are shaped in a tabular form. For standard percentiles, use PROC MEANS with the STACKODSOUTPUT option. For arbitrary percentiles, use PROC STDIZE or PROC IML. If you use these techniques, the percentiles are arranged in a tabular form and you do not need to run any additional macro or DATA step to reshape the output.

tags: Bootstrap and Resampling, Data Analysis, Tips and Techniques
October 16, 2013

On Kaiser Fung's Junk Charts blog, he showed a bar chart that was "published by Teach for America, touting its diversity." Kaiser objected to the chart because the bar lengths did not accurately depict the proportions of the Teach for America corps members.

The chart bothers me for another reason: I think if you create a graph that purports to demonstrate racial diversity, your graph should provide as reference the national percentages of each racial group. In general, if you claim that your organization does something better than a benchmark, you should include the benchmark values. (For mutual fund companies, comparing performance to a benchmark index is not just a good idea, it is the law.)

It turns out that the U.S. Census Bureau does not count "Latino" as a racial or ethnic group, but I used data and information from the 2010 US census to estimate the racial distribution of the US population. The following data and graphs compare the diversity of Teach for America to the national percentages:

title "Teach for America Diversity versus US Average";
data TeachAmerica;
input Race $28. Corps USPop;
label Diff="Difference between Corps and US Percentages"
      RelDiff="Relative Difference between Corps and US";
Diff = Corps - USPop;   /* O-E = observed - expected */
RelDiff = Diff / USPop; /* relative diff: (O-E)/E    */
datalines;
Caucasian                    62   56
African American             13   13
Latino                       10   16
Asian American                6    5
Multi-Ethnic                  5    6
Other                         4    3
Native American or Hawaiian   0.5  1.1
;
 
proc sgplot data=TeachAmerica; /* standard dot plot with reference */
scatter y=Race x=USPop / markerattrs=(symbol=Diamond) legendlabel="US";
scatter y=Race x=Corps / markerattrs=(symbol=CircleFilled) legendlabel="Corps";
yaxis grid;
run;
 
proc sgplot data=TeachAmerica; /* plot differences from US population */
scatter y=Race x=Diff / markerattrs=(symbol=CircleFilled);
refline 0 / axis=x;
yaxis grid;
run;

Two graphs are created, but only one is shown. The first graph (not shown) is a dot plot of the Teach for America percentages overlaid on a dot plot of the US percentages for each race. This essentially reproduces the Teach for America bar chart, but adds the reference percentages. I use a dot plot instead of a bar chart so that the reference percentages are easier to see.

However, as I've said recently, when you are comparing two groups, you should strongly consider creating a graph of the difference between the two groups. The differences are shown in the adjacent figure. (Click to enlarge.) The graph shows that there is not much difference between the diversity of Teach for America and the US population. Five of the seven racial groups are within a percentage point of the national average. Latinos are underrepresented in the Teach for America corps, whereas Caucasians are overrepresented.

The graph tells a more complete story than the Teach for America article, which claims "a higher level of diversity in our corps" than US colleges. Although the Teach for America corps is not overrepresented in all minority groups compared to the US population, they should be commended for closely matching the diversity of the US population. The graph also sends a message: although Teach for America has done a great job with diversity, they need to continue to work on recruiting Latino teachers.

When visualizing differences, sometimes it is useful to compute relative differences. Caucasians are a large ethnic group, so although a deviation of 6 percentage points is large in absolute terms, it is small in relative terms. To create this graph by using PROC SGPLOT, specify the RelDiff variable instead of the Diff variable.
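A sketch of that call, mirroring the earlier difference plot:

proc sgplot data=TeachAmerica; /* plot relative differences from US population */
scatter y=Race x=RelDiff / markerattrs=(symbol=CircleFilled);
refline 0 / axis=x;
yaxis grid;
run;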

It is worth noting that SAS software makes it easy to graph and analyze data like these. The TABLES statement in the FREQ procedure supports a TESTP= option that enables you to specify proportions for a chi-square test for proportions. You can also request a DeviationPlot, which shows the relative differences between the observed proportions and the specified proportions. To use PROC FREQ, you should convert the percentages into frequencies. The Teach for America Web site says that there were 11,000 teachers in the program in 2013, so I will use that number to convert the percentages into frequencies.

data Teach;
set TeachAmerica;
N = 11000 * Corps / 100;  /* approx number of people in corps for each group */
run;
 
proc freq data=Teach order=data;
weight N;
tables Race / chisq testp=(.56 .13 .16 .05 .06 .03 .011)
              plots(type=dotplot)=(DeviationPlot FreqPlot(scale=percent));
run;

The graph shows that the Native American and Latino groups are underrepresented on a relative basis, whereas the "Other" group is overrepresented. On a relative scale, Caucasians are only slightly overrepresented. The chi-square table (not shown) indicates that Teach for America proportions are not equal to the national proportions. The chi-square p-value is tiny, and is included in the lower right corner of the graph.

In summary, if you claim that certain quantities are as good as or better than some benchmark, include the benchmark as part of your graph. Even better, graph the difference between your quantities and the benchmark. SAS software provides tools to construct the graphs by hand, or as part of a statistical test for proportions.

tags: Data Analysis, Statistical Graphics
October 11, 2013

In my last blog post I described how to implement a "runs test" in the SAS/IML language. The runs test determines whether a sequence of two values (for example, heads and tails) is likely to have been generated by random chance. This article describes two applications of the runs test. In both cases, you have to recode the data into a two-value sequence in order to use the runs test. The applications are:

  • Application 1: Are two samples from the same population?
  • Application 2: Given a series of univariate numerical values, is there evidence that the data are random deviations from a constant value?

Are two samples from the same population?

The runs test can be applied to the problem of determining whether two samples are from the same population. If you measure some quantity for two groups, you might want to know whether the distribution of that quantity is the same for both groups.

There are various parametric tests that test whether some parameter is the same in both groups. In SAS software, you can use the TTEST procedure to test whether two samples are from the same normal populations. If you do not want to make a parametric assumption on the distribution of the population, there are many distribution-free tests in the NPAR1WAY procedure. The runs test is not included in PROC NPAR1WAY because there are other tests that have greater statistical power. (In fact, James Bradley's Distribution-Free Statistical Tests (1968, p. 263) says "This is one of the least powerful distribution-free tests.")

Nevertheless, let's see how you can use the runs test to test for identical populations. The Sashelp.bweight data set contains the birth weight of 50,000 babies. The following DATA step extracts the first 50 observations from the data: 31 boys and 19 girls. You might want to know whether the distribution of birth weight for boys is the same as the distribution of weight for the girls.

data bweight;      /* choose 50 obs subset of the data */
set sashelp.bweight(keep=Weight Boy);
N = _N_;
if _N_<=50;
run;
 
proc sgplot data=bweight;
title "Birth weight of 50 Babies";
scatter x=N y=Weight / group=Boy markerattrs=(symbol=CircleFilled);
keylegend / position=bottomright location=inside across=1 title="Boy";
xaxistable boy / location=inside;                /* SAS 9.4 feature */
run;

The data are shown in the scatter plot, which also demonstrates a cool new feature of SAS 9.4. The XAXISTABLE statement is used to display a small table at the bottom of the graph that indicates whether each observation is a boy (1) or girl (0)! (Click to enlarge.)

If the two samples (weights of boys and girls) come from the same population, the boys and girls should be randomly distributed in a sequence of the ordered weights. However, if girls tend to weigh less than boys, you would expect to see more girls at the beginning of the sequence and more boys at the end. You can use the runs test to determine whether the sequence is random.

The following SAS/IML program reads the weight and gender of the babies. The program sorts the data by weight and uses the runs test to analyze the resulting sequence of genders. The RunsTest module needs to be defined or loaded before it is called.
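The module itself is defined in the previous post. For completeness, here is a minimal sketch of what such a module might look like, assuming the normal approximation to the Wald-Wolfowitz statistic and a column vector with exactly two distinct values as input:

proc iml;
start RunsTest(seq);
   u = unique(seq);                      /* the two distinct values */
   n1 = sum(seq=u[1]);  n2 = sum(seq=u[2]);
   n = n1 + n2;
   R = 1 + sum(seq[1:n-1] ^= seq[2:n]);  /* count the runs */
   mu = 2*n1*n2/n + 1;                   /* E(R) under randomness */
   sigma2 = 2*n1*n2*(2*n1*n2 - n) / (n##2 * (n-1));
   z = (R - mu) / sqrt(sigma2);          /* standardized statistic */
   p = 2*(1 - cdf("Normal", abs(z)));    /* two-sided p-value */
   return( z || p );
finish;
store module=RunsTest;   /* so that LOAD MODULE=RunsTest works */
quit;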

proc iml;
load module=RunsTest;  /* load or define the RunsTest module */
 
/* Application 1: Determine whether two samples are from the same population.
   Sort the data. Form two-value sequence that indicates the sample to which 
   it belongs. If populations are identical, the sequence is random. */
use bweight; read all var {"Weight" "Boy"} into X; close bweight;
 
call sort(X, 1);          /* sort data by weight */
seq = X[,2];              /* two-value sequence to test */
test = RunsTest(seq);
print test[c={"Runs Statistic" "p-value"} L="Runs Test (Two Populations)"];

The RunsTest module returns a large p-value, which indicates that the sequence appears to be random. Notice that this is a distribution-free test because it uses only the fact that the data are from two groups. The test does not assume any special shape for the distribution of the population.

Is a sequence randomly distributed about some central value?

Consider the problem of filling a box with cereal as it rolls down an assembly line. If the filling process is in control, the amount of cereal in each box is the target value (such as 450 grams) plus or minus some small random amount. If you weigh a sequence of cereal boxes, you can divide the units into two categories: those that weigh more than the target value and those that weigh less. If the variation is random, the sequence of "More" and "Less" should be random.

By subtracting the target value from each box's weight, the sequence is transformed into a sequence of positive and negative values. The following SAS/IML statements model the assembly line by simulating a sequence of 30 random values from a standard normal distribution. Positive values are assigned the value +1 and negative values are assigned the value –1. The new sequence of values {+1, –1, –1,..., +1} is sent to the RunsTest module:

/* Application 2: Is P(X<=0) constant during data generation? */
call randseed(123);
x = j(30,1);               /* 30 x 1 vector */
call randgen(x, "Normal"); /* fill with random normal values */
s = sign(x);               /* s=1 iff x>0 */
Test = RunsTest(s);
print Test[c={"Runs Statistic" "p-value"}
          L="Runs Test (Normal Sequence)"];

The runs test returns a large p-value, which indicates that there is no reason to doubt that the sequence is random. Although this does not prove that the RANDGEN function generates random values, it does indicate that there is no pattern to the positive and negative values.

Although the runs test is not very powerful, the test is simple and easy to understand, and rather fun to think about and implement. It is also a good example for showing how you can use the SAS/IML language to extend the statistical capabilities of SAS software.

tags: Data Analysis
10月 112013
 

In my last blog post I described how to implement a "runs test" in the SAS/IML language. The runs test determines whether a sequence of two values (for example, heads and tails) is likely to have been generated by random chance. This article describes two applications of the runs test. In both cases, you have to recode the data into a two-value sequence in order to use the runs test. The applications are:

  • Application 1: Are two samples from the same population?
  • Application 2: Given a series of univariate numerical values, is there evidence that the data are random deviations from a constant value?

Are two samples from the same population?

The runs test can be applied to the problem of determining whether two samples are from the same population. If you measure some quantity for two groups, you might want to know whether the distribution of that quantity is the same for both groups.

There are various parametric tests that test whether some parameter is the same in both groups. In SAS software, you can use the TTEST procedure to test whether two samples are from the same normal population. If you do not want to make a parametric assumption about the distribution of the population, there are many distribution-free tests in the NPAR1WAY procedure. The runs test is not included in PROC NPAR1WAY because other tests have greater statistical power. (In fact, James Bradley's Distribution-Free Statistical Tests (1968, p. 263) says "This is one of the least powerful distribution-free tests.")

Nevertheless, let's see how you can use the runs test to test for identical populations. The Sashelp.bweight data set contains the birth weight of 50,000 babies. The following DATA step extracts the first 50 observations from the data: 31 boys and 19 girls. You might want to know whether the distribution of birth weight for boys is the same as the distribution of weight for the girls.

data bweight;      /* choose 50 obs subset of the data */
set sashelp.bweight(keep=Weight Boy);
N = _N_;
if _N_<=50;
run;
 
proc sgplot data=bweight;
title "Birth weight of 50 Babies";
scatter x=N y=Weight / group=Boy markerattrs=(symbol=CircleFilled);
keylegend / position=bottomright location=inside across=1 title="Boy";
xaxistable boy / location=inside;                /* SAS 9.4 feature */
run;

The data are shown in the scatter plot, which also demonstrates a cool new feature of SAS 9.4. The XAXISTABLE statement displays a small table at the bottom of the graph that indicates whether each observation is a boy (1) or girl (0)!

If the two samples (weights of boys and girls) come from the same population, the boys and girls should be randomly distributed in a sequence of the ordered weights. However, if girls tend to weigh less than boys, you would expect to see more girls at the beginning of the sequence and more boys at the end. You can use the runs test to determine whether the sequence is random.
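To make the recoding concrete, here is a small Python sketch (the six weights and labels below are invented for illustration and are not from Sashelp.bweight). Sorting by weight and reading off the group labels produces the two-value sequence that the runs test examines:

```python
# Hypothetical mini-example: six birth weights and group labels (1=boy, 0=girl)
weights = [3.1, 2.8, 3.6, 2.9, 3.4, 3.0]
boy     = [1,   0,   1,   0,   1,   0]

order = sorted(range(len(weights)), key=lambda i: weights[i])  # indices in weight order
seq = [boy[i] for i in order]    # gender sequence after sorting by weight
print(seq)                       # [0, 0, 0, 1, 1, 1]
```

In this toy example the girls occupy the light end of the scale, so the sequence has only two runs and a runs test would flag it as nonrandom. The question for the real data is whether the observed mixing of boys and girls looks like chance.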

The following SAS/IML program reads the weight and gender of the babies. The program sorts the data by weight and uses the runs test to analyze the resulting sequence of genders. The RunsTest module needs to be defined or loaded before it is called.

proc iml;
load module=RunsTest;  /* load or define the RunsTest module */
 
/* Application 1: Determine whether two samples are from the same population.
   Sort the data. Form two-value sequence that indicates the sample to which 
   it belongs. If populations are identical, the sequence is random. */
use bweight; read all var {"Weight" "Boy"} into X; close bweight;
 
call sort(X, 1);          /* sort data by weight */
seq = X[,2];              /* two-value sequence to test */
test = RunsTest(seq);
print test[c={"Runs Statistic" "p-value"} L="Runs Test (Two Populations)"];

The RunsTest module returns a large p-value, which indicates that the sequence appears to be random. Notice that this is a distribution-free test because it uses only the fact that the data are from two groups. The test does not assume any special shape for the distribution of the population.

Is a sequence randomly distributed about some central value?

Consider the problem of filling a box with cereal as it rolls down an assembly line. If the filling process is in control, the amount of cereal in each box is the target value (such as 450 grams) plus or minus some small random amount. If you weigh a sequence of cereal boxes, you can divide the units into two categories: those that weigh more than the target value and those that weigh less. If the variation is random, the sequence of "More" and "Less" should be random.
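To make the dichotomy concrete, here is a quick Python sketch of the recoding; the target value and the five box weights are invented for illustration:

```python
target = 450.0                                      # nominal fill weight (grams)
box_weights = [451.2, 449.1, 448.7, 450.9, 450.3]   # hypothetical measurements
s = [1 if w > target else -1 for w in box_weights]  # +1 = "More", -1 = "Less"
print(s)   # [1, -1, -1, 1, 1]
```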

By subtracting the target value from each box's weight, the sequence is transformed into a sequence of positive and negative values. The following SAS/IML statements model the assembly line by simulating a sequence of 30 random values from a standard normal distribution. Positive values are assigned the value +1 and negative values are assigned the value –1. The new sequence of values {+1, –1, –1,..., +1} is sent to the RunsTest module:

/* Application 2: Is P(X<=0) constant during data generation? */
call randseed(123);
x = j(30,1);               /* 30 x 1 vector */
call randgen(x, "Normal"); /* fill with random normal values */
s = sign(x);               /* s=1 iff x>0 */
Test = RunsTest(s);
print Test[c={"Runs Statistic" "p-value"}
          L="Runs Test (Normal Sequence)"];

The runs test returns a large p-value, which indicates that there is no reason to doubt that the sequence is random. Although this does not prove that the RANDGEN function generates random values, it does indicate that there is no pattern to the positive and negative values.

Although the runs test is not very powerful, the test is simple and easy to understand, and rather fun to think about and implement. It is also a good example for showing how you can use the SAS/IML language to extend the statistical capabilities of SAS software.

tags: Data Analysis
10月 092013
 

While walking in the woods, a statistician named Goldilocks wanders into a cottage and discovers three bears. The bears, being hungry, threaten to eat the young lady, but Goldilocks begs them to give her a chance to win her freedom.

The bears agree. While Mama Bear and Papa Bear block Goldilocks' view, Baby Bear tosses a coin 30 times and records the results. He then makes up two other (fake) sequences of heads and tails, and gives Goldilocks a piece of paper that shows all three sequences. Papa Bear growls, "If you can determine which sequence came from the real coin toss, we will let you go. Otherwise we will eat you for dinner, for I have grown tired of porridge."

Here are the three sequences of heads (H) and tails (T) that the bears present to Goldilocks. Each sequence contains 16 heads and 14 tails.

  1. H H H H H H H H H H H H H H H H T T T T T T T T T T T T T T
  2. H T H T H T H T H T H T H T H T H T H T H T H T H T H T H H
  3. H T T H H H T T T T T T T H H H T H T H H H T H H H T H T H

Goldilocks studies the three sequences and tells Papa Bear:

The first sequence is "too hot." It contains 16 heads followed by 14 tails. I would not expect such long sequences of heads and tails. Similarly, the second sequence is "too cold." It alternates between heads and tails like clockwork. The third sequence is "just right." It matches my intuitive notion of a random sequence of two categories: many short subsequences interlaced with a few longer subsequences. I think that the third sequence is real.

She had chosen correctly. The three bears, impressed by her statistical knowledge, set Goldilocks free and—once again—reluctantly ate porridge for dinner.

Implementing the "runs test" in SAS

Goldilocks reasoned well. If she had brought along her laptop with SAS software on it, she could have used statistics to analyze the data.

You can quantify Goldilocks' intuitive notions by defining a run as a sequence of consecutive trials that result in the same value. The first sequence has two runs: a run of heads followed by a run of tails. The second sequence has 29 runs. The third sequence has 15 runs: eight runs of heads and seven runs of tails.

It turns out that you can calculate the expected number of runs in a random sequence that has n heads and m tails. The expected number of runs is E(R) = 2nm / (n+m) + 1. The three sequences have n=16 heads and m=14 tails, so the expected number of runs is 15.9. So Goldilocks' intuition was correct: the first sequence does not have enough runs, whereas the second has too many. The third sequence has 15 runs, which is close to the expected value.
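You can verify both the run counts and the expected value with a few lines of Python (my own sketch, not part of the original analysis). A new run starts at every position where the value differs from its predecessor:

```python
def count_runs(s):
    # number of runs = 1 + number of adjacent positions where the value changes
    return 1 + sum(1 for a, b in zip(s, s[1:]) if a != b)

seqs = ["H"*16 + "T"*14,                       # sequence 1: "too hot"
        "HT"*14 + "HH",                        # sequence 2: "too cold"
        "HTTHHHTTTTTTTHHHTHTHHHTHHHTHTH"]      # sequence 3: "just right"
print([count_runs(s) for s in seqs])           # [2, 29, 15]

n, m = 16, 14                                  # heads and tails in each sequence
print(round(2*n*m / (n + m) + 1, 1))           # expected number of runs: 15.9
```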

You can use these facts to construct a statistical test for the randomness of a two-category sequence, based on the number of runs in the sequence. This is called the runs test, or the Wald-Wolfowitz test for randomness. There is a SAS Knowledge Base article that describes how to implement the runs test in SAS by using the DATA step. However, the DATA step implementation is longer and more complicated than an analogous implementation in the SAS/IML language.

To simplify counting the number of runs, recode the two categories (heads and tails) as –1 and +1. Then a "run" is a sequence of consecutive values that have the same sign. I have previously blogged about how to use the DIF-SIGN technique to detect when terms in a sequence switch signs. In the following program, that technique is used to count the number of runs in a sequence.

proc iml;
/* Implement the Wald-Wolfowitz test (or "runs test").
   The SEQ vector is assumed to contain two groups (char or numeric).
   Compare with Base SAS version: http://support.sas.com/kb/33/092.html */
start RunsTest(seq);
   u = unique(seq);                /* find unique values (should be two) */
   d = j(nrow(seq)*ncol(seq),1, 1);/* recode as vector of -1, +1 */
   d[loc(seq=u[1])] = -1;          
   n = sum(d>0);                   /* count number of +1s */
   m = sum(d<0);                   /* count number of -1s */
 
   /* Use DIF-SIGN trick to count the number of runs */
   dif = dif(sign(d));             /* difference in signs */
   dif[1] = choose(d[1]<0, -2, 2); /* replace first obs */
   R = sum(dif=2 | dif=-2);        /* R = observed number of runs */
 
   mu = 2*n*m / (n+m) + 1;         /* E(R)=expected number of runs, given n, m */
   var = 2*n*m*(2*n*m-(n+m)) / ((n+m)##2*(n+m-1)); /* var(R) */
   sigma = sqrt(var);              /* StdDev(R) */
 
   /* R is asymptotically normally distributed. Compute test statistic. */
   if (n+m)>50 then      Z = (R-mu)    /sigma; 
   else if R-mu < 0 then Z = (R-mu+0.5)/sigma; /* for small samples, apply */
   else                  Z = (R-mu-0.5)/sigma; /*    continuity correction */
 
   return(Z || 2*(1-cdf("Normal", abs(Z))));   /* (Z, two-sided p-value) */
finish;
 
flips = {H T T H H H T T T T T T T H H H T H T H H H T H H H T H T H}`;
Test = RunsTest(flips);
print Test[c={"Runs Statistic" "p-value"} L="Runs Test, two-sided"];

The output of the RunsTest module is a two-element vector. The first element is the test statistic and the second element is the two-sided p-value. Running the test on Baby Bear's coin-toss sequence produces a large p-value, which means that Goldilocks would accept the null hypothesis that the sequence of coin tosses is random. You can run the test for the other two sequences ("too hot" and "too cold") and see that both of them produce tiny p-values. If Goldilocks had implemented the runs test, she would have concluded that the made-up sequences are unlikely to be random sequences of coin flips.
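As a cross-check outside of SAS, the module's logic can be translated into Python (this is my own sketch that mirrors the RunsTest module above, including the small-sample continuity correction):

```python
import math

def runs_test(seq):
    """Wald-Wolfowitz runs test; returns (Z statistic, two-sided p-value)."""
    u = sorted(set(seq))                           # the two distinct values
    d = [1 if x == u[1] else -1 for x in seq]      # recode as -1/+1
    n = sum(v > 0 for v in d)                      # count of +1s
    m = sum(v < 0 for v in d)                      # count of -1s
    R = 1 + sum(a != b for a, b in zip(d, d[1:]))  # observed number of runs
    mu = 2*n*m / (n + m) + 1                       # E(R)
    var = 2*n*m * (2*n*m - (n + m)) / ((n + m)**2 * (n + m - 1))
    sigma = math.sqrt(var)
    if n + m > 50:
        Z = (R - mu) / sigma
    else:                                          # continuity correction
        Z = (R - mu + 0.5) / sigma if R < mu else (R - mu - 0.5) / sigma
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(Z) / math.sqrt(2))))  # normal approx
    return Z, p

Z, p = runs_test(list("HTTHHHTTTTTTTHHHTHTHHHTHHHTHTH"))
print(round(Z, 3), round(p, 3))   # small |Z|, large p-value
```

Running the same function on the "too hot" and "too cold" sequences yields p-values near zero, which matches Goldilocks' verdict.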

I'll conclude with a few thoughts:

  • The SAS/IML language is useful for implementing statistical methods, such as the runs test, that are not included in any SAS procedure.
  • The SAS/IML language is compact. For this example, the SAS/IML program contains about 15 lines, versus 65 lines for the DATA step code.
  • When I first thought about this problem, I thought I would need to use a loop to count the number of runs in a sequence. However, by recoding the data as +/–1, I was able to vectorize the computation by using the DIF-SIGN technique.
  • Goldilocks was fortunate to know statistics.

tags: Data Analysis, Just for Fun, Tips and Techniques