
May 24, 2017
 

According to Hyndman and Fan ("Sample Quantiles in Statistical Packages," TAS, 1996), there are nine definitions of sample quantiles that commonly appear in statistical software packages. Hyndman and Fan identify three definitions that are based on rounding and six methods that are based on linear interpolation. This blog post shows how to use SAS to visualize and compare the nine common definitions of sample quantiles. It also compares the default definitions of sample quantiles in SAS and R.

Definitions of sample quantiles

Suppose that a sample has N observations that are sorted so that x[1] ≤ x[2] ≤ ... ≤ x[N], and suppose that you are interested in estimating the p_th quantile (0 ≤ p ≤ 1) for the population. Intuitively, the data values near x[j], where j = floor(Np), are reasonable values to use to estimate the quantile. For example, if N=10 and you want to estimate the quantile for p=0.64, then j = floor(Np) = 6, so you can use the sixth ordered value (x[6]) and maybe other nearby values to estimate the quantile.

Hyndman and Fan (henceforth H&F) note that the quantile definitions in statistical software have three properties in common:

  • The value p and the sample size N are used to determine two adjacent data values, x[j] and x[j+1]. The quantile estimate will be in the closed interval between those data points. For the previous example, the quantile estimate would be in the closed interval between x[6] and x[7].
  • For many methods, a fractional quantity is used to determine an interpolation parameter, λ. For the previous example, the fractional quantity is (Np - j) = (6.4 - 6) = 0.4. If you use λ = 0.4, then the estimate of the 64th percentile would be the value 40% of the way between x[6] and x[7].
  • Each definition has a parameter m, 0 ≤ m ≤ 1, which determines how the method interpolates between adjacent data points. In general, the methods define the index j by using j = floor(Np + m). The previous example used m=0, but other choices include m=0.5 or values of m that depend on p.

Thus a general formula for quantile estimates is q = (1 - λ) x[j] + λ x[j+1], where λ and j depend on the values of p, N, and a method-specific parameter m.

You can read Hyndman and Fan (1996) for details or see the Wikipedia article about quantiles for a summary. The Wikipedia article points out a practical consideration: for values of p that are very close to 0 or 1, some definitions need to be slightly modified. For example, if p < 1/N, the quantity Np < 1 and so j = floor(Np) equals 0, which is an invalid index. The convention is to return x[1] when p is very small and return x[N] when p is very close to 1.
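To make the formula concrete, here is a minimal SAS/IML sketch of the interpolation-based definitions. (The rounding-based definitions need slightly different logic.) The module name, the boundary handling, and the example call are mine; you supply the value of m for the definition you want.

proc iml;
/* Sketch of the general interpolation formula from H&F:
   q = (1-lambda)*x[j] + lambda*x[j+1], where j = floor(N*p + m).
   x : sorted column vector of data
   p : probability, 0 < p < 1
   m : method-specific quantity (for example, m = 1-p for H&F type 7, the R default)
   The boundary convention returns x[1] or x[N] when the index falls outside 1..N-1. */
start QuantileHF(x, p, m);
   N = nrow(x);
   h = N*p + m;
   idx = floor(h);
   lambda = h - idx;                      /* fractional part */
   if idx < 1  then return( x[1] );       /* p too close to 0 */
   if idx >= N then return( x[N] );       /* p too close to 1 */
   return( (1-lambda)*x[idx] + lambda*x[idx+1] );
finish;

x = {0, 1, 1, 1, 2, 2, 2, 4, 5, 8};       /* the 10-observation data set used later in this article */
q = QuantileHF(x, 0.64, 1-0.64);          /* type 7 (the R default) uses m = 1-p */
print q;
quit;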

Compute all nine sample quantile definitions in SAS

SAS has built-in support for five of the quantile definitions, notably in PROC UNIVARIATE, PROC MEANS, and in the QNTL subroutine in SAS/IML. You can use the QNTLDEF= option to choose from the five definitions. The following table associates the five QNTLDEF= definitions in SAS to the corresponding definitions from H&F, which are also used by R. In R you choose the definition by using the type parameter in the quantile function.

SAS definitions of sample quantiles

It is straightforward to write a SAS/IML function to compute the other four definitions in H&F. In fact, H&F present the quantile interpolation functions as specific instances of one general formula that contains a parameter, which they call m. As mentioned above, you can also define a small value c (which depends on the method) such that the method returns x[1] if p < c, and the method returns x[N] if p ≥ 1 - c.

The following table presents the parameters for computing the four sample quantile definitions that are not natively supported in SAS:

Definitions of sample quantiles that are not natively supported in SAS
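As a concrete illustration, the following sketch evaluates the general interpolation formula for these four definitions by plugging in a definition-specific value of m. The m values shown are my reading of H&F (types 5, 7, 8, and 9), so verify them against the paper before relying on them.

proc iml;
x = {0, 1, 1, 1, 2, 2, 2, 4, 5, 8};            /* sorted sample */
N = nrow(x);
p = 0.9;                                        /* probability of interest */
type = {5, 7, 8, 9};                            /* H&F types that SAS does not support natively */
m = 0.5 // (1-p) // (p+1)/3 // (p/4 + 3/8);     /* method-specific quantities (my reading of H&F) */
q = j(nrow(type), 1, .);
do i = 1 to nrow(type);
   h = N*p + m[i];
   k = floor(h);  lambda = h - k;               /* index and interpolation parameter */
   if k < 1 then q[i] = x[1];                   /* convention for p near 0 */
   else if k >= N then q[i] = x[N];             /* convention for p near 1 */
   else q[i] = (1-lambda)*x[k] + lambda*x[k+1];
end;
print type m q;
quit;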

Visualizing the definitions of sample quantiles

Visualization of nine definitions of sample quantiles, from Hyndman and Fan (1996)

You can download the SAS program that shows how to compute sample quantiles and graphs for any of the nine definitions in H&F. The differences between the definitions are most evident for small data sets and when there is a large "gap" between one or more adjacent data values. The following panel of graphs shows the nine sample quantile methods for a data set that has 10 observations, {0 1 1 1 2 2 2 4 5 8}. Each cell in the panel shows the quantiles for p = 0.001, 0.002, ..., 0.999. The bottom of each cell is a fringe plot that shows the six unique data values.

In these graphs, the horizontal axis represents the data and quantiles. For any value of x, the graph estimates the cumulative proportion of the population that is less than or equal to x. Notice that if you turn your head sideways, you can see the quantile function, which is the inverse function that estimates the quantile for each value of the cumulative probability.

You can see that although the nine quantile functions have the same basic shape, the first three methods estimate quantiles by using a discrete rounding scheme, whereas the other methods use a continuous interpolation scheme.

You can use the same data to compare methods. Instead of plotting each quantile definition in its own cell, you can overlay two or more methods. For example, by default, SAS computes sample quantiles by using the type=2 method, whereas R uses type=7 by default. The following graph overlays the sample quantiles to compare the default methods in SAS and R on this tiny data set. The default method in SAS always returns a data value or the average of adjacent data values; the default method in R can return any value in the range of the data.

Comparison of the default quantile estimates in SAS and R on a tiny data set
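If you want to build a similar overlay yourself, one approach is to put the two sets of estimates side by side in one data set and use two SCATTER statements. This is a sketch, not the downloadable program: the data set name Compare and its variables q2 and q7 (the type=2 and type=7 estimates for each value of prob) are assumptions.

title "Default Sample Quantiles in SAS (type=2) and R (type=7)";
proc sgplot data=Compare;
   scatter x=q2 y=prob / legendlabel="type=2 (SAS default)"
                         markerattrs=(symbol=CircleFilled);
   scatter x=q7 y=prob / legendlabel="type=7 (R default)"
                         markerattrs=(symbol=CircleFilled size=4);
   xaxis display=(nolabel);
   yaxis grid label="Cumulative Proportion";
run;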

Does the definition of sample quantiles matter?

As shown above, different software packages use different defaults for sample quantiles. Consequently, when you report quantiles for a small data set, it is important to report how the quantiles were computed.

However, in practice analysts don't worry too much about which definition they are using because the difference between methods is typically small for larger data sets (100 or more observations). The biggest differences are often between the discrete methods, which always report a data value or the average between two adjacent data values, and the interpolation methods, which can return any value in the range of the data. Extreme quantiles can also differ between the methods because the tails of the data often have fewer observations and wider gaps.

The following graph shows the sample quantiles for 100 observations that were generated from a random uniform distribution. As before, the two sample quantiles are type=2 (the SAS default) and type=7 (the R default). At this scale, you can barely detect any differences between the estimates. The red dots (type=7) are on top of the corresponding blue dots (type=2), so few blue dots are visible.

Comparison of the default quantile estimates in SAS and R on a larger data set

So does the definition of the sample quantile matter? Yes and no. Theoretically, the different methods compute different estimates and have different properties. If you want to use an estimator that is unbiased or one that is based on distribution-free computations, feel free to read Hyndman and Fan and choose the definition that suits your needs. The differences are evident for tiny data sets. On the other hand, the previous graph shows that there is little difference between the methods for moderately sized samples and for quantiles that are not near gaps. In practice, most data analysts just accept the default method for whichever software they are using.

In closing, I will mention that there are other quantile estimation methods that are not simple formulas. In SAS, the QUANTREG procedure solves a minimization problem to estimate the quantiles. The QUANTREG procedure enables you to not only estimate quantiles, but also estimate confidence intervals, weighted quantiles, the difference between quantiles, conditional quantiles, and more.
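If you have never used it, the simplest possible call is an intercept-only model; the quantiles of the response are then the "predicted" values. This is a sketch, and the data set and variable names are placeholders:

/* hypothetical data set MyData with a numeric variable Y */
proc quantreg data=MyData ci=resampling;
   model Y = / quantile=0.5 0.9;   /* intercept-only model: estimates the 0.5 and 0.9 quantiles of Y */
run;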

SAS program to compute nine sample quantiles.

The post Sample quantiles: A comparison of 9 definitions appeared first on The DO Loop.

May 22, 2017
 

In last week's article about the Flint water crisis, I computed the 90th percentile of a small data set. Although I didn't mention it, the value that I reported is different from the 90th percentile that is reported in Significance magazine.

That is not unusual. The data only had 20 unique values, and there are many different formulas that you can use to compute sample percentiles (generally called quantiles). Because different software packages use different default formulas for sample quantiles, it is not uncommon for researchers to report different quantiles for small data sets. This article discusses the five percentile definitions that are supported in SAS software.

You might wonder why there are multiple definitions. Recall that a sample quantile is an estimate of a population quantile. Statisticians have proposed many quantile estimators, some of which are based on the empirical cumulative distribution function (ECDF) of the sample, which approximates the cumulative distribution function (CDF) for the population. The ECDF is a step function that has a jump discontinuity at each unique data value. Consequently, the inverse ECDF does not exist and the quantiles are not uniquely defined.

Definitions of sample quantiles

In SAS, you can use the PCTLDEF= option in PROC UNIVARIATE or the QNTLDEF= option in other procedures to control the method used to estimate quantiles. A sample quantile does not have to be an observed data value because you are trying to estimate an unknown population parameter.

For convenience, assume that the sample data are listed in sorted order. In high school, you probably learned that if a sorted sample has an even number of observations, then the median value is the average of the two middle observations. The default quantile definition in SAS (QNTLDEF=5) extends this familiar rule to other quantiles. Specifically, if the sample size is N and you ask for the q_th quantile, then when Nq is an integer, the quantile is the average of the data values x[Nq] and x[Nq+1]. (For the median, Nq = N/2, which recovers the high-school rule.) When Nq is not an integer, the quantile is the data value x[j+1], where j = floor(Nq). For example, if N=10 and you want the q=0.17 quantile, then Nq=1.7, so j=1 and the 17th percentile is reported as the second ordered value, x[2].

The default definition is not the only choice you can make. The other percentile definitions correspond to making different choices. For example, you could round Nq to the nearest integer (QNTLDEF=2), or you could skip the averaging and always report a single data value (QNTLDEF=3). Or you could use linear interpolation (QNTLDEF=1 and QNTLDEF=4) between the data values whose (sorted) indices are closest to Nq. In the example where N=10 and q=0.17, the QNTLDEF=1 interpolated quantile is 0.3 x[1] + 0.7 x[2].
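To see two of these definitions side by side, you can use the PCTLDEF= option in PROC UNIVARIATE and request a specific percentile with the PCTLPTS= option. The following sketch uses a made-up 10-observation data set; the data values and the output data set names are mine, not from the original example.

data Tiny;                         /* hypothetical example data */
input y @@;
datalines;
2 4 7 10 15 18 21 25 30 50
;

proc univariate data=Tiny pctldef=5 noprint;   /* the default definition */
   var y;
   output out=P5 pctlpts=17 pctlpre=Def5_;     /* creates the variable Def5_17 */
run;

proc univariate data=Tiny pctldef=1 noprint;   /* linear interpolation */
   var y;
   output out=P1 pctlpts=17 pctlpre=Def1_;     /* creates the variable Def1_17 */
run;

data Both;  merge P5 P1;  run;     /* one row that contains both estimates */
proc print data=Both noobs; run;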

Visualizing the definitions for quantiles

The SAS documentation contains the formulas used for the five percentile definitions, but sometimes a visual comparison is easier than slogging through mathematical equations. The differences between the definitions are most apparent on small data sets that contain integer values, so let's create a tiny data set and apply the five definitions to it. The following example has 10 observations and six unique values.

data Q;
input x @@;
datalines;
0 1 1 1 2 2 2 4 5 8
;
ECDF of a small data set

You can use PROC UNIVARIATE or other methods to plot the empirical cumulative proportions, as shown. Because the ECDF is a step function, most cumulative proportion values (such as 0.45) are "in a gap." By this I mean that there is no observation t in the data for which the cumulative proportion P(X ≤ t) equals 0.45. Depending on how you define the sample quantiles, the 0.45 quantile might be reported as 1, 1.5, 1.95, or 2.

Since the default definition is QNTLDEF=5, let's visualize the sample quantiles for that definition. You can use the PCTLPTS= option on the OUTPUT statement in PROC UNIVARIATE to declare the percentiles that you want to compute. Equivalently, you can use the QNTL function in PROC IML, as below. Regardless, you can ask SAS to find the quantiles for a set of probabilities on a fine grid of points such as {0.001, 0.002, ..., 0.998, 0.999}. You can then graph the probabilities versus the quantiles to visualize how the percentile definition computes quantiles for the sample data.

proc iml;
use Q; read all var "x"; close;       /* read data */
prob = T(1:999) / 1000;               /* fine grid of prob values */
call qntl(quantile, x, prob, 5);      /* use method=5 definition  */
create Pctls var {"quantile" "prob" "x"}; append; close;
quit;
 
title "Sample Percentiles";
title2 "QNTLDEF = 5";
proc sgplot data=Pctls noautolegend;
   scatter x=quantile y=prob / markerattrs=(size=5 symbol=CircleFilled);
   fringe x / lineattrs=GraphData2(thickness=3);
   xaxis display=(nolabel) values=(0 to 8);
   yaxis offsetmin=0.05 grid values=(0 to 1 by 0.1) label="Cumulative Proportions";
   refline 0 1 / axis=y;
run;
Sample quantiles (percentiles) for a small data set

For each probability value (Y axis), the graph shows the corresponding sample quantile (X axis) for the default definition in SAS, which is QNTLDEF=5. The X axis also displays red tick marks at the location of the data. You can use this graph to find any quantile. For example, to find the 0.45 quantile, you start at 0.45 on the Y axis, move to the right until you hit a blue marker, and then drop down to the X axis to discover that the 0.45 quantile estimate is 2.

If you prefer to think of the quantiles (the X values) as a function of the probabilities, just interchange the X= and Y= arguments in the SCATTER statement (or turn your head sideways!). Then the quantile function is a step function.
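If you prefer the PROC UNIVARIATE route that was mentioned earlier, the following sketch computes percentiles on a grid of integer percents and rearranges them into (prob, quantile) pairs for plotting. The grid and the intermediate data set names are my choices:

proc univariate data=Q pctldef=5 noprint;
   var x;
   output out=PctlWide pctlpts=1 to 99 by 1 pctlpre=P;   /* creates P1, P2, ..., P99 */
run;

proc transpose data=PctlWide out=PctlLong(rename=(Col1=quantile)) name=PctlName;
run;

data PctlLong;
   set PctlLong;
   prob = input(substr(PctlName, 2), 8.) / 100;   /* 'P17' -> 0.17 */
run;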

Comparing all five SAS percentile definitions

It is easy to put a loop around the SAS/IML computation to compute the sample quantiles for the five different definitions that are supported in SAS. The following SAS/IML program writes a data set that contains the sample quantiles. You can use the WHERE statement in PROC PRINT to compare the same quantile across the different definitions. For example, the following displays the 0.45 quantile (45th percentile) for the five definitions:

/* Compare all SAS methods */
proc iml;
use Q; read all var "x"; close;       /* read data */
prob = T(1:999) / 1000;               /* fine grid of prob values */
create Pctls var {"Qntldef" "quantile" "prob" "x"};
do def = 1 to 5;
   call qntl(quantile, x, prob, def); /* qntldef=1,2,3,4,5 */
   Qntldef = j(nrow(prob), 1, def);   /* ID variable */
   append;
end;
close;
quit;
 
proc print data=Pctls noobs;
   where prob = 0.45;                 /* compare 0.45 quantile for different definitions */
   var Qntldef quantile;
run;

You can see that the different definitions lead to different sample quantiles. How do the quantile functions compare? Let's plot them and see:

ods graphics / antialiasmax=10000;
title "Sample Percentiles in SAS";
proc sgpanel data=Pctls noautolegend;
   panelby Qntldef / onepanel rows=2;
   scatter x=quantile y=prob/ markerattrs=(size=3 symbol=CircleFilled);
   fringe x;
   rowaxis offsetmax=0 offsetmin=0.05 grid values=(0 to 1 by 0.1) label="Cumulative Proportion";
   refline 0 1 / axis=y;
   colaxis display=(nolabel);
run;
Compare percentile definitions in SAS

The graphs (click to enlarge) show that QNTLDEF=1 and QNTLDEF=4 are piecewise-linear interpolation methods, whereas QNTLDEF=2, 3, and 5 are discrete rounding methods. The default method (QNTLDEF=5) is similar to QNTLDEF=2 except for certain averaged values. For the discrete definitions, SAS returns either a data value or the average of adjacent data values. The interpolation methods do not have that property: the methods will return quantile values that can be any value between observed data values.

If you have a small data set, as in this blog post, it is easy to see how the percentile definitions are different. For larger data sets (say, 100 or more unique values), the five quantile functions look quite similar.

The differences between definitions are most apparent when there are large gaps between adjacent data values. For example, the sample data has a large gap between the ninth and tenth observations, which have the values 5 and 8, respectively. If you compute the 0.901 quantile, you will discover that the "round to nearest" method (QNTLDEF=2) gives 5 as the sample quantile, whereas the "round up" method (QNTLDEF=3) gives the value 8. Similarly, the "backward interpolation" method (QNTLDEF=1) gives 5.03, whereas the "forward interpolation" method (QNTLDEF=4) gives 7.733.
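You can verify these values by printing the 0.901 quantile from the Pctls data set that was created earlier (a sketch that assumes the data set from the previous PROC IML step is still available):

proc print data=Pctls noobs;
   where prob = 0.901;           /* compare the 0.901 quantile across definitions */
   var Qntldef quantile;
run;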

In summary, this article shows how the (somewhat obscure) QNTLDEF= option results in different quantile estimates. Most people just accept the default definition (QNTLDEF=5), but if you are looking for a method that interpolates between data values, rather than a method that rounds and averages, I recommend QNTLDEF=1, which performs linear interpolations of the ECDF. The differences between the definitions are most apparent for small samples and when there are large gaps between adjacent data values.

Reference

For more information about sample quantiles, including a mathematical discussion of the various formulas, see
Hyndman, R. J., and Fan, Y. (1996), "Sample Quantiles in Statistical Packages," The American Statistician, 50, 361–365.

The post Quantile definitions in SAS appeared first on The DO Loop.

May 17, 2017
 

The April 2017 issue of Significance magazine features a cover story by Robert Langkjaer-Bain about the Flint (Michigan) water crisis. For those who don't know, the Flint water crisis started in 2014 when the impoverished city began using the Flint River as a source of city water. The water was not properly treated, which led to unhealthy (even toxic) levels of lead and other contaminants in the city's water supply. You can read an overview of the Flint Water Crisis on Wikipedia.

The crisis was compounded because someone excluded two data points before computing a quantile of a small data set. This seemingly small transgression had tragic ramifications. This article examines the Flint water quality data and shows why excluding those two points changed the way that the city responded. You can download the SAS program that analyzes these data.

Federal standards for detecting unsafe levels of lead

The federal Lead and Copper Rule of 1991 specifies a statistical method for determining when the concentration of lead in a water supply is too high. First, you sample from a number of "worst case" homes (such as those served by lead pipes), then compute the 90th percentile of the lead levels from those homes. If the 90th percentile exceeds 15 parts per billion (ppb), then the water is unsafe and action must be taken to correct the problem.

In spring 2015, this data collection and analysis was carried out in Flint by the Michigan Department of Environmental Quality (MDEQ), but as discussed in the Significance article, the collection process was flawed. For example, the MDEQ was supposed to collect 100 measurements, but only 71 samples were obtained, and they were not from the "worst case" homes. The 71 lead measurements that they collected are reproduced below, where I have used '0' for "not detectable." A call to PROC MEANS computes the 90th percentile (P90) of the complete sample:

/* values of lead concentration in Flint water samples.
   Use 0 for "not detectable" */
data FlintObs;
label Lead = "Lead Concentration (ppb)";
input lead @@;
Exclude = (Lead=20 | Lead=104); /* MDEQ excluded these two large values */
datalines;
0 0 0 0 0 0 0 0 0 0 0 0 0 
1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 
3 3 3 3 3 3 3 3 3 3 3 4 4 
5 5 5 5 5 5 5 5 6 6 6 6 7 7 7 
8 8 9 10 10 11 13 18 20 21 22 29 43 43 104
;
 
proc means data=FlintObs N p90; 
   var lead;
run;

According to this analysis of the full data, the 90th percentile of the sample is 18 ppb, which exceeds the federal limit of 15 ppb. Consequently, the Flint water fails the safety test, and the city must take action to improve the water.

But that is not what happened. Instead, "the MDEQ told the city's water quality supervisors to remove two of the samples" (Langkjaer-Bain, p. 19) that were over 15 ppb. The MDEQ claimed that these data were improperly collected. The two data points that were excluded have the values 20 and 104. Because these values are both higher than the 90th percentile, excluding these observations lowered the 90th percentile of the modified sample, which has 69 observations. The following call to PROC MEANS computes the 90th percentile for the modified sample:

proc means data=FlintObs N p90; 
   where Exclude=0;     /* exclude two observations */
   var lead;
run;

The second table shows that the 90th percentile of the modified sample is 13 ppb. "The exclusion of these two samples nudged the 90th percentile reading...below the all-important limit of 15 ppb." (Langkjaer-Bain, p. 20) The modified conclusion is that the city does not need to undertake expensive corrective action to render the water safe.

The consequences of keeping or excluding these outliers were huge. If kept in the analysis, officials are required to take expensive corrective action; if excluded, no action is required. In the end, the modified sample was used, and the citizens of Flint were told that the water supply had passed the safety test.

The following histogram (click to enlarge) is similar to Figure 1 in Langkjaer-Bain's article. The red bars indicate observations that the MDEQ excluded from the analysis. The graph clearly shows that the distribution of lead values has a long tail. A broken axis is used to indicate that the distance to the 104 ppb reading has been shortened to reduce horizontal space. The huge gap near 15 ppb indicates a lack of data near that important value. Therefore the quantiles near that point will be extremely sensitive to deleting extreme values. To me, the graph indicates that more data should be collected so that policy makers can be more confident in their conclusions.

Distribution of Lead Levels in Flint, Michigan (Spring 2015)
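If you want to create a similar graph yourself, one simple approach (a sketch without the broken axis, and assuming SAS 9.4M2 or later for the GROUP= option in the HISTOGRAM statement) is to color the bars by whether the MDEQ excluded the observation:

title "Lead Levels in Flint Water Samples (Spring 2015)";
proc sgplot data=FlintObs;
   histogram Lead / group=Exclude binwidth=2 transparency=0.3;
   refline 15 / axis=x lineattrs=(pattern=dash) label="15 ppb";   /* the federal action level */
   xaxis label="Lead Concentration (ppb)";
run;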

Confidence intervals for the 90th percentile

Clearly the point estimate for the 90th percentile depends on whether or not those two measurements are excluded. But can statistics provide any additional insight into the 90th percentile of lead levels in the Flint water supply? When someone reports a statistic for a small sample, I like to ask "how confident are you in that statistic?" A standard error is one way to express the accuracy of a point estimate; a 95% confidence interval (CI) is another. The width of a confidence interval gives us information about the accuracy of the point estimate. As you might expect, the standard error for an extreme quantile (such as 0.9) is typically much bigger than for a quantile near 0.5, especially when there isn't much data near the quantile.

Let's use SAS procedures to construct a 95% CI for the 90th percentile. PROC UNIVARIATE supports the CIPCTLDF option, which produces distribution-free confidence intervals. I'll give the Flint officials the benefit of the doubt and compute a confidence interval for the modified data that excluded the two "outliers":

proc univariate data=FlintObs CIPctlDF;
   where Exclude=0;     /* exclude two observations */
   var Lead;
   ods select quantiles;
run;

The 95% confidence interval for P90 is [8, 43], which is very wide and includes the critical value 15 ppb in its interior. If someone asks, "how confident are you that the 90th percentile does not exceed 15 ppb," you should respond, "based on these data, I am not confident at all."

Statistical tests for the 90th percentile

As I've written before, you can also use the QUANTREG procedure in SAS to provide an alternative method to compute confidence intervals for percentiles. Furthermore, the QUANTREG procedure supports the ESTIMATE statement, which you can use to perform a one-sided test for the hypothesis "does the 90th percentile exceed 15 ppb?" The following call to PROC QUANTREG performs this analysis and uses 10,000 bootstrap resamples to estimate the confidence interval:

proc quantreg data=FlintObs CI=resampling(nrep=10000);
   where Exclude=0;     /* exclude two observations */
   model lead = / quantile=0.9 seed=12345;
   estimate 'P90 > 15' Intercept 1 / upper CL testvalue=15;
   ods select ParameterEstimates Estimates;
run;

The standard error for the 90th percentile is about 5.3. Based on bootstrap resampling methods, the 95% CI for P90 is approximately [2.4, 23.6]. (Other methods for estimating the CI give similar or wider intervals.) The ESTIMATE statement is used to test the null hypothesis that P90 is greater than 15. The p-value is large, which means that even if you delete two large lead measurements, the data do not provide evidence to reject the null hypothesis.

Conclusions

There is not enough evidence to reject the hypothesis that P90 is greater than the legal limit of 15 ppb. Two different 95% confidence intervals for P90 include 15 ppb in their interiors.

In fact, the confidence intervals include 15 ppb whether you use all 71 observations or just the 69 observations that MDEQ used. So you can argue about whether the MDEQ should have excluded the controversial measurements, but the hypothesis test gives the same conclusion regardless. By using these data, you cannot rule out the possibility that the 90th percentile of the Flint water supply is greater than 15 ppb.

What do you have to say? Share your comments.

The post Quantiles and the Flint water crisis appeared first on The DO Loop.

May 8, 2017
 

Quick! What is the next term in the numerical sequence 1, 2, 1, 2, 3, 4, 5, 4, 3, 4, ...? If you said '3', then you must be an American history expert, because that sequence represents the number of living US presidents beginning with Washington's inauguration on 30APR1789 and continuing until the death of James Monroe on 04JUL1831.

I stumbled upon the Wikipedia page for "Living Presidents of the United States," which contains a table that shows the dates of inaugurations and deaths of US presidents and the number of living presidents between events. The table is very crowded and I found it difficult to visualize the data, so I decided to create a SAS graph that shows the timeline for the number of living US presidents.

All the information in the complex Wikipedia table can be derived by knowing the dates on which a new president was inaugurated or a previous president died. You can create a simple data set with that information and use SAS to calculate other information, such as the number of living presidents or the time span between events. You can download the data and the SAS program that creates the following graph. (Click to enlarge.) If you create the graph in SAS, you can hover the mouse over a marker to see which president died or was inaugurated for each event.

Timeline of living US presidents
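The heart of the computation is a running sum: add 1 for each president's first inauguration and subtract 1 for each death. The following is a minimal sketch of that idea, not the downloadable program; the Events data set and its variables Date, Name, and Event (coded 'Inaug' or 'Death') are assumptions.

data Timeline;
   set Events;                    /* hypothetical: Date (a SAS date), Name, Event ('Inaug' or 'Death') */
   retain NumAlive 0;
   if Event = 'Inaug' then NumAlive + 1;         /* count each president only at his first inauguration */
   else if Event = 'Death' then NumAlive + (-1);
   format Date date9.;
run;

proc sgplot data=Timeline;
   step x=Date y=NumAlive;        /* step plot: the count is constant between events */
   xaxis label="Date";
   yaxis label="Number of Living US Presidents" integer;
run;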

A few interesting facts that are revealed by the visualization of these events (all statistics as of 08MAY2017):

  • The time-weighted average since Washington's inauguration is 3.43 living presidents per day.
  • The period from 1989 to 2017 featured a larger-than-usual number of living presidents. Unfortunately, I don't expect that trend to last, since Presidents Carter and G. H. W. Bush are both very old.
  • There have been six brief intervals when the only living president was the sitting president.
  • On 04JUL1826, the US lost two former presidents (John Adams and Thomas Jefferson) within 6 hours. You can see the graph drop by 2 in 1826. Interestingly, James Monroe also died on Independence Day!
  • When a president dies in office, the number of living presidents does not change, since the vice-president is inaugurated that same day. Can you spot the eight times that a president died in office?
  • No presidents died during F. D. Roosevelt's administration—except FDR himself. The 20-year period from 1933-1953 is the longest span during which the number of living presidents stayed constant.
  • The reason no one died during FDR's terms was that Herbert Hoover remained alive. Hoover had the record of being the ex-president who lived longest after his inauguration: from 1929 until 1964, which is 13,014 days or 35.65 years! However, Jimmy Carter broke that record a few years ago. Carter was inaugurated in 1977 and has lived more than 40 years since that event.

Do you notice any other interesting features of this timeline? Leave a comment.

The post Timeline of living US presidents appeared first on The DO Loop.

May 3, 2017
 

If a financial analyst says it is "likely" that a company will be profitable next year, what probability would you ascribe to that statement? If an intelligence report claims that there is "little chance" of a terrorist attack against an embassy, should the ambassador interpret this as a one-in-a-hundred chance, a one-in-ten chance, or some other value?

Analysts often use vague statements like "probably" or "chances are slight" to convey their beliefs that a future event will or will not occur. Government officials and policy-makers who read reports from analysts must interpret and act on these vague statements. If the reader of a report interprets a phrase differently from what the writer intended, that can lead to bad decisions.

Assigning probabilities to statements

Original box plot: Distribution of probabilities for word phrases

In the book Psychology of Intelligence Analysis (Heuer, 1999), the author presents "the results of an experiment with 23 NATO military officers accustomed to reading intelligence reports. They were given a number of sentences such as: "It is highly unlikely that ...." All the sentences were the same except that the verbal expressions of probability changed. The officers were asked what percentage probability they would attribute to each statement if they read it in an intelligence report."

The results are summarized in the adjacent dot plot from Heuer (Chapter 12), which summarizes how the officers assess the probability of various statements. The graph includes a gray box for some statements. The box is not a statistical box plot. Rather it indicates the probability range according to a nomenclature proposed by Kent (1964), who tried to get the intelligence community to agree that certain phrases would be associated with certain probability ranges.

For some statements (such as "better than even" and "almost no chance") there was general agreement among the officers. For others, there was large variability in the probability estimates. For example, many officers interpreted "probable" as approximately a 75% chance, but quite a few interpreted it as less than 50% chance.

A modern re-visualization

Zonination's box plot: Distribution of probabilities for word phrases

The results of this experiment are interesting on many levels, but I am going to focus on the visualization of the data. I do not have access to the original data, but this experiment was repeated in 2015 when the user "Zonination" got 46 users on Reddit (who were not military experts) to assign probabilities to the statements. His visualization of the resulting data won a 2015 Kantar Information is Beautiful Award. The visualization uses box plots to show the schematic distribution and overlays the 46 individual estimates by using a jittered, semi-transparent, scatter plot. The Zonination plot is shown at the right (click to enlarge). Notice that the "boxes" in this second graph are determined by quantiles of the data, whereas in the first graph they were theoretical ranges.

Creating the graph in SAS

I decided to remake Zonination's plot by using PROC SGPLOT in SAS. I made several modifications that improve the readability and clarity of the plot.

  • I sorted the categories by the median probability. The median is a robust estimate of the "consensus probability" for each statement. The sorted categories indicate the relative order of the statements in terms of perceived likelihood. For example, an "unlikely" event is generally perceived as more probable than an event that has "little chance." For details about sorting the variables in SAS, see my article about how to sort variables by a statistic.
  • I removed the colors. Zonination's rainbow-colored chart is aesthetically pleasing, but the colors do not add any new information about the data. However, the colors help the eye track horizontally across the graph, so I used alternating bands to visually differentiate adjacent categories. You can create color bands by using the COLORBANDS= option in the YAXIS statement.
  • To reduce overplotting of markers, I used systematic jittering instead of random jittering. In random jittering, each vertical position is randomly offset. In systematic (centered) jittering, the markers are arranged so that they are centered on the "spine" of the box plot. Vertical positions are changed only when the markers would otherwise overlap. You can use the JITTER option in the SCATTER statement to systematically jitter marker positions.
  • Zonination's plot displays some markers twice, which I find confusing. Outliers are displayed once by the box plot and a second time by the jittered scatter plot. In my version, I suppress the display of outliers by the box plot by using the NOOUTLIERS option in the HBOX statement.

You can download the SAS code that creates the data, sorts the variables by median, and creates the plot. The following call to PROC SGPLOT shows the HBOX and SCATTER statements that create the plot:

title "Perceptions of Probability";
proc sgplot data=Long noautolegend;
   hbox _Value_ / category=_Label_ nooutliers nomean nocaps;  
   scatter x=_Value_ y=_Label_ / jitter transparency=0.5
                     markerattrs=GraphData2(symbol=circlefilled size=4);
   yaxis reverse discreteorder=data labelpos=top labelattrs=(weight=bold)
                     colorbands=even colorbandsattrs=(color=gray transparency=0.9)
                     offsetmin=0.0294 offsetmax=0.0294; /* half of 1/k, where k=number of categories */
   xaxis grid values=(0 to 100 by 10);
   label _Value_ = "Assigned Probability (%)" _label_="Statement";
run;
SAS box plot: Distribution of probabilities for word phrases

The graph indicates that some responders either didn't understand the task or intentionally gave ridiculous answers. Of the 17 categories, nine contain extreme outliers, such as assigning certainty (100%) to the phrases "probably not," "we doubt," and "little chance." However, the extreme outliers do not affect the statistical conclusions about the distribution of probabilities because box plots (which use quartiles) are robust to outliers.

The SAS graph, which uses systematic jittering, reveals a fact about the data that was hidden in the graphs that used random jittering. Namely, most of the data values are multiples of 5%. Although a few people responded with values such as 88.7%, 1%, or 3%, most values (about 80%) are rounded to the nearest 5%. For the phrases "likely" and "we believe," 44 of 46 responses (96%) were multiples of 5%. In contrast, for the phrase "almost no chance," only 18 of 46 responses (39%) were multiples of 5%, because many responses were 1%, 2%, or 3%.
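The rounding claim is easy to check from the Long data set that feeds the plot. This is a sketch that assumes the _Value_ and _Label_ variables from the earlier code:

data Rounded;
   set Long;                               /* the data set used for the plot above */
   IsMult5 = (mod(_Value_, 5) = 0);        /* 1 if the assigned probability is a multiple of 5% */
run;

proc freq data=Rounded;
   tables _Label_ * IsMult5 / nopercent norow nocol;   /* counts by phrase */
run;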

Like the military officers in the original study, there is considerable variation in the way that the Reddit users assign a probability to certain phrases. It is interesting that some phrases (for example, "We believe," "Likely," and "Probable") have the same median value but wildly different interquartile ranges. For clarity, speakers/writers should use phrases that have small variation or (even better!) provide their own assessment of probability.

Does something about this perception study surprise you? Do you have an opinion about the best way to visualize these data? Leave a comment.

The post Perceptions of probability appeared first on The DO Loop.

May 1, 2017
 
Split data into groups that have the same mean and variance

A frequently asked question on SAS discussion forums concerns randomly assigning units (often patients in a study) to various experimental groups so that each group has approximately the same number of units. This basic problem is easily solved in SAS by using PROC SURVEYSELECT or a DATA step program.

A more complex problem is when the researcher wants the distribution of a covariate to be approximately equal for each group. For example, a medical researcher might want each group of patients to have approximately the same mean and variance for their cholesterol levels, as shown in the plot to the right. This second problem is much harder because it involves distributing the units (non-randomly) so that the groups satisfy some objective. Conceptually, you want an assignment that minimizes the difference between moments (mean, variance, skewness,...) of the subgroups.

The OPTEX procedure for optimal design

Creating an experimental design that optimizes some criterion is one of the many uses of the OPTEX procedure in SAS/QC software. In fact, this problem is one of the examples in the PROC OPTEX documentation. The example in this blog post follows the documentation example; see the documentation for details.

To solve this assignment problem with PROC OPTEX, you need to specify three things:

  • A data set (call it UNITS) that contains the individual units (often patients) that you want to assign to the treatments. For this example, the units are the living patients in the Sashelp.Heart data set.
  • A data set (call it TREATMENTS) that contains the names of the treatment groups. This example uses five groups with values 1, 2, 3, 4, and 5.
  • The covariate in the problem. The OPTEX procedure will assign units to groups so that the first k moments are approximately equal across treatment groups. This example uses the Cholesterol variable and k=2, which means that the mean and variance of the cholesterol levels for patients in each treatment group will be approximately equal.

The following statements create the two data sets and define a macro variable that contains the name of the Cholesterol variable:

/* Split the living Sashelp.heart patients into five groups so that 
   mean and variance of cholesterol is the same across groups. */
data Units;                   /* each row is a unit to be assigned */
set Sashelp.Heart(where=(status = "Alive"));
keep Sex AgeAtStart Height Weight Diastolic Systolic MRW Cholesterol;
run;
 
%let NumGroups =5;           /* number of treatment groups */
data Treatments;
do Trt = 1 to &NumGroups;    /* Trt is variable that assigns patients to groups */
   output;
end;
run;
 
%let Var = Cholesterol;      /* name of covariate */

As discussed in the PROC OPTEX documentation, the following call creates an output data set (named GROUPS) that assigns each patient to a group (1–5). A call to PROC MEANS displays the mean and variance of each group.

proc optex data=Treatments seed=97531 coding=orthcan;
   class Trt;
   model Trt;              /* specify treatment model */
   blocks design=Units;    /* specify units */
   model &Var &Var*&Var;   /* fixed covariates: &Var--> mean, &Var*&Var--> variance */
   output out=Groups;      /* merged data: units assigned to groups */
run;
 
proc means data=Groups mean std;
  class Trt;
  var &Var;
run;

Success! The table shows that each group has approximately the same size and approximately the same mean and variance of the Cholesterol variable. Remember, though, that this assignment scheme is not random, so be careful not to make unjustified inferences from estimates based on these group assignments.

I am not an expert on experimental design, but a knowledgeable colleague tells me that optimal design theory states that the design that minimizes the variance of the treatment effects (adjusting for the first two moments of the covariate) is the design in which treatment means and variances of the covariate are as equal as possible. This is the design that PROC OPTEX has produced.

The following call to PROC SGPLOT uses box plots to visualize the distribution of the Cholesterol variable across treatment groups. The graph is shown at the top of this article. Clearly the distribution of cholesterol is very similar for each group.

proc sgplot data=Groups;
   vbox &Var / category=Trt;
run;

In summary, the OPTEX procedure in SAS/QC software enables you to assign units to groups, where each group has approximately the same distribution of a specified covariate. In this article, the covariate measured cholesterol levels in patients, but you can also group individuals according to income, scholastic aptitude, and so on. For large data sets, the assignment problem is a challenging optimization problem, but PROC OPTEX provides a concise syntax and solves this problem efficiently.

The post Split data into groups that have the same mean and variance appeared first on The DO Loop.

April 26, 2017
 

Most SAS regression procedures support a CLASS statement which internally generates dummy variables for categorical variables. I have previously described what dummy variables are and how they are used. I have also written about how to create design matrices that contain dummy variables in SAS, and in particular how to use different parameterizations: GLM, reference, effect, and so forth.

It occurs to me that you can visualize the structure of a design matrix by using the same technique (heat maps) that I used to visualize missing value structures. In a design matrix, each categorical variable is replaced by several dummy variables. However, there are multiple parameterizations or encodings that result in different design matrices.

Heat maps of design matrices: GLM parameterization

Heat maps require several pixels for each row and column of the design matrix, so they are limited to small or moderate sized data. The following SAS DATA step extracts the first 150 observations from the Sashelp.Heart data set and renames some variables. It also adds a fake response variable because the regression procedures that generate design matrices (GLMMOD, LOGISTIC, GLMSELECT, TRANSREG, and GLIMMIX) require a response variable even though the goal is to create a design matrix for the explanatory variables. In the following statements, the OUTDESIGN option of the GLMSELECT procedure generates the design matrix. The matrix is then read into PROC IML where the HEATMAPDISC subroutine creates a discrete heat map.

/* add fake response variable; for convenience, shorten variable names */
data Temp / view=Temp;
   set Sashelp.heart(obs=150
                keep=BP_Status Chol_Status Smoking_Status Weight_Status);
   rename BP_Status=BP Chol_Status=Chol 
          Smoking_Status=Smoking Weight_Status=Weight;
   FakeY = 0;
run;
 
ods exclude all;  /* use OUTDESIGN= option to write the design matrix to a data set */
proc glmselect data=Temp outdesign(fullmodel)=Design(drop=FakeY);
   class BP Chol Smoking Weight / param=GLM;
   model FakeY = BP Chol Smoking Weight;
run;
ods exclude none;
 
ods graphics / width=500px height=800px;
proc iml;  /* use HEATMAPDISC call to create heat map of design */
use Design;  read all var _NUM_ into X[c=varNames];  close;
run HeatmapDisc(X) title="GLM Design Matrix"
     xvalues=varNames displayoutlines=0 colorramp={"White" "Black"};
QUIT;
Design matrix for the GLM parameterization in SAS

Click on the heat map to enlarge it. Each row of the design matrix indicates a patient in a research study. If any explanatory variable has a missing value, the corresponding row of the design matrix is missing (shown as gray). In the design matrix for the GLM parameterization, a categorical variable with k levels is represented by k columns. The black and white heat map shows the structure of the design matrix. Black indicates a 1 and white indicates a 0. In particular:

  • This first column is all black, which indicates the intercept column.
  • Columns 2-4 represent the BP variable. Each row has one black rectangle in one of those columns. You can see that there are few black squares in column 4, which indicates that few patients in the study have optimal blood pressure.
  • In a similar way, you can see that there are many nonsmokers (column 11) in the study. There are also many overweight patients (column 14) and few underweight patients (column 15).

The GLM parameterization is called a "singular parameterization" because it contains redundant columns. For example, the BP_Optimal column is redundant because that column contains a 1 only when the BP_High and BP_Normal columns are both 0. Similarly, if either the BP_High or the BP_Normal column is 1, then BP_Optimal is automatically 0. The next section removes the redundant columns.
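Before moving on, you can verify the redundancy numerically. Because the GLM dummy columns for a variable sum to the intercept column, the following sketch (which assumes the Design data set and the BP_* column names generated above) checks that each complete row of the BP columns sums to 1:

proc iml;
use Design;
   /* the BP_* names are the dummy-variable names generated by PROC GLMSELECT above */
   read all var {"BP_High" "BP_Normal" "BP_Optimal"} into BP
        where(BP_High ^= .);               /* skip rows of the design that are missing */
close;
rowSums = BP[ ,+];                         /* sum across the three dummy columns */
print (all(rowSums = 1))[label="Do the BP dummy columns always sum to 1?"];
quit;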

Heat maps of design matrices: Reference parameterization

There is a binary design matrix that contains only the independent columns of the GLM design matrix. It is called a reference parameterization and you can generate it by using PARAM=REF in the CLASS statement, as follows:

ods exclude all;  /* use OUTDESIGN= option to write the design matrix to a data set */
proc glmselect data=Temp outdesign(fullmodel)=Design(drop=FakeY);
   class BP Chol Smoking Weight / param=REF;
   model FakeY = BP Chol Smoking Weight;
run;
ods exclude none;
Design matrix for the REFERENCE parameterization in SAS

Again, you can use the HEATMAPDISC call in PROC IML to create the heat map. The matrix is similar, but categorical variables that have k levels are represented by k–1 dummy variables. Because the reference level was not specified in the CLASS statement, the last level of each category is used as the reference level. Thus the REFERENCE design matrix is similar to the GLM design, except that the last column for each categorical variable has been dropped. For example, there are columns for BP_High and BP_Normal, but no column for BP_Optimal.
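The heat map call is the same as before; only the title changes. Here is a sketch that reuses the HEATMAPDISC call shown earlier:

proc iml;  /* use HEATMAPDISC call to create heat map of the reference design */
use Design;  read all var _NUM_ into X[c=varNames];  close;
run HeatmapDisc(X) title="Reference Design Matrix"
     xvalues=varNames displayoutlines=0 colorramp={"White" "Black"};
QUIT;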

Nonbinary designs: The EFFECT parameterization

The previous design matrices were binary 0/1 matrices. The EFFECT parameterization, which is the default parameterization for PROC LOGISTIC, creates a nonbinary design matrix. In the EFFECT parameterization, the reference level is represented by using a -1 and a nonreference level is represented by 1. Thus there are three values in the design matrix.

If you do not specify the reference levels, the last level for each categorical variable is used, just as for the REFERENCE parameterization. The following statements generate an EFFECT design matrix and use the REF= suboption to specify the reference level. Again, you can use the HEATMAPDISC subroutine to display a heat map for the design. For this visualization, light blue is used to indicate -1, white for 0, and black for 1.

ods exclude all;  /* use OUTDESIGN= option to write the design matrix to a data set */
proc glmselect data=Temp outdesign(fullmodel)=Design(drop=FakeY);
   class BP(ref='Normal') Chol(ref='Desirable') 
         Smoking(ref='Non-smoker') Weight(ref='Normal') / param=EFFECT;
   model FakeY = BP Chol Smoking Weight;
run;
ods exclude none;
 
proc iml;  /* use HEATMAPDISC call to create heat map of design */
use Design; read all var _NUM_ into X[c=varNames]; close;
run HeatmapDisc(X) title="Effect Design Matrix"
     xvalues=varNames displayoutlines=0 colorramp={"LightBlue" "White" "Black"};
QUIT;
Design matrix for the EFFECT parameterization in SAS

In the adjacent graph, blue indicates that the value for the patient was the reference category. For a nonreference category, the row contains white and black cells, and the black rectangle appears in the column that corresponds to the patient's category. For me, this design matrix takes some practice to "read." For example, compared to the GLM matrix, it is harder to determine the most frequent levels for a categorical variable.

Heat maps in Base SAS

In the example, I have used the HEATMAPDISC subroutine in SAS/IML to visualize the design matrices. But you can also create heat maps in Base SAS.

If you have SAS 9.4m3, you can use the HEATMAPPARM statement in PROC SGPLOT to create these heat maps. First you have to convert the data from wide form to long form, which you can do by using the following DATA step:

/* convert from wide (matrix) to long (row, col, value)*/
data Long;
set Design;
array dummy[*] _NUMERIC_;
do varNum = 1 to dim(dummy);
   rowNum = _N_;
   value = dummy[varNum];
   output;
end;
keep varNum rowNum value;
run;
 
proc sgplot data=Long;
/* the observation values are in the order {1, 0, -1}; use STYLEATTRIBS to set colors */
styleattrs  DATACOLORS=(Black White LightBlue);
heatmapparm x=varNum y=rowNum colorgroup=value / showxbins discretex;
xaxis type=discrete; /* values=(1 to 11)  valuesdisplay=("A" "B" ... "J" "K"); */
yaxis reverse;
run;

The heat map is similar to the one in the previous section, except that the columns are labeled 1, 2, 3, and so forth. If you want the columns to contain the variable names, use the VALUESDISPLAY= option, as shown in the comments.
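One way to supply the names (a sketch; it assumes the design data set is WORK.DESIGN, and the macro variable names are mine) is to query the column names with PROC SQL and build the VALUESDISPLAY= list:

proc sql noprint;
   select quote(strip(name)) into :displayNames separated by ' '
      from dictionary.columns
      where libname='WORK' and memname='DESIGN' and type='num'
      order by varnum;                        /* keep the original column order */
   select count(*) into :numCols trimmed
      from dictionary.columns
      where libname='WORK' and memname='DESIGN' and type='num';
quit;

proc sgplot data=Long;
   styleattrs datacolors=(Black White LightBlue);
   heatmapparm x=varNum y=rowNum colorgroup=value / showxbins discretex;
   xaxis type=discrete values=(1 to &numCols) valuesdisplay=(&displayNames) fitpolicy=rotate;
   yaxis reverse;
run;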

If you are running an earlier version of SAS, you will need to use the Graph Template Language (GTL) to create a template for the discrete heat maps.

In summary, you can use the OUTDESIGN= option in PROC GLMSELECT to create design matrices that use dummy variables to encode classification variables. If you have SAS/IML, you can use the HEATMAPDISC subroutine to visualize the design matrix. Otherwise, you can use the HEATMAPPARM statement in PROC SGPLOT (SAS 9.4m3) or the GTL to create the heat maps. The visualization is useful for teaching and understanding the different parameterizations schemes for classification variables.

The post Visualize a design matrix appeared first on The DO Loop.

April 24, 2017
 

There are several ways to visualize data in a two-way ANOVA model. Most visualizations show a statistical summary of the response variable for each category. However, for small data sets, it can be useful to overlay the raw data. This article shows a simple trick that you can use to combine two categorical variables and plot the raw data for the joint levels of the two categorical variables.

An ANOVA for two-way interactions

Recall that an ANOVA (ANalysis Of VAriance) model is used to understand differences among group means and the variation among and between groups. The documentation for the ROBUSTREG procedure in SAS/STAT contains an example that compares the traditional ANOVA using PROC GLM with a robust ANOVA that uses PROC ROBUSTREG. The response variable is the survival time (Time) for 16 mice who were randomly assigned to different combinations of two successive treatments (T1, T2). (Higher times are better.) The data are shown below:

data recover;
input  T1 $ T2 $ Time @@;
datalines;
0 0 20.2  0 0 23.9  0 0 21.9  0 0 42.4
1 0 27.2  1 0 34.0  1 0 27.4  1 0 28.5
0 1 25.9  0 1 34.5  0 1 25.1  0 1 34.2
1 1 35.0  1 1 33.9  1 1 38.3  1 1 39.9
;

The response variable depends on the joint levels of the binary variables T1 and T2. A first attempt to visualize the data in SAS might be to create a box plot of the four combinations of T1 and T2. You can do this by assigning T1 to be the "category" variable and T2 to be a "group" variable in a clustered box plot, as follows:

title "Response for Two Groups";
title2 "Use VBOX Statement with Categories and Groups";
proc sgplot data=recover;
   vbox Time / category=T1 group=T2;
run;
Box plots for a binary 'category' variable and a binary 'group' variable

The graph shows the distribution of response for the four joint combinations of T1 and T2. The graph is a little hard to interpret because the category levels are 0/1. The two box plots on the left are for T1=0, which means "Did not receive the T1 treatment." The two box plots on the right are for mice who received the T1 treatment. Within those clusters, the blue boxes indicate the distribution of responses for the mice who did not receive the T2 treatment, whereas the red boxes indicate the response distribution for mice that did receive T2. Both treatments seem to increase the mean survival time for mice, and receiving both treatments seems to give the highest survival times.

Interpreting the graph took a little thought. Also, the colors seem somewhat arbitrary. I think the graph could be improved if the category labels indicate the joint levels. In other words, I'd prefer to see a box plot of the levels of interaction variable T1*T2. If possible, I'd also like to optionally plot the raw response values.

Method 1: Use the EFFECTPLOT statement

The LOGISTIC and GENMOD procedures in SAS/STAT support the EFFECTPLOT statement. Many other SAS regression procedures support the STORE statement, which enables you to save a regression model and then use the PLM procedure (which supports the EFFECTPLOT statement). The EFFECTPLOT statement can create a variety of plots for visualizing regression models, including a box plot of the joint levels for two categorical variables, as shown by the following statements:

/* Use the EFFECTPLOT statement in PROC GENMOD, or use the STORE statement and PROC PLM */
proc genmod data=recover;
   class T1 T2;
   model Time = T1 T2 T1*T2;
   effectplot box / cluster;
   effectplot interaction /  obs(jitter);  /* or use interaction plot to see raw data */
run;
Box plots of joint levels created by the EFFECTPLOT statement in SAS

The resulting graph uses box plots to show the schematic distribution of each of the joint levels of the two categorical variables. (The second EFFECTPLOT statement creates an "interaction plot" that shows the raw values and mean responses.) The means of each group are connected, which makes it easier to compare adjacent means. The labels indicate the levels of the T1*T2 interaction variable. I think this graph is an improvement over the previous multi-colored box plot, and I find it easier to read and interpret.

Although the EFFECTPLOT statement makes it easy to create this plot, the EFFECTPLOT statement does not support overlaying raw values on the box plots. (You can, however, see the raw values on the "interaction plot".) The next section shows an alternative way to create the box plots.

Method 2: Concatenate values to form joint levels of categories

You can explicitly form the interaction variable (T1*T2) by using the CATX function to concatenate the T1 and T2 variables, as shown in the following DATA step view. Because the levels are binary-encoded, the resulting levels are '0 0', '0 1', '1 0', and '1 1'. You can define a SAS format to make the joint levels more readable. You can then display the box plots for the interaction variable and, optionally, overlay the raw values:

data recover2 / view=recover2;
length Treatment $3;          /* specify length of concatenated variable */
set recover;
Treatment = catx(' ',T1,T2);  /* combine into one group */
run;
 
proc format;                  /* make the joint levels more readable */
  value $ TreatFmt '0 0' = 'Control'
                   '1 0' = 'T1 Only'
                   '0 1' = 'T2 Only'
                   '1 1' = 'T1 and T2';
run;
 
proc sgplot data=recover2 noautolegend;
   format Treatment $TreatFmt.;
   vbox Time / category=Treatment;
   scatter x=Treatment y=Time / jitter markerattrs=(symbol=CircleFilled size=10);
   xaxis discreteorder=data;
run;
Distribution of response variable in two-way ANOVA: box plots and raw data overlaid

By manually concatenating the two categorical variables to form a new interaction variable, you have complete control over the plot. You can also overlay the raw data, as shown. The raw data indicates that the "Control" group seems to contain an outlier: a mouse who lived longer than would be expected for his treatment. Using PROC ROBUSTREG to compute a robust ANOVA is one way to deal with extreme outliers in the ANOVA setting.
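For completeness, a robust two-way ANOVA of these data requires only a small change to the earlier model statement. This is a sketch; see the ROBUSTREG documentation for the estimation options:

proc robustreg data=recover method=M;        /* M estimation downweights the outlier */
   class T1 T2;
   model Time = T1 T2 T1*T2 / diagnostics;   /* flag observations with large residuals */
run;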

In summary, the EFFECTPLOT statement enables you to quickly create box plots that show the response distribution for joint levels of two categorical variables. However, sometimes you might want more control, such as the ability to format the labels or overlay the raw data. This article shows how to use the CATX function to manually create a new variable that contains the joint categories.

The post Visualize an ANOVA with two-way interactions appeared first on The DO Loop.

4月 242017
 

There are several ways to visualize data in a two-way ANOVA model. Most visualizations show a statistical summary of the response variable for each category. However, for small data sets, it can be useful to overlay the raw data. This article shows a simple trick that you can use to combine two categorical variables and plot the raw data for the joint levels of the two categorical variables.

An ANOVA for two-way interactions

Recall that an ANOVA (ANalysis Of VAriance) model is used to understand differences among group means and the variation among and between groups. The documentation for the ROBUSTREG procedure in SAS/STAT contains an example that compares the traditional ANOVA using PROC GLM with a robust ANOVA that uses PROC ROBUSTREG. The response variable is the survival time (Time) for 16 mice who were randomly assigned to different combinations of two successive treatments (T1, T2). (Higher times are better.) The data are shown below:

data recover;
input  T1 $ T2 $ Time @@;
datalines;
0 0 20.2  0 0 23.9  0 0 21.9  0 0 42.4
1 0 27.2  1 0 34.0  1 0 27.4  1 0 28.5
0 1 25.9  0 1 34.5  0 1 25.1  0 1 34.2
1 1 35.0  1 1 33.9  1 1 38.3  1 1 39.9
;

The response variable depends on the joint levels of the binary variables T1 and T2. A first attempt to visualize the data in SAS might be to create a box plot of the four combinations of T1 and T2. You can do this by assigning T1 to be the "category" variable and T2 to be a "group" variable in a clustered box plot, as follows:

title "Response for Two Groups";
title2 "Use VBOX Statement with Categories and Groups";
proc sgplot data=recover;
   vbox Time / category=T1 group=T2;
run;
Box plots for a binary 'category' variable and a binary 'group' variable

The graph shows the distribution of response for the four joint combinations of T1 and T2. The graph is a little hard to interpret because the category levels are 0/1. The two box plots on the left are for T1=0, which means "Did not receive the T1 treatment." The two box plots on the right are for mice who received the T1 treatment. Within those clusters, the blue boxes indicate the distribution of responses for the mice who did not receive the T2 treatment, whereas the red boxes indicate the response distribution for mice that did receive T2. Both treatments seem to increase the mean survival time for mice, and receiving both treatments seems to give the highest survival times.

Interpreting the graph took a little thought. Also, the colors seem somewhat arbitrary. I think the graph could be improved if the category labels indicate the joint levels. In other words, I'd prefer to see a box plot of the levels of interaction variable T1*T2. If possible, I'd also like to optionally plot the raw response values.

Method 1: Use the EFFECTPLOT statement

The LOGISTIC and GENMOD procedures in SAS/STAT support the EFFECTPLOT statement. Many other SAS regression procedures support the STORE statement, which enables you to save a regression model and then use the PLM procedure (which supports the EFFECTPLOT statement). The EFFECTPLOT statement can create a variety of plots for visualizing regression models, including a box plot of the joint levels for two categorical variables, as shown by the following statements:

/* Use the EFFECTPLOT statement in PROC GENMOD, or use the STORE statement and PROC PLM */
proc genmod data=recover;
   class T1 T2;
   model Time = T1 T2 T1*T2;
   effectplot box / cluster;
   effectplot interaction /  obs(jitter);  /* or use interaction plot to see raw data */
run;
Box plots of joint levels created by the EFFECTPLOT statement in SAS

The resulting graph uses box plots to show the schematic distribution of each of the joint levels of the two categorical variables. (The second EFFECTPLOT statement creates an "interaction plot" that shows the raw values and mean responses.) The means of each group are connected, which makes it easier to compare adjacent means. The labels indicate the levels of the T1*T2 interaction variable. I think this graph is an improvement over the previous multi-colored box plot, and I find it easier to read and interpret.
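
The same plot can be produced from any procedure that supports the STORE statement. The following minimal sketch uses PROC GLM (which supports the STORE statement) and an arbitrary item-store name; PROC PLM then restores the model and runs the EFFECTPLOT statement:

/* Sketch: the STORE statement plus PROC PLM route mentioned above.
   PROC GLM supports the STORE statement; the item-store name is arbitrary. */
proc glm data=recover;
   class T1 T2;
   model Time = T1 T2 T1*T2;
   store ANOVAStore;              /* save the fitted model as an item store */
quit;
 
proc plm restore=ANOVAStore;      /* PROC PLM supports the EFFECTPLOT statement */
   effectplot box / cluster;
run;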

Although the EFFECTPLOT statement makes it easy to create this plot, the EFFECTPLOT statement does not support overlaying raw values on the box plots. (You can, however, see the raw values on the "interaction plot".) The next section shows an alternative way to create the box plots.

Method 2: Concatenate values to form joint levels of categories

You can explicitly form the interaction variable (T1*T2) by using the CATX function to concatenate the T1 and T2 variables, as shown in the following DATA step view. Because the levels are binary-encoded, the resulting levels are '0 0', '0 1', '1 0', and '1 1'. You can define a SAS format to make the joint levels more readable. You can then display the box plots for the interaction variable and, optionally, overlay the raw values:

data recover2 / view=recover2;
length Treatment $3;          /* specify length of concatenated variable */
set recover;
Treatment = catx(' ',T1,T2);  /* combine into one group */
run;
 
proc format;                  /* make the joint levels more readable */
  value $ TreatFmt '0 0' = 'Control'
                   '1 0' = 'T1 Only'
                   '0 1' = 'T2 Only'
                   '1 1' = 'T1 and T2';
run;
 
proc sgplot data=recover2 noautolegend;
   format Treatment $TreatFmt.;
   vbox Time / category=Treatment;
   scatter x=Treatment y=Time / jitter markerattrs=(symbol=CircleFilled size=10);
   xaxis discreteorder=data;
run;
Distribution of response variable in two-way ANOVA: box plots and raw data overlaid

By manually concatenating the two categorical variables to form a new interaction variable, you have complete control over the plot. You can also overlay the raw data, as shown. The raw data indicates that the "Control" group seems to contain an outlier: a mouse who lived longer than would be expected for his treatment. Using PROC ROBUSTREG to compute a robust ANOVA is one way to deal with extreme outliers in the ANOVA setting.
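
For reference, a minimal sketch of such a robust ANOVA for these data is shown below. It uses the default M estimation; the DIAGNOSTICS option requests outlier diagnostics:

/* Sketch: robust two-way ANOVA for the same data (default M estimation) */
proc robustreg data=recover;
   class T1 T2;
   model Time = T1 T2 T1*T2 / diagnostics;   /* DIAGNOSTICS flags potential outliers */
run;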

In summary, the EFFECTPLOT statement enables you to quickly create box plots that show the response distribution for joint levels of two categorical variables. However, sometimes you might want more control, such as the ability to format the labels or overlay the raw data. This article shows how to use the CATX function to manually create a new variable that contains the joint categories.

The post Visualize an ANOVA with two-way interactions appeared first on The DO Loop.

4月 192017
 

Restricted cubic splines are a powerful technique for modeling nonlinear relationships by using linear regression models. I have attended multiple SAS Global Forum presentations that show how to use restricted cubic splines in SAS regression procedures. However, the presenters have all used the %RCSPLINE macro (Frank Harrell, 1988) to generate a SAS data set that contains new variables for the spline basis functions. They then use those basis variables in a SAS regression procedure.

Since SAS 9.3, many SAS regression procedures provide a native implementation of restricted cubic splines by using the EFFECT statement in SAS. This article provides examples of using splines in regression models. Because some older procedures (such as PROC REG) do not support the EFFECT statement, the article also shows how to use SAS procedures to generate the spline basis, just like the %RCSPLINE macro does.

If you are not familiar with splines and knots, read the overview article "Understanding splines in the EFFECT statement." You can also read the documentation for the EFFECT statement, which includes an overview of spline effects as well as a brief description of restricted cubic splines, which are also called "natural cubic splines." I think the fact that the SAS documentation refers to the restricted cubic splines as "natural cubic splines" has prevented some practitioners from realizing that SAS supports restricted cubic splines.

Regression with restricted cubic splines in SAS

This section provides an example of using splines in PROC GLMSELECT to fit a GLM regression model. Because the functionality is contained in the EFFECT statement, the syntax is the same for other procedures. For example, if you have a binary response you can use the EFFECT statement in PROC LOGISTIC. If you are fitting a generalized linear model or a mixed model, you can use PROC GLIMMIX.
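
For example, the following minimal sketch shows the same spline syntax in PROC LOGISTIC. It uses the Sashelp.Cars data (which is analyzed in the rest of this article) and constructs a binary indicator, Domestic, purely for illustration:

/* Sketch: identical EFFECT syntax in PROC LOGISTIC for a binary response.
   The Domestic indicator and the data set name are constructed only for illustration. */
data carsBin;
   set sashelp.cars;
   Domestic = (Origin = 'USA');   /* 1 if the vehicle is made in the USA, else 0 */
run;
 
proc logistic data=carsBin;
   effect spl = spline(weight / naturalcubic basis=tpf(noint) knotmethod=percentiles(5));
   model Domestic(event='1') = spl;
run;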

This example fits the MPG_CITY variable as a function of the WEIGHT variable for vehicles in the Sashelp.Cars data set. The data and a scatter plot smoother are shown in the adjacent graph. (To produce the graph, the following statements sort the data, but that is not required.) The smoother is based on a restricted spline basis with five knots. Notice that the SELECTION=NONE option in the MODEL statement turns off the variable selection features of PROC GLMSELECT, which makes the procedure fit one model just like PROC GLM.

/* create sample data; sort by independent variable for graphing */
proc sort data=sashelp.cars   
          out=cars(keep=mpg_city weight model);
by weight;
run;
 
/* Fit data by using restricted cubic splines.
   The EFFECT statement is supported by many procedures:
   GLIMMIX, GLMSELECT, LOGISTIC, PHREG, PLS, QUANTREG, ROBUSTREG, ... */
ods select ANOVA ParameterEstimates SplineKnots;
proc glmselect data=cars;
  effect spl = spline(weight / details naturalcubic basis=tpf(noint)                 
                               knotmethod=percentiles(5) );
   model mpg_city = spl / selection=none;         /* fit model by using spline effects */
   output out=SplineOut predicted=Fit;            /* output predicted values for graphing */
quit;
 
title "Restricted Cubic Spline Regression";
proc sgplot data=SplineOut noautolegend;
   scatter x=weight y=mpg_city;
   series x=weight y=Fit / lineattrs=(thickness=3 color=red);
run;

The EFFECT statement with the SPLINE option is used to generate spline effects from the WEIGHT variable. The effects are assigned the collective name 'spl'. Putting 'spl' on the MODEL statement says "use the spline effects that I created." You can specify multiple EFFECT statements. Spline effects depend on three quantities: the kind of spline, the number of knots, and the placement of the knots.

  • You specify restricted cubic splines by using the NATURALCUBIC BASIS=TPF(NOINT) options. Technically you do not need the NOINT suboption, but I recommend it because it makes the parameter estimates table simpler.
  • The KNOTMETHOD= option enables you to specify the number and placement of knots. In this example, PERCENTILES(5) places knots at five evenly spaced percentiles of the explanatory variable, which are approximately the 16.7%, 33.3%, 50%, 66.7%, and 83.3% percentiles. Equivalently, the knots are placed at the 1/6, 2/6, 3/6, 4/6, and 5/6 quantiles.

The number and placement of knots for splines

Most researchers use a small number of knots, often three to five. The exact location of knots is not usually critical: if you change the positions slightly the predicted values of the regression change only slightly. A common scheme is to place the knots at fixed percentiles of the data, as shown above. Alternatively, Harrell (Regression Modeling Strategies, 2010 and 2015) suggests heuristic percentiles as shown below, and this scheme seems to be popular among biostatisticians.

Harrell's table of knot placement for restricted cubic splines

The KNOTMETHOD= option on the EFFECT statement provides several options for specifying the locations of knots. Try each of the following options and look at the "SplineKnots" table to see the positions of the knots:

  • KNOTMETHOD=PERCENTILES(5): Places knots at the percentiles 1/6, 2/6, ..., 5/6. An example is shown above.
  • KNOTMETHOD=EQUAL(5): Places knots evenly within the range of the data. If δ = (max-min)/6, then the knots are located at min + i δ for i=1, 2, ..., 5.
  • KNOTMETHOD=RANGEFRACTIONS(0.05 0.275 0.50 0.725 0.95): If you want knots placed unevenly within the range of the data, use this option. For example, the value 0.05 means "place a knot 5% along the data range" and 0.275 means "place a knot 27.5% along the data range." You can separate list values by using spaces or commas.
  • KNOTMETHOD=LIST(2513, 3174, 3474.5, 3903, 5000): This option enables you to list the locations (in data coordinates) of the knots. You can use this option to specify Harrell's schemes. For example, Harrell suggests the 5th, 27.5th, 50th, 72.5th, and 95th percentiles when you use five knots. You can use PROC UNIVARIATE to find these percentiles for the WEIGHT variable and then type the results into the KNOTMETHOD=LIST option.
Comparison of several knot-placement schemes for restricted cubic spline regression in SAS

The adjacent graph shows the predicted values for the four different knot placement methods. (Click to enlarge.) You can see that the general shape of the regression curves is similar regardless of the placement of knots. The differences can be understood by looking at the placement of the first and last knots. The slope of the natural cubic spline fit is linear past the extreme knots, so the placement of the extreme knots dictates where the predictions become linear. For the EQUAL and RANGEFRACTIONS methods, the largest knots are placed at WEIGHT=6300 and WEIGHT=6923, respectively. Consequently, the predictions for large values of WEIGHT look more cubic than linear. In contrast, the largest knot for the PERCENTILES method is placed at 4175 and the largest LIST knot is 5000. The predictions past those values are linear.
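
To reproduce a comparison like this, refit the model several times and change only the KNOTMETHOD= option. For example, the following sketch uses evenly spaced knots and writes the predictions to an (arbitrarily named) output data set, which you can then overlay with the other fits in PROC SGPLOT:

/* Sketch: refit with evenly spaced knots; only the KNOTMETHOD= value changes */
proc glmselect data=cars;
   effect spl = spline(weight / naturalcubic basis=tpf(noint)
                                knotmethod=equal(5) );     /* evenly spaced knots */
   model mpg_city = spl / selection=none;
   output out=SplineOutEqual predicted=FitEqual;
quit;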

Automate Harrell's knot placement suggestions

In the previous section, I manually typed in the values that correspond to the 5th, 27.5th, 50th, 72.5th, and 95th percentiles of the WEIGHT variable, as suggested by Harrell's scheme. However, it is not difficult to automate that step. You can use PROC UNIVARIATE to write the percentile values to a SAS data set and then use a short DATA _NULL_ step to store those values into a macro variable. The macro variable can then be used as the argument to the KNOTMETHOD=LIST option, as follows:

proc univariate data=cars noprint;
   var weight;
   output out=percentiles pctlpts=5 27.5 50 72.5 95 pctlpre=p_;  /* specify the percentiles */
run;
 
data _null_;
set percentiles;
call symput('pctls',catx(',', of _numeric_));   /* put all values into a comma-separated list */
run;
 
%put &=pctls;                 /* optional: display the list of values */
...
   effect spl = spline(weight / ... knotmethod=list(&pctls) ); /* use the macro variable here */
...
PCTLS=2513,3174,3474.5,3903,5000

Write the spline basis functions to a SAS data set

As mentioned earlier, not every SAS procedure supports the EFFECT statement. However, the GLMSELECT, LOGISTIC, and GLIMMIX procedures all provide an OUTDESIGN= option, which enables you to output the design matrix that contains the spline basis functions. Furthermore, PROC LOGISTIC supports the OUTDESIGNONLY option and PROC GLIMMIX supports the NOFIT option so that the procedures do not fit the model but only output the design matrix.
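
For example, a minimal sketch that uses PROC GLIMMIX with the NOFIT option to write the design matrix without fitting the model might look like the following (the output data set name is arbitrary):

/* Sketch: write the spline design matrix from PROC GLIMMIX without fitting the model */
proc glimmix data=cars outdesign=GmxBasis nofit;
   effect spl = spline(weight / naturalcubic basis=tpf(noint) knotmethod=percentiles(5));
   model mpg_city = spl;
run;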

You can use the variables in the design matrix to perform regression or other analyses. For example, the following example writes the restricted cubic basis functions to a data set, then uses PROC REG to fit the model:

/* create SplineBasis = data set that contains spline basis functions */
proc glmselect data=cars outdesign(addinputvars fullmodel)=SplineBasis;
   effect spl = spline(weight / naturalcubic basis=tpf(noint) knotmethod=percentiles(5));
   model mpg_city = spl / selection=none;  
quit;
 
/* use design variables in other procedures */
proc reg data=SplineBasis;
   model mpg_city = spl_:;
   output out=out p=p;
run;

One last comment: the basis functions that are generated by the EFFECT statement are not equal to the basis functions created by the %RCSPLINE macro, but they are equivalent. The EFFECT statement uses the definition from The Elements of Statistical Learning (Hastie, Tibshirani, and Friedman, 2009, 2nd Ed., pp. 144-146), whereas the %RCSPLINE macro implements scaled versions of the same basis functions. Consequently, the parameter estimates will differ, but the predicted values will be the same.

Summary

You can use the NATURALCUBIC BASIS=TPF(NOINT) option in the EFFECT statement in SAS to perform regression with restricted cubic splines, which are also called natural cubic splines. You can use the KNOTMETHOD= option to specify the number and placement of the knots. The EFFECT statement is supported by the GLMSELECT, LOGISTIC, and GLIMMIX procedures, among others. These procedures enable you to output a design matrix that contains the spline basis functions, which you can use in procedures that do not support the EFFECT statement.

This article was inspired by several talks on this topic that I heard at SAS Global Forum.

The post Regression with restricted cubic splines in SAS appeared first on The DO Loop.