Rick Wicklin

May 24, 2017
 

According to Hyndman and Fan ("Sample Quantiles in Statistical Packages," TAS, 1996), there are nine definitions of sample quantiles that commonly appear in statistical software packages. Hyndman and Fan identify three definitions that are based on rounding and six methods that are based on linear interpolation. This blog post shows how to use SAS to visualize and compare the nine common definitions of sample quantiles. It also compares the default definitions of sample quantiles in SAS and R.

Definitions of sample quantiles

Suppose that a sample has N observations that are sorted so that x[1] ≤ x[2] ≤ ... ≤ x[N], and suppose that you are interested in estimating the p_th quantile (0 ≤ p ≤ 1) for the population. Intuitively, the data values near x[j], where j = floor(Np), are reasonable values to use to estimate the quantile. For example, if N=10 and you want to estimate the quantile for p=0.64, then j = floor(Np) = 6, so you can use the sixth ordered value (x[6]) and maybe other nearby values to estimate the quantile.

Hyndman and Fan (henceforth H&F) note that the quantile definitions in statistical software have three properties in common:

  • The value p and the sample size N are used to determine two adjacent data values, x[j] and x[j+1]. The quantile estimate will be in the closed interval between those data points. For the previous example, the quantile estimate would be in the closed interval between x[6] and x[7].
  • For many methods, a fractional quantity is used to determine an interpolation parameter, λ. For the previous example, the fractional quantity is (Np - j) = (6.4 - 6) = 0.4. If you use λ = 0.4, then an estimate of the 64th percentile would be the value 40% of the way between x[6] and x[7].
  • Each definition has a parameter m, 0 ≤ m ≤ 1, which determines how the method interpolates between adjacent data points. In general, the methods define the index j by using j = floor(Np + m). The previous example used m=0, but other choices include m=0.5 or values of m that depend on p.

Thus a general formula for quantile estimates is q = (1 - λ) x[j] + λ x[j+1], where λ and j depend on the values of p, N, and a method-specific parameter m.

You can read Hyndman and Fan (1996) for details or see the Wikipedia article about quantiles for a summary. The Wikipedia article points out a practical consideration: for values of p that are very close to 0 or 1, some definitions need to be slightly modified. For example, if p < 1/N, the quantity Np < 1 and so j = floor(Np) equals 0, which is an invalid index. The convention is to return x[1] when p is very small and return x[N] when p is very close to 1.

Compute all nine sample quantile definitions in SAS

SAS has built-in support for five of the quantile definitions, notably in PROC UNIVARIATE, PROC MEANS, and in the QNTL subroutine in SAS/IML. You can use the QNTLDEF= option to choose from the five definitions. The following table associates the five QNTLDEF= definitions in SAS to the corresponding definitions from H&F, which are also used by R. In R you choose the definition by using the type parameter in the quantile function.

SAS definitions of sample quantiles

It is straightforward to write a SAS/IML function to compute the other four definitions in H&F. In fact, H&F present the quantile interpolation functions as specific instances of one general formula that contains a parameter, which they call m. As mentioned above, you can also define a small value c (which depends on the method) such that the method returns x[1] if p < c, and the method returns x[N] if p ≥ 1 - c.
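For example, the following SAS/IML sketch evaluates the general formula directly. This is only a minimal illustration, not the downloadable program discussed later in this post: the module name GenQntl, the choice m=0, and the simple boundary guards are assumptions made for this example.

proc iml;
/* Minimal sketch: evaluate q = (1-lambda)*x[j] + lambda*x[j+1], where
   j = floor(N*p + m) and lambda is the fractional part. The module name
   and the boundary guards are illustrative, not a built-in definition. */
start GenQntl(x, p, m);
   N = nrow(x);                     /* x is assumed to be a sorted column vector */
   j = floor(N*p + m);
   if j < 1 then return( x[1] );    /* convention for p near 0 */
   if j >= N then return( x[N] );   /* convention for p near 1 */
   lambda = N*p + m - j;            /* interpolation parameter */
   return( (1-lambda)*x[j] + lambda*x[j+1] );
finish;
 
x = {0,1,1,1,2,2,2,4,5,8};          /* the 10-point sample used later in this post */
q = GenQntl(x, 0.64, 0);            /* m=0: the value 40% of the way between x[6] and x[7] */
print q;
quit;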

The following table presents the parameters for computing the four sample quantile definitions that are not natively supported in SAS:

Definitions of sample quantiles that are not natively supported in SAS

Visualizing the definitions of sample quantiles

Visualization of nine definitions of sample quantiles, from Hyndman and Fan (1996)

You can download the SAS program that shows how to compute sample quantiles and graphs for any of the nine definitions in H&F. The differences between the definitions are most evident for small data sets and when there is a large "gap" between one or more adjacent data values. The following panel of graphs shows the nine sample quantile methods for a data set that has 10 observations, {0 1 1 1 2 2 2 4 5 8}. Each cell in the panel shows the quantiles for p = 0.001, 0.002, ..., 0.999. The bottom of each cell is a fringe plot that shows the six unique data values.

In these graphs, the horizontal axis represents the data and quantiles. For any value of x, the graph estimates the cumulative proportion of the population that is less than or equal to x. Notice that if you turn your head sideways, you can see the quantile function, which is the inverse function that estimates the quantile for each value of the cumulative probability.

You can see that although the nine quantile functions have the same basic shape, the first three methods estimate quantiles by using a discrete rounding scheme, whereas the other methods use a continuous interpolation scheme.

You can use the same data to compare methods. Instead of plotting each quantile definition in its own cell, you can overlay two or more methods. For example, by default, SAS computes sample quantiles by using the type=2 method, whereas R uses type=7 by default. The following graph overlays the sample quantiles to compare the default methods in SAS and R on this tiny data set. The default method in SAS always returns a data value or the average of adjacent data values; the default method in R can return any value in the range of the data.

Comparison of the default quantile estimates in SAS and R on a tiny data set
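If you do not want to run the downloadable program, the following SAS/IML sketch shows one way to compute the R default (H&F type 7) and compare it with the SAS default for this tiny data set. The type 7 formula h = (N-1)p + 1 follows Hyndman and Fan; the module name Qntl7 and the choice p=0.9 are assumptions made for this illustration.

proc iml;
/* Minimal sketch of the H&F type 7 definition (the R default).
   The module name Qntl7 is made up for this illustration. */
start Qntl7(x, p);
   y = colvec(x);   call sort(y);    /* sort a copy of the data */
   N = nrow(y);
   h = (N-1)*p + 1;                  /* type 7: h = (N-1)p + 1 */
   j = floor(h);   g = h - j;        /* integer and fractional parts */
   if j >= N then return( y[N] );
   return( y[j] + g*(y[j+1] - y[j]) );
finish;
 
x = {0,1,1,1,2,2,2,4,5,8};
call qntl(SASDef, x, 0.9, 5);        /* QNTLDEF=5 (H&F type 2): (x[9]+x[10])/2 = 6.5 */
RDef = Qntl7(x, 0.9);                /* type 7: x[9] + 0.1*(x[10]-x[9]) = 5.3 */
print SASDef RDef;
quit;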

Does the definition of sample quantiles matter?

As shown above, different software packages use different defaults for sample quantiles. Consequently, when you report quantiles for a small data set, it is important to report how the quantiles were computed.
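For example, in SAS you can make the definition explicit by specifying the QNTLDEF= option when you compute the percentile. The following one-statement sketch is illustrative only; the data set name Have and the variable name x are placeholders.

proc means data=Have n p90 qntldef=5;   /* state the definition; 5 is the SAS default */
   var x;
run;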

However, in practice analysts don't worry too much about which definition they are using because the difference between methods is typically small for larger data sets (100 or more observations). The biggest differences are often between the discrete methods, which always report a data value or the average between two adjacent data values, and the interpolation methods, which can return any value in the range of the data. Extreme quantiles can also differ between the methods because the tails of the data often have fewer observations and wider gaps.

The following graph shows the sample quantiles for 100 observations that were generated from a random uniform distribution. As before, the two sample quantiles are type=2 (the SAS default) and type=7 (the R default). At this scale, you can barely detect any differences between the estimates. The red dots (type=7) are on top of the corresponding blue dots (type=2), so few blue dots are visible.

Comparison of the default quantile estimates in SAS and R on a larger data set

So does the definition of the sample quantile matter? Yes and no. Theoretically, the different methods compute different estimates and have different properties. If you want to use an estimator that is unbiased or one that is based on distribution-free computations, feel free to read Hyndman and Fan and choose the definition that suits your needs. The differences are evident for tiny data sets. On the other hand, the previous graph shows that there is little difference between the methods for moderately sized samples and for quantiles that are not near gaps. In practice, most data analysts just accept the default method for whichever software they are using.

In closing, I will mention that there are other quantile estimation methods that are not simple formulas. In SAS, the QUANTREG procedure solves a minimization problem to estimate the quantiles. The QUANTREG procedure enables you to not only estimate quantiles, but also estimate confidence intervals, weighted quantiles, the difference between quantiles, conditional quantiles, and more.

SAS program to compute nine sample quantiles.

The post Sample quantiles: A comparison of 9 definitions appeared first on The DO Loop.

May 22, 2017
 

In last week's article about the Flint water crisis, I computed the 90th percentile of a small data set. Although I didn't mention it, the value that I reported is different from the 90th percentile that is reported in Significance magazine.

That is not unusual. The data only had 20 unique values, and there are many different formulas that you can use to compute sample percentiles (generally called quantiles). Because different software packages use different default formulas for sample quantiles, it is not uncommon for researchers to report different quantiles for small data sets. This article discusses the five percentile definitions that are supported in SAS software.

You might wonder why there are multiple definitions. Recall that a sample quantile is an estimate of a population quantile. Statisticians have proposed many quantile estimators, some of which are based on the empirical cumulative distribution (ECDF) of the sample, which approximates the cumulative distribution function (CDF) for the population. The ECDF is a step function that has a jump discontinuity at each unique data value. Consequently, the inverse ECDF does not exist and the quantiles are not uniquely defined.

Definitions of sample quantiles

In SAS, you can use the PCTLDEF= option in PROC UNIVARIATE or the QNTLDEF= option in other procedures to control the method used to estimate quantiles. A sample quantile does not have to be an observed data value because you are trying to estimate an unknown population parameter.

For convenience, assume that the sample data are listed in sorted order. In high school, you probably learned that if a sorted sample has an even number of observations, then the median value is the average of the middle observations. The default quantile definition in SAS (QNTLDEF=5) extends this familiar rule to other quantiles. Specifically, if the sample size is N and you ask for the q_th quantile, then when Nq is an integer, the quantile is the average of the two data values x[Nq] and x[Nq+1]. When Nq is not an integer, the quantile is the data value x[j+1], where j = floor(Nq). For example, if N=10 and you want the q=0.17 quantile, then Nq=1.7, so j=1 and the 17th percentile is reported as the second ordered value, x[2].

These are not the only choices you can make. The other percentile definitions correspond to different choices. For example, you could take the next ordered value even when Nq is an integer (QNTLDEF=3), or you could round Nq to the nearest integer (QNTLDEF=2). Or you could use linear interpolation (QNTLDEF=1 and QNTLDEF=4) between the data values whose (sorted) indices are closest to Nq. In the example where N=10 and q=0.17, the QNTLDEF=1 interpolated quantile is 0.3 x[1] + 0.7 x[2].
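As a quick check of these rules, the following sketch computes the 17th percentile of a hypothetical sample of 10 values under two definitions. The data values and the data set names are made up for this illustration.

/* hypothetical sorted sample: x[1]=10, x[2]=20, ..., x[10]=100 */
data Q17;
input x @@;
datalines;
10 20 30 40 50 60 70 80 90 100
;
 
proc univariate data=Q17 pctldef=5 noprint;   /* default definition */
   var x;
   output out=Def5 pctlpts=17 pctlpre=P5_;    /* P5_17 = x[2] = 20 */
run;
 
proc univariate data=Q17 pctldef=1 noprint;   /* linear interpolation */
   var x;
   output out=Def1 pctlpts=17 pctlpre=P1_;    /* P1_17 = 0.3*10 + 0.7*20 = 17 */
run;
 
proc print data=Def5; run;
proc print data=Def1; run;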

Visualizing the definitions for quantiles

The SAS documentation contains the formulas used for the five percentile definitions, but sometimes a visual comparison is easier than slogging through mathematical equations. The differences between the definitions are most apparent on small data sets that contain integer values, so let's create a tiny data set and apply the five definitions to it. The following example has 10 observations and six unique values.

data Q;
input x @@;
datalines;
0 1 1 1 2 2 2 4 5 8
;
ECDF of a small data set

You can use PROC UNIVARIATE or other methods to plot the empirical cumulative proportions, as shown. Because the ECDF is a step function, most cumulative proportion values (such as 0.45) are "in a gap." By this I mean that there is no observation t in the data for which the cumulative proportion P(X ≤ t) equals 0.45. Depending on how you define the sample quantiles, the 0.45 quantile might be reported as 1, 1.5, 1.95, or 2.

Since the default definition is QNTLDEF=5, let's visualize the sample quantiles for that definition. You can use the PCTLPTS= option on the OUTPUT statement in PROC UNIVARIATE to declare the percentiles that you want to compute. Equivalently, you can use the QNTL subroutine in PROC IML, as below. Regardless, you can ask SAS to find the quantiles for a set of probabilities on a fine grid of points such as {0.001, 0.002, ..., 0.998, 0.999}. You can then graph the probabilities versus the quantiles to visualize how the percentile definition computes quantiles for the sample data.

proc iml;
use Q; read all var "x"; close;       /* read data */
prob = T(1:999) / 1000;               /* fine grid of prob values */
call qntl(quantile, x, prob, 5);      /* use method=5 definition  */
create Pctls var {"quantile" "prob" "x"}; append; close;
quit;
 
title "Sample Percentiles";
title2 "QNTLDEF = 5";
proc sgplot data=Pctls noautolegend;
   scatter x=quantile y=prob / markerattrs=(size=5 symbol=CircleFilled);
   fringe x / lineattrs=GraphData2(thickness=3);
   xaxis display=(nolabel) values=(0 to 8);
   yaxis offsetmin=0.05 grid values=(0 to 1 by 0.1) label="Cumulative Proportions";
   refline 0 1 / axis=y;
run;
Sample quantiles (percentiles) for a small data set

For each probability value (Y axis), the graph shows the corresponding sample quantile (X axis) for the default definition in SAS, which is QNTLDEF=5. The X axis also displays red tick marks at the location of the data. You can use this graph to find any quantile. For example, to find the 0.45 quantile, you start at 0.45 on the Y axis, move to the right until you hit a blue marker, and then drop down to the X axis to discover that the 0.45 quantile estimate is 2.

If you prefer to think of the quantiles (the X values) as a function of the probabilities, just interchange the X= and Y= arguments in the SCATTER statement (or turn your head sideways!). Then the quantile function is a step function.
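For example, the following minimal variation of the previous PROC SGPLOT call (a sketch that reuses the Pctls data set) plots the quantile function directly:

title "Sample Quantile Function";
title2 "QNTLDEF = 5";
proc sgplot data=Pctls noautolegend;
   scatter x=prob y=quantile / markerattrs=(size=5 symbol=CircleFilled);
   xaxis grid values=(0 to 1 by 0.1) label="Cumulative Proportion";
   yaxis display=(nolabel);
run;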

Comparing all five SAS percentile definitions

It is easy to put a loop around the SAS/IML computation to compute the sample quantiles for the five different definitions that are supported in SAS. The following SAS/IML program writes a data set that contains the sample quantiles. You can use the WHERE statement in PROC PRINT to compare the same quantile across the different definitions. For example, the following displays the 0.45 quantile (45th percentile) for the five definitions:

/* Compare all SAS methods */
proc iml;
use Q; read all var "x"; close;       /* read data */
prob = T(1:999) / 1000;               /* fine grid of prob values */
create Pctls var {"Qntldef" "quantile" "prob" "x"};
do def = 1 to 5;
   call qntl(quantile, x, prob, def); /* qntldef=1,2,3,4,5 */
   Qntldef = j(nrow(prob), 1, def);   /* ID variable */
   append;
end;
close;
quit;
 
proc print data=Pctls noobs;
   where prob = 0.45;                 /* compare 0.45 quantile for different definitions */
   var Qntldef quantile;
run;

You can see that the different definitions lead to different sample quantiles. How do the quantile functions compare? Let's plot them and see:

ods graphics / antialiasmax=10000;
title "Sample Percentiles in SAS";
proc sgpanel data=Pctls noautolegend;
   panelby Qntldef / onepanel rows=2;
   scatter x=quantile y=prob/ markerattrs=(size=3 symbol=CircleFilled);
   fringe x;
   rowaxis offsetmax=0 offsetmin=0.05 grid values=(0 to 1 by 0.1) label="Cumulative Proportion";
   refline 0 1 / axis=y;
   colaxis display=(nolabel);
run;
Compare percentile definitions in SAS

The graphs (click to enlarge) show that QNTLDEF=1 and QNTLDEF=4 are piecewise-linear interpolation methods, whereas QNTLDEF=2, 3, and 5 are discrete rounding methods. The default method (QNTLDEF=5) is similar to QNTLDEF=3 except for certain averaged values. For the discrete definitions, SAS returns either a data value or the average of adjacent data values. The interpolation methods do not have that property: they can return quantile values that lie anywhere between observed data values.

If you have a small data set, as in this blog post, it is easy to see how the percentile definitions are different. For larger data sets (say, 100 or more unique values), the five quantile functions look quite similar.

The differences between definitions are most apparent when there are large gaps between adjacent data values. For example, the sample data has a large gap between the ninth and tenth observations, which have the values 5 and 8, respectively. If you compute the 0.901 quantile, you will discover that the "round to nearest" method (QNTLDEF=2) gives 5 as the sample quantile, whereas the "round up" method (QNTLDEF=3) gives the value 8. Similarly, the "backward interpolation" method (QNTLDEF=1) gives 5.03, whereas the "forward interpolation" method (QNTLDEF=4) gives 7.733.
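If you want to verify these values numerically, the following short SAS/IML check (which reuses the Q data set defined earlier) computes the 0.901 quantile under each of these four definitions:

proc iml;
use Q;  read all var "x";  close;
p = 0.901;
call qntl(q1, x, p, 1);   /* backward interpolation: 0.99*5 + 0.01*8 = 5.03 */
call qntl(q2, x, p, 2);   /* nearest observation:    x[9]  = 5              */
call qntl(q3, x, p, 3);   /* empirical CDF:          x[10] = 8              */
call qntl(q4, x, p, 4);   /* forward interpolation:  7.733                  */
print q1 q2 q3 q4;
quit;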

In summary, this article shows how the (somewhat obscure) QNTLDEF= option results in different quantile estimates. Most people just accept the default definition (QNTLDEF=5), but if you are looking for a method that interpolates between data values, rather than a method that rounds and averages, I recommend QNTLDEF=1, which performs linear interpolations of the ECDF. The differences between the definitions are most apparent for small samples and when there are large gaps between adjacent data values.

Reference

For more information about sample quantiles, including a mathematical discussion of the various formulas, see
Hyndman, R. J. and Fan, Y. (1996) "Sample quantiles in statistical packages", American Statistician, 50, 361–365.

The post Quantile definitions in SAS appeared first on The DO Loop.

May 17, 2017
 

The April 2017 issue of Significance magazine features a cover story by Robert Langkjaer-Bain about the Flint (Michigan) water crisis. For those who don't know, the Flint water crisis started in 2014 when the impoverished city began using the Flint River as a source of city water. The water was not properly treated, which led to unhealthy (even toxic) levels of lead and other contaminants in the city's water supply. You can read an overview of the Flint Water Crisis on Wikipedia.

The crisis was compounded because someone excluded two data points before computing a quantile of a small data set. This seemingly small transgression had tragic ramifications. This article examines the Flint water quality data and shows why excluding those two points changed the way that the city responded. You can download the SAS program that analyzes these data.

Federal standards for detecting unsafe levels of lead

The federal Lead and Copper Rule of 1991 specifies a statistical method for determining when the concentration of lead in a water supply is too high. First, you sample from a number of "worst case" homes (such as those served by lead pipes), then compute the 90th percentile of the lead levels from those homes. If the 90th percentile exceeds 15 parts per billion (ppb), then the water is unsafe and action must be taken to correct the problem.

In spring 2015, this data collection and analysis was carried out in Flint by the Michigan Department of Environmental Quality (MDEQ), but as discussed in the Significance article, the collection process was flawed. For example, the MDEQ was supposed to collect 100 measurements, but only 71 samples were obtained, and they were not from the "worst case" homes. The 71 lead measurements that they collected are reproduced below, where I have used '0' for "not detectable." A call to PROC MEANS computes the 90th percentile (P90) of the complete sample:

/* values of lead concentration in Flint water samples.
   Use 0 for "not detectable" */
data FlintObs;
label Lead = "Lead Concentration (ppb)";
input lead @@;
Exclude = (Lead=20 | Lead=104); /* MDEQ excluded these two large values */
datalines;
0 0 0 0 0 0 0 0 0 0 0 0 0 
1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 
3 3 3 3 3 3 3 3 3 3 3 4 4 
5 5 5 5 5 5 5 5 6 6 6 6 7 7 7 
8 8 9 10 10 11 13 18 20 21 22 29 43 43 104
;
 
proc means data=FlintObs N p90; 
   var lead;
run;

According to this analysis of the full data, the 90th percentile of the sample is 18 ppb, which exceeds the federal limit of 15 ppb. Consequently, the Flint water fails the safety test, and the city must take action to improve the water.

But that is not what happened. Instead, "the MDEQ told the city's water quality supervisors to remove two of the samples" (Langkjaer-Bain, p. 19) that were over 15 ppb. The MDEQ claimed that these data were improperly collected. The two data points that were excluded have the values 20 and 104. Because these values are both higher than the 90th percentile, excluding these observations lowered the 90th percentile of the modified sample, which has 69 observations. The following call to PROC MEANS computes the 90th percentile for the modified sample:

proc means data=FlintObs N p90; 
   where Exclude=0;     /* exclude two observations */
   var lead;
run;

The second table shows that the 90th percentile of the modified sample is 13 ppb. "The exclusion of these two samples nudged the 90th percentile reading...below the all-important limit of 15 ppb." (Langkjaer-Bain, p. 20) The modified conclusion is that the city does not need to undertake expensive corrective action to render the water safe.

The consequences of keeping or excluding these outliers were huge. If kept in the analysis, officials are required to take expensive corrective action; if excluded, no action is required. In the end, the modified sample was used, and the citizens of Flint were told that the water supply had passed the safety test.

The following histogram (click to enlarge) is similar to Figure 1 in Langkjaer-Bain's article. The red bars indicate observations that the MDEQ excluded from the analysis. The graph clearly shows that the distribution of lead values has a long tail. A broken axis is used to indicate that the distance to the 104 ppb reading has been shortened to reduce horizontal space. The huge gap near 15 ppb indicates a lack of data near that important value. Therefore the quantiles near that point will be extremely sensitive to deleting extreme values. To me, the graph indicates that more data should be collected so that policy makers can be more confident in their conclusions.

Distribution of Lead Levels in Flint, Michigan (Spring 2015)

Confidence intervals for the 90th percentile

Clearly the point estimate for the 90th percentile depends on whether or not those two measurements are excluded. But can statistics provide any additional insight into the 90th percentile of lead levels in the Flint water supply? When someone reports a statistic for a small sample, I like to ask "how confident are you in that statistic?" A standard error is one way to express the accuracy of a point estimate; a 95% confidence interval (CI) is another. The width of a confidence interval gives us information about the accuracy of the point estimate. As you might expect, the standard error for an extreme quantile (such as 0.9) is typically much bigger than for a quantile near 0.5, especially when there isn't much data near the quantile.

Let's use SAS procedures to construct a 95% CI for the 90th percentile. PROC UNIVARIATE supports the CIPCTLDF option, which produces distribution-free confidence intervals. I'll give the Flint officials the benefit of the doubt and compute a confidence interval for the modified data that excluded the two "outliers":

proc univariate data=FlintObs CIPctlDF;
   where Exclude=0;     /* exclude two observations */
   var Lead;
   ods select quantiles;
run;

The 95% confidence interval for P90 is [8, 43], which is very wide and includes the critical value 15 ppb in its interior. If someone asks, "how confident are you that the 90th percentile does not exceed 15 ppb," you should respond, "based on these data, I am not confident at all."

Statistical tests for the 90th percentile

As I've written before, you can also use the QUANTREG procedure in SAS to provide an alternative method to compute confidence intervals for percentiles. Furthermore, the QUANTREG procedure supports the ESTIMATE statement, which you can use to perform a one-sided test for the hypothesis "does the 90th percentile exceed 15 ppb?" The following call to PROC QUANTREG performs this analysis and uses 10,000 bootstrap resamples to estimate the confidence interval:

proc quantreg data=FlintObs CI=resampling(nrep=10000);
   where Exclude=0;     /* exclude two observations */
   model lead = / quantile=0.9 seed=12345;
   estimate 'P90 > 15' Intercept 1 / upper CL testvalue=15;
   ods select ParameterEstimates Estimates;
run;

The standard error for the 90th percentile is about 5.3. Based on bootstrap resampling methods, the 95% CI for P90 is approximately [2.4, 23.6]. (Other methods for estimating the CI give similar or wider intervals.) The ESTIMATE statement is used to test the null hypothesis that P90 is greater than 15. The p-value is large, which means that even if you delete two large lead measurements, the data do not provide evidence to reject the null hypothesis.

Conclusions

There is not enough evidence to reject the hypothesis that P90 is greater than the legal limit of 15 ppb. Two different 95% confidence intervals for P90 include 15 ppb in their interiors.

In fact, the confidence intervals include 15 ppb whether you use all 71 observations or just the 69 observations that MDEQ used. So you can argue about whether the MDEQ should have excluded the controversial measurements, but the hypothesis test gives the same conclusion regardless. By using these data, you cannot rule out the possibility that the 90th percentile of the Flint water supply is greater than 15 ppb.

What do you have to say? Share your comments.

The post Quantiles and the Flint water crisis appeared first on The DO Loop.

May 15, 2017
 

Last week I showed a timeline of living US presidents. The number of living presidents is constant during the time interval between inaugurations and deaths of presidents. The data was taken from a Wikipedia table (shown below) that shows the number of years and days between events. This article shows how you can use the INTCK and INTNX functions in SAS to compute the time between events in this format. In particular, I use two little-known options to these functions that make this task easy.

Intervals between dates

If you are computing the interval between two dates (a start date and an end date) there are two SAS functions that you absolutely must know about.

  • The INTCK function returns the number of time units between two dates. For the time unit, you can choose years, months, weeks, days, and more. For example, in my previous article I used the INTCK function to determine the number of days between two dates.
  • The INTNX function returns a SAS date that is a specified number of time units away from a specified date. For example, you can use the INTNX function to compute the date that is 308 days in the future from a given date.

These two functions complement each other: one computes the difference between two dates, the other enables you to add time units to a date value.

By default, these functions use the number of "calendar boundaries" between the dates, such as the first day of a year, month, or week. For example, if you choose to measure year intervals, the INTCK function counts how many times 01JAN occurred between the dates, and the INTNX function returns a future 01JAN date. Similarly, if you measure month intervals, the INTCK function counts how many first-of-the-months occur between two dates, and the INTNX function returns a future first-of-the-month date.
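For example, the following DATA step illustrates the default boundary-counting behavior. The dates are chosen only for illustration.

data _null_;
   d1 = '31DEC2016'd;
   d2 = '01JAN2017'd;
   nYears  = intck('year',  d1, d2);   /* 1: the dates straddle one 01JAN boundary     */
   nMonths = intck('month', d1, d2);   /* 1: they also straddle one first-of-the-month */
   put nYears= nMonths=;
run;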

Options to compute anniversary dates

Both functions support many options to modify the default behavior. If you want to count full year intervals, instead of the number of times people celebrated New Year's Eve, these functions support options (as of SAS 9.2) to count the number of "anniversaries" between two dates and to compute the date of a future anniversary. You can use the 'CONTINUOUS' option for the INTCK function and the 'SAME' option for the INTNX function, as follows:

  • The 'CONTINUOUS' option in the INTCK function enables you to count the number of anniversaries of one date that occur prior to a second date. For example, the statement
    Years = intck('year', '30APR1789'd, '04MAR1797'd, 'continuous');
    returns the value 7 because there are 7 full years (anniversaries of 30APR) between those two dates. Without the 'CONTINUOUS' option, the function returns 8 because 01JAN occurs 8 times between those dates.
  • The statement
    Anniv = intnx('year', '30APR1789'd, 7, 'same');
    returns the 7th anniversary of the date 30APR1789. In other words, it returns the date value for 30APR1796.

The beauty of these functions is that they automatically handle leap years! If you request the number of days between two dates, the INTCK function includes leap days in the result. If an event occurs on a leap day, and you ask the INTNX function for the next anniversary of that event, you will get 28FEB of the next year, which is the most common convention for handling anniversaries of a leap day.
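The following short DATA step illustrates the leap-day convention that was just described. The dates are chosen only for illustration.

data _null_;
   leapDay = '29FEB2016'd;
   anniv1 = intnx('year', leapDay, 1, 'same');   /* 28FEB2017, because 2017 is not a leap year */
   anniv4 = intnx('year', leapDay, 4, 'same');   /* 29FEB2020, because 2020 is a leap year     */
   put anniv1= date9. anniv4= date9.;
run;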

An algorithm to compute years and days between events

The following algorithm computes the number of years and days between dates in SAS:

  • Use the INTCK function with the 'CONTINUOUS' option to compute the number of complete years between the two dates.
  • Use the INTNX function to find a third date (the anniversary date) which is the same month and day as the start date, but occurs less than one year prior to the end date. (The anniversary of a leap day is either 28FEB or 29FEB, depending on whether the anniversary occurs in a leap year.)
  • Use the INTCK function to compute the number of days between the anniversary date and the end date.

The following DATA step computes the time interval in years and days between the first few US presidential inaugurations and deaths. The resulting Year and Day variables contain the same information as is displayed in the Wikipedia table.

data YearDays;
format Date prevDate anniv Date9.;
input @1  Date anydtdte12.
      @13 Event $26.;
prevDate = lag(Date);
if _N_=1 then do;                               /* when _N_=1, lag(Date)=. */
   Years=.; Days=.; return;            /* set years & days, go to next obs */
end;
Years = intck('year', prevDate, Date, 'continuous'); /* num complete years */
Anniv = intnx('year', prevDate, Years, 'same');      /* most recent anniv  */
Days = intck('day', anniv, Date);                    /* days since anniv   */
datalines;
Apr 30, 1789 Washington Inaug
Mar 4, 1797  J Adams Inaug
Dec 14, 1799 Washington Death
Mar 4, 1801  Jefferson Inaug
Mar 4, 1809  Madison Inaug
Mar 4, 1817  Monroe Inaug
Mar 4, 1825  JQ Adams Inaug
Jul 4, 1826  Jefferson Death
Jul 4, 1826  J Adams Death
run;
 
proc print data=YearDays;
var Event prevDate Date Anniv Years Days;
run;

Summary and references

In summary, the INTCK and INTNX functions are essential for computing intervals between dates. In this article, I emphasized two little-known options: the 'CONTINUOUS' option in INTCK and the 'SAME' option in INTNX. By using these options, you can compute the number of anniversaries between dates and the most recent anniversary. Thus you can compute the years and days between two dates.

There have been countless articles and papers written about SAS dates and finding intervals between dates.

Lastly, do you know what the acronyms INTCK and INTNX stand for? Obviously the 'INT' part refers to INTervals. The general consensus is that 'INTCK' stands for "Interval Check" and 'INTNX' stands for "Interval Next."

The post INTCK and INTNX: Two essential functions for computing intervals between dates in SAS appeared first on The DO Loop.

May 10, 2017
 

A SAS customer asked how to simulate data from a three-parameter lognormal distribution as specified in the PROC UNIVARIATE documentation. In particular, he wanted to incorporate a threshold parameter into the simulation.

Simulating lognormal data is easy if you remember an important fact: if X is lognormally distributed, then Y=log(X) is normally distributed. The converse is also true: If Y is normally distributed, then X=exp(Y) is lognormally distributed. Consequently, to simulate lognormal data you can simulate Y from the normal distribution and exponentiate it to get X, which is lognormally distributed by definition. If you want, you can add a threshold parameter to ensure that all values are greater than the threshold.

Simulate a sample from a two-parameter lognormal distribution

To reiterate: if Y ~ N(μ, σ) is normally distributed with location parameter μ and scale parameter σ, then X = exp(Y) is lognormally distributed with log-location parameter μ and log-scale parameter σ. Different authors use different names for the μ and σ parameters. The PROC UNIVARIATE documentation uses the symbol ζ (zeta) instead of μ, and it calls ζ a scale parameter. Hence, I will use the symbol ζ, too. I have previously written about the relationship between the two lognormal parameters and the mean and variance of the lognormal distribution.

Regardless of what name and symbol you use, you can use the definition to simulate lognormal data. The following SAS DATA step simulates one sample of size 1000 from a lognormal distribution with parameters ζ=2 and σ=0.5. PROC UNIVARIATE then fits a two-parameter lognormal distribution to the simulated data. The maximum likelihood estimates for the sample are 2.01 and 0.49, so the estimates from the simulated data are very close to the parameter values:

ods graphics/reset;
%let N = 1000;          /* sample size */
data LN1;
call streaminit(98765);
sigma = 0.5;      /* shape or log-scale parameter    */
zeta = 2;         /* scale or log-location parameter */
do i = 1 to &N;
   Y = rand("Normal", zeta, sigma);  /* Y ~ N(zeta, sigma)    */
   X = exp(Y);                       /* X ~ LogN(zeta, sigma) */
   output;
end;
keep X;
run;
 
proc univariate data=LN1;      /* visualize simulated data and check fit */
   histogram X / lognormal endpoints=(0 to 50 by 5)
                 odstitle="Simulated Lognormal Data (zeta=2, sigma=0.5)";
run;
Simulated lognormal data

Simulate many samples from a three-parameter lognormal distribution

You can modify the previous program to simulate from a lognormal distribution that has a threshold parameter. You simply add the threshold value to the exp(Y) value, like this: X = theta + exp(Y). Because exp(Y) is always positive, X is always greater than theta, which is the threshold value.

In Monte Carlo simulation studies, you often want to investigate the sampling distribution of the model parameter estimates. That is, you want to generate many samples from the same model and see how the estimates differ across the random samples. The following DATA step simulates 500 random samples from the three-parameter lognormal distribution with threshold value 10. You can analyze all the samples with one call to PROC UNIVARIATE that uses the BY statement to identify each sample. This is the efficient way to perform Monte Carlo simulation studies in SAS.

%let N = 100;          /* sample size */
%let NumSamples = 500; /* number of samples */
%let Threshold = 10;   
data LN;               /* generate many random samples */
call streaminit(98765);
sigma = 0.5;      /* shape or log-scale parameter    */
zeta = 2;         /* scale or log-location parameter */
do SampleID = 1 to &NumSamples;
   do i = 1 to &N;
      Y = rand("Normal", zeta, sigma);
      X = &Threshold + exp(Y);
      output;
   end;
end;
keep SampleID X;
run;
 
ods exclude all;                 /* do not produce tables during analyses */
proc univariate data=LN;
   by SampleID;                  /* analyze the many random samples */
   histogram x / lognormal(theta=&Threshold); /* 2-param estimation */
   ods output parameterestimates=PE;
run;
ods exclude none;
 
data Estimates;                 /* convert from long to wide for plotting */
keep SampleID Zeta Sigma;
merge PE(where=(Parameter="Scale") rename=(Estimate=Zeta))
      PE(where=(Parameter="Shape") rename=(Estimate=Sigma));
by sampleID;
label Zeta="zeta: Estimates of Scale (log-location)"
      Sigma="sigma: Estimate of Shape (log-scale)";
run;
 
title "Approximate Sampling Distribution of Lognormal Estimates";
title2 "Estimates from &NumSamples Random Samples (N=&N)";
proc sgplot data=Estimates;
   scatter x=Zeta y=Sigma;
   refline 2 / axis=x;
   refline 0.5 / axis=y;
run;
MLE Parameter Estimates for 500 Simulated Random Samples from Lognormal Distribution

The distribution of the 500 estimates appears to be centered on (ζ, σ) = (2, 0.5), which are the parameter values that were used to simulate the data. You can use the usual techniques in Monte Carlo simulation to estimate the standard deviation of the estimates.
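For example, a minimal follow-up step (a sketch that reuses the Estimates data set created above) summarizes the Monte Carlo distribution of the estimates; the STD column estimates the Monte Carlo standard error of each parameter estimate.

proc means data=Estimates n mean std;
   var Zeta Sigma;
run;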

A few closing remarks:

  • The RAND function does not support location and scale parameters for the lognormal distribution in SAS 9.4M4. However, the RANDGEN subroutine in SAS/IML does support the two-parameter lognormal distribution. The RAND function will support lognormal parameters in SAS 9.4M5.
  • In this study, the estimates are all two-parameter estimates, which assumes that you know the threshold value in the population. If not, you can use THETA=EST on the HISTOGRAM statement to obtain three-parameter lognormal estimates.
  • Because you need to exponentiate the Y variable, random values of Y must be less than the value of CONSTANT('logbig'), which is about 709. To avoid numerical overflows, make sure that ζ + 4*σ is safely less than 709.
  • This sort of univariate simulation is discussed in detail in Chapter 7 of the book Simulating Data with SAS, along with a general discussion about how to simulate from location-scale families even for distributions for which the RAND function does not support location or scale parameters.

The post Simulate lognormal data in SAS appeared first on The DO Loop.

May 8, 2017
 

Quick! What is the next term in the numerical sequence 1, 2, 1, 2, 3, 4, 5, 4, 3, 4, ...? If you said '3', then you must be an American history expert, because that sequence represents the number of living US presidents beginning with Washington's inauguration on 30APR1789 and continuing until the death of James Monroe on 04JUL1831.

I stumbled upon the Wikipedia page for "Living Presidents of the United States," which contains a table that shows the dates of inaugurations and deaths of US presidents and the number of living presidents between events. The table is very crowded and I found it difficult to visualize the data, so I decided to create a SAS graph that shows the timeline for the number of living US presidents.

All the information in the complex Wikipedia table can be derived by knowing the dates on which a new president was inaugurated or a previous president died. You can create a simple data set with that information and use SAS to calculate other information, such as the number of living presidents or the time span between events. You can download the data and the SAS program that creates the following graph. (Click to enlarge.) If you create the graph in SAS, you can hover the mouse over a marker to see which president died or was inaugurated for each event.

Timeline of living US presidents

A few interesting facts that are revealed by the visualization of these events (all statistics as of 08MAY2017):

  • The time-weighted average since Washington's inauguration is 3.43 living presidents per day.
  • The period from 1989 to 2017 featured a larger-than-usual number of living presidents. Unfortunately, I don't expect that trend to last, since Presidents Carter and G. H. W. Bush are both very old.
  • There have been six brief intervals when the only living president was the sitting president.
  • On 04JUL1826, the US lost two former presidents (John Adams and Thomas Jefferson) within 6 hours. You can see the graph drop by 2 in 1826. Interestingly, James Monroe also died on Independence Day!
  • When a president dies in office, the number of living presidents does not change, since the vice-president is inaugurated that same day. Can you spot the eight times that a president died in office?
  • No presidents died during F. D. Roosevelt's administration—except FDR himself. The 20-year period from 1933-1953 is the longest span during which the number of living presidents stayed constant.
  • The reason no one died during FDR's terms was that Herbert Hoover remained alive. Hoover had the record of being the ex-president who lived longest after his inauguration: from 1929 until 1964, which is 13,014 days or 35.65 years! However, Jimmy Carter broke that record a few years ago. Carter was inaugurated in 1977 and has lived more than 40 years since that event.

Do you notice any other interesting features of this timeline? Leave a comment.

The post Timeline of living US presidents appeared first on The DO Loop.

May 3, 2017
 

If a financial analyst says it is "likely" that a company will be profitable next year, what probability would you ascribe to that statement? If an intelligence report claims that there is "little chance" of a terrorist attack against an embassy, should the ambassador interpret this as a one-in-a-hundred chance, a one-in-ten chance, or some other value?

Analysts often use vague statements like "probably" or "chances are slight" to convey their beliefs that a future event will or will not occur. Government officials and policy-makers who read reports from analysts must interpret and act on these vague statements. If the reader of a report interprets a phrase differently from what the writer intended, that can lead to bad decisions.

Assigning probabilities to statements

Original box plot: Distribution of probabilities for word phrases

In the book Psychology of Intelligence Analysis (Heuer, 1999), the author presents "the results of an experiment with 23 NATO military officers accustomed to reading intelligence reports. They were given a number of sentences such as: "It is highly unlikely that ...." All the sentences were the same except that the verbal expressions of probability changed. The officers were asked what percentage probability they would attribute to each statement if they read it in an intelligence report."

The results are summarized in the adjacent dot plot from Heuer (Chapter 12), which summarizes how the officers assess the probability of various statements. The graph includes a gray box for some statements. The box is not a statistical box plot. Rather it indicates the probability range according to a nomenclature proposed by Kent (1964), who tried to get the intelligence community to agree that certain phrases would be associated with certain probability ranges.

For some statements (such as "better than even" and "almost no chance") there was general agreement among the officers. For others, there was large variability in the probability estimates. For example, many officers interpreted "probable" as approximately a 75% chance, but quite a few interpreted it as less than 50% chance.

A modern re-visualization

Zonination's box plot: Distribution of probabilities for word phrases

The results of this experiment are interesting on many levels, but I am going to focus on the visualization of the data. I do not have access to the original data, but this experiment was repeated in 2015 when the user "Zonination" got 46 users on Reddit (who were not military experts) to assign probabilities to the statements. His visualization of the resulting data won a 2015 Kantar Information is Beautiful Award. The visualization uses box plots to show the schematic distribution and overlays the 46 individual estimates by using a jittered, semi-transparent, scatter plot. The Zonination plot is shown at the right (click to enlarge). Notice that the "boxes" in this second graph are determined by quantiles of the data, whereas in the first graph they were theoretical ranges.

Creating the graph in SAS

I decided to remake Zonination's plot by using PROC SGPLOT in SAS. I made several modifications that improve the readability and clarity of the plot.

  • I sorted the categories by the median probability. The median is a robust estimate of the "consensus probability" for each statement. The sorted categories indicate the relative order of the statements in terms of perceived likelihood. For example, an "unlikely" event is generally perceived as more probable than an event that has "little chance." For details about sorting the variables in SAS, see my article about how to sort variables by a statistic.
  • I removed the colors. Zonination's rainbow-colored chart is aesthetically pleasing, but the colors do not add any new information about the data. However, the colors help the eye track horizontally across the graph, so I used alternating bands to visually differentiate adjacent categories. You can create color bands by using the COLORBANDS= option in the YAXIS statement.
  • To reduce overplotting of markers, I used systematic jittering instead of random jittering. In random jittering, each vertical position is randomly offset. In systematic (centered) jittering, the markers are arranged so that they are centered on the "spine" of the box plot. Vertical positions are changed only when the markers would otherwise overlap. You can use the JITTER option in the SCATTER statement to systematically jitter marker positions.
  • Zonination's plot displays some markers twice, which I find confusing. Outliers are displayed once by the box plot and a second time by the jittered scatter plot. In my version, I suppress the display of outliers by the box plot by using the NOOUTLIERS option in the HBOX statement.

You can download the SAS code that creates the data, sorts the variables by median, and creates the plot. The following call to PROC SGPLOT shows the HBOX and SCATTER statements that create the plot:

title "Perceptions of Probability";
proc sgplot data=Long noautolegend;
   hbox _Value_ / category=_Label_ nooutliers nomean nocaps;  
   scatter x=_Value_ y=_Label_ / jitter transparency=0.5
                     markerattrs=GraphData2(symbol=circlefilled size=4);
   yaxis reverse discreteorder=data labelpos=top labelattrs=(weight=bold)
                     colorbands=even colorbandsattrs=(color=gray transparency=0.9)
                     offsetmin=0.0294 offsetmax=0.0294; /* half of 1/k, where k=number of categories */
   xaxis grid values=(0 to 100 by 10);
   label _Value_ = "Assigned Probability (%)" _label_="Statement";
run;
SAS box plot: Distribution of probabilities for word phrases

The graph indicates that some responders either didn't understand the task or intentionally gave ridiculous answers. Of the 17 categories, nine contain extreme outliers, such as assigning certainty (100%) to the phrases "probably not," "we doubt," and "little chance." However, the extreme outliers do not affect the statistical conclusions about the distribution of probabilities because box plots (which use quartiles) are robust to outliers.

The SAS graph, which uses systematic jittering, reveals a fact about the data that was hidden in the graphs that used random jittering. Namely, most of the data values are multiples of 5%. Although a few people responded with values such as 88.7%, 1%, or 3%, most values (about 80%) are rounded to the nearest 5%. For the phrases "likely" and "we believe," 44 of 46 responses (96%) were a multiple of 5%. In contrast, only 18 of 46 responses (39%) for the phrase "almost no chance" were multiples of 5%, because many responses were 1%, 2%, or 3%.

Like the military officers in the original study, there is considerable variation in the way that the Reddit users assign a probability to certain phrases. It is interesting that some phrases (for example, "We believe," "Likely," and "Probable") have the same median value but wildly different interquartile ranges. For clarity, speakers/writers should use phrases that have small variation or (even better!) provide their own assessment of probability.

Does something about this perception study surprise you? Do you have an opinion about the best way to visualize these data? Leave a comment.

The post Perceptions of probability appeared first on The DO Loop.

May 1, 2017
 
Split data into groups that have the same mean and variance

A frequently asked question on SAS discussion forums concerns randomly assigning units (often patients in a study) to various experimental groups so that each group has approximately the same number of units. This basic problem is easily solved in SAS by using PROC SURVEYSELECT or a DATA step program.

A more complex problem is when the researcher wants the distribution of a covariate to be approximately equal for each group. For example, a medical researcher might want each group of patients to have approximately the same mean and variance for their cholesterol levels, as shown in the plot to the right. This second problem is much harder because it involves distributing the units (non-randomly) so that the groups satisfy some objective. Conceptually, you want an assignment that minimizes the difference between moments (mean, variance, skewness,...) of the subgroups.

The OPTEX procedure for optimal design

Creating an experimental design that optimizes some criterion is one of the many uses of the OPTEX procedure in SAS/QC software. In fact, this problem is one of the examples in the PROC OPTEX documentation. The example in this blog post follows the documentation example; see the documentation for details.

To solve this assignment problem with PROC OPTEX, you need to specify three things:

  • A data set (call it UNITS) that contains the individual units (often patients) that you want to assign to the treatments. For this example, the units are the living patients in the Sashelp.Heart data set.
  • A data set (call it TREATMENTS) that contains the names of the treatment groups. This example uses five groups with values 1, 2, 3, 4, and 5.
  • The covariate in the problem. The OPTEX procedure will assign units to groups so that the first k moments are approximately equal across treatment groups. This example uses the Cholesterol variable and k=2, which means that the mean and variance of the cholesterol levels for patients in each treatment group will be approximately equal.

The following statements create the two data sets and define a macro variable that contains the name of the Cholesterol variable:

/* Split the living Sashelp.heart patients into five groups so that 
   mean and variance of cholesterol is the same across groups. */
data Units;                   /* each row is a unit to be assigned */
set Sashelp.Heart(where=(status = "Alive"));
keep Sex AgeAtStart Height Weight Diastolic Systolic MRW Cholesterol;
run;
 
%let NumGroups =5;           /* number of treatment groups */
data Treatments;
do Trt = 1 to &NumGroups;    /* Trt is variable that assigns patients to groups */
   output;
end;
run;
 
%let Var = Cholesterol;      /* name of covariate */

As discussed in the PROC OPTEX documentation, the following call creates an output data set (named GROUPS) that assigns each patient to a group (1–5). A call to PROC MEANS displays the mean and variance of each group.

proc optex data=Treatments seed=97531 coding=orthcan;
   class Trt;
   model Trt;              /* specify treatment model */
   blocks design=Units;    /* specify units */
   model &Var &Var*&Var;   /* fixed covariates: &Var--> mean, &Var*&Var--> variance */
   output out=Groups;      /* merged data: units assigned to groups */
run;
 
proc means data=Groups mean std;
  class Trt;
  var &Var;
run;

Success! The table shows that each group has approximately the same size and approximately the same mean and variance of the Cholesterol variable. Remember, though, that this assignment scheme is not random, so be careful not to make unjustified inferences from estimates based on these group assignments.

I am not an expert on experimental design, but a knowledgeable colleague tells me that optimal design theory states that the design that minimizes the variance of the treatment effects (adjusting for the first two moments of the covariate) is the design in which treatment means and variances of the covariate are as equal as possible. This is the design that PROC OPTEX has produced.

The following call to PROC SGPLOT uses box plots to visualize the distribution of the Cholesterol variable across treatment groups. The graph is shown at the top of this article. Clearly the distribution of cholesterol is very similar for each group.

proc sgplot data=Groups;
   vbox &Var / category=Trt;
run;

In summary, the OPTEX procedure in SAS/QC software enables you to assign units to groups, where each group has approximately the same distribution of a specified covariate. In this article, the covariate measured cholesterol levels in patients, but you can also group individuals according to income, scholastic aptitude, and so on. For large data sets, the assignment problem is a challenging optimization problem, but PROC OPTEX provides a concise syntax and solves this problem efficiently.

The post Split data into groups that have the same mean and variance appeared first on The DO Loop.

April 26, 2017
 

Most SAS regression procedures support a CLASS statement which internally generates dummy variables for categorical variables. I have previously described what dummy variables are and how they are used. I have also written about how to create design matrices that contain dummy variables in SAS, and in particular how to use different parameterizations: GLM, reference, effect, and so forth.

It occurs to me that you can visualize the structure of a design matrix by using the same technique (heat maps) that I used to visualize missing value structures. In a design matrix, each categorical variable is replaced by several dummy variables. However, there are multiple parameterizations or encodings that result in different design matrices.

Heat maps of design matrices: GLM parameterization

Heat maps require several pixels for each row and column of the design matrix, so they are limited to small or moderate sized data. The following SAS DATA step extracts the first 150 observations from the Sashelp.Heart data set and renames some variables. It also adds a fake response variable because the regression procedures that generate design matrices (GLMMOD, LOGISTIC, GLMSELECT, TRANSREG, and GLIMMIX) require a response variable even though the goal is to create a design matrix for the explanatory variables. In the following statements, the OUTDESIGN option of the GLMSELECT procedure generates the design matrix. The matrix is then read into PROC IML where the HEATMAPDISC subroutine creates a discrete heat map.

/* add fake response variable; for convenience, shorten variable names */
data Temp / view=Temp;
   set Sashelp.heart(obs=150
                keep=BP_Status Chol_Status Smoking_Status Weight_Status);
   rename BP_Status=BP Chol_Status=Chol 
          Smoking_Status=Smoking Weight_Status=Weight;
   FakeY = 0;
run;
 
ods exclude all;  /* use OUTDESIGN= option to write the design matrix to a data set */
proc glmselect data=Temp outdesign(fullmodel)=Design(drop=FakeY);
   class BP Chol Smoking Weight / param=GLM;
   model FakeY = BP Chol Smoking Weight;
run;
ods exclude none;
 
ods graphics / width=500px height=800px;
proc iml;  /* use HEATMAPDISC call to create heat map of design */
use Design;  read all var _NUM_ into X[c=varNames];  close;
run HeatmapDisc(X) title="GLM Design Matrix"
     xvalues=varNames displayoutlines=0 colorramp={"White" "Black"};
QUIT;
Design matrix for the GLM parameterization in SAS

Click on the heat map to enlarge it. Each row of the design matrix indicates a patient in a research study. If any explanatory variable has a missing value, the corresponding row of the design matrix is missing (shown as gray). In the design matrix for the GLM parameterization, a categorical variable with k levels is represented by k columns. The black and white heat map shows the structure of the design matrix. Black indicates a 1 and white indicates a 0. In particular:

  • The first column is all black, which indicates the intercept column.
  • Columns 2-4 represent the BP variable. Each row has one black rectangle in one of those columns. You can see that there are few black squares in column 4, which indicates that few patients in the study have optimal blood pressure.
  • In a similar way, you can see that there are many nonsmokers (column 11) in the study. There are also many overweight patients (column 14) and few underweight patients (column 15).

The GLM parameterization is called a "singular parameterization" because it contains redundant columns. For example, the BP_Optimal column is redundant because that column contains a 1 only when the BP_High and BP_Normal columns are both 0. Similarly, if either the BP_High or the BP_Normal column is 1, then BP_Optimal is automatically 0. The next section removes the redundant columns.
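You can verify the redundancy numerically. The following SAS/IML sketch assumes the Design data set created earlier and that columns 2-4 are the BP dummy variables (as described in the list above); it checks that the BP dummy columns sum to the intercept column for every complete row:

proc iml;
use Design;  read all var _NUM_ into X[c=varNames];  close;
rows = loc( countmiss(X, "row")=0 );     /* indices of rows without missing values */
Z = X[rows, ];
sumBP = Z[ , 2:4][ , +];                 /* row sums of the three BP dummy columns */
isRedundant = all( sumBP = Z[ ,1] );     /* 1 if the sums equal the intercept column */
print isRedundant;
QUIT;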

Heat maps of design matrices: Reference parameterization

The reference parameterization produces a binary design matrix that contains only the linearly independent columns of the GLM design matrix. You can generate it by using PARAM=REF in the CLASS statement, as follows:

ods exclude all;  /* use OUTDESIGN= option to write the design matrix to a data set */
proc glmselect data=Temp outdesign(fullmodel)=Design(drop=FakeY);
   class BP Chol Smoking Weight / param=REF;
   model FakeY = BP Chol Smoking Weight;
run;
ods exclude none;
Design matrix for the REFERENCE parameterization in SAS

Again, you can use the HEATMAPDISC call in PROC IML to create the heat map. The matrix is similar, but each categorical variable that has k levels is represented by k–1 dummy variables. Because no reference level was specified in the CLASS statement, the last level of each category is used as the reference level. Thus the REFERENCE design matrix is similar to the GLM design, except that the last column for each categorical variable has been dropped. For example, there are columns for BP_High and BP_Normal, but no column for BP_Optimal.
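If you prefer a different reference level, you can use the REF= suboption in the CLASS statement. The following sketch (the output data set name DesignRef and the particular reference levels are my choices; they mirror the EFFECT example in the next section) specifies the reference level for each variable:

ods exclude all;
proc glmselect data=Temp outdesign(fullmodel)=DesignRef(drop=FakeY);
   class BP(ref='Normal') Chol(ref='Desirable')
         Smoking(ref='Non-smoker') Weight(ref='Normal') / param=REF;
   model FakeY = BP Chol Smoking Weight;
run;
ods exclude none;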

Nonbinary designs: The EFFECT parameterization

The previous design matrices were binary 0/1 matrices. The EFFECT parameterization, which is the default parameterization for PROC LOGISTIC, creates a nonbinary design matrix. In the EFFECT parameterization, the reference level is represented by a -1 in each of the variable's columns, whereas a nonreference level is represented by a 1 in the corresponding column. Thus the design matrix contains three values: -1, 0, and 1.
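For example, if 'Normal' is the reference level for the BP variable (as in the REF= suboption used below), the two EFFECT-coded columns for BP encode the three levels as follows (a worked illustration, not procedure output):

   BP level        column for 'High'   column for 'Optimal'
   High                    1                    0
   Optimal                 0                    1
   Normal (ref)           -1                   -1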

If you do not specify the reference levels, the last level for each categorical variable is used, just as for the REFERENCE parameterization. The following statements generate an EFFECT design matrix and use the REF= suboption to specify the reference level. Again, you can use the HEATMAPDISC subroutine to display a heat map for the design. For this visualization, light blue is used to indicate -1, white for 0, and black for 1.

ods exclude all;  /* use OUTDESIGN= option to write the design matrix to a data set */
proc glmselect data=Temp outdesign(fullmodel)=Design(drop=FakeY);
   class BP(ref='Normal') Chol(ref='Desirable') 
         Smoking(ref='Non-smoker') Weight(ref='Normal') / param=EFFECT;
   model FakeY = BP Chol Smoking Weight;
run;
ods exclude none;
 
proc iml;  /* use HEATMAPDISC call to create heat map of design */
use Design; read all var _NUM_ into X[c=varNames]; close;
run HeatmapDisc(X) title="Effect Design Matrix"
     xvalues=varNames displayoutlines=0 colorramp={"LightBlue" "White" "Black"};
QUIT;
Design matrix for the EFFECT parameterization in SAS

In this graph, blue indicates that the patient's value was the reference category. For a nonreference category, a black rectangle appears in the column for that level and the variable's remaining columns are white. For me, this design matrix takes some practice to "read." For example, compared to the GLM matrix, it is harder to determine the most frequent levels of a categorical variable.
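If the goal is simply to find the most frequent levels, a quick one-way frequency table of the original variables (a minimal sketch that uses the Temp view defined earlier) gives the counts directly:

proc freq data=Temp;
   tables BP Chol Smoking Weight;   /* level counts complement the heat map */
run;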

Heat maps in Base SAS

In these examples, I have used the HEATMAPDISC subroutine in SAS/IML to visualize the design matrices. But you can also create heat maps in Base SAS.

If you have SAS 9.4m3, you can use the HEATMAPPARM statement in PROC SGPLOT to create these heat maps. First you have to convert the data from wide form to long form, which you can do by using the following DATA step:

/* convert from wide (matrix) to long (row, col, value)*/
data Long;
set Design;
array dummy[*] _NUMERIC_;
do varNum = 1 to dim(dummy);
   rowNum = _N_;
   value = dummy[varNum];
   output;
end;
keep varNum rowNum value;
run;
 
proc sgplot data=Long;
/* the observation values are in the order {1, 0, -1}; use STYLEATTRIBS to set colors */
styleattrs  DATACOLORS=(Black White LightBlue);
heatmapparm x=varNum y=rowNum colorgroup=value / showxbins discretex;
xaxis type=discrete; /* values=(1 to 11)  valuesdisplay=("A" "B" ... "J" "K"); */
yaxis reverse;
run;

The heat map is similar to the one in the previous section, except that the columns are labeled 1, 2, 3, and so forth. If you want the columns to contain the variable names, use the VALUESDISPLAY= option, as shown in the comments.
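As a concrete sketch of the VALUESDISPLAY= approach (the PROC SQL step and the dispNames macro variable are additions of mine, and the list of 11 values assumes the EFFECT design shown above), you can build a quoted list of the design column names and pass it to the axis:

proc sql noprint;   /* quoted, space-separated list of the design column names */
   select quote(strip(name)) into :dispNames separated by ' '
   from dictionary.columns
   where libname='WORK' and memname='DESIGN' and type='num'
   order by varnum;
quit;
 
proc sgplot data=Long;
   styleattrs DATACOLORS=(Black White LightBlue);
   heatmapparm x=varNum y=rowNum colorgroup=value / showxbins discretex;
   xaxis type=discrete values=(1 to 11) valuesdisplay=(&dispNames);
   yaxis reverse;
run;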

If you are running an earlier version of SAS, you will need to use the Graph Template Language (GTL) to create a template for the discrete heat maps.

In summary, you can use the OUTDESIGN= option in PROC GLMSELECT to create design matrices that use dummy variables to encode classification variables. If you have SAS/IML, you can use the HEATMAPDISC subroutine to visualize the design matrix. Otherwise, you can use the HEATMAPPARM statement in PROC SGPLOT (SAS 9.4m3) or the GTL to create the heat maps. The visualization is useful for teaching and understanding the different parameterization schemes for classification variables.

The post Visualize a design matrix appeared first on The DO Loop.

April 24, 2017
 

There are several ways to visualize data in a two-way ANOVA model. Most visualizations show a statistical summary of the response variable for each category. However, for small data sets, it can be useful to overlay the raw data. This article shows a simple trick that you can use to combine two categorical variables and plot the raw data for the joint levels of the two categorical variables.

An ANOVA for two-way interactions

Recall that an ANOVA (ANalysis Of VAriance) model is used to understand differences among group means and the variation among and between groups. The documentation for the ROBUSTREG procedure in SAS/STAT contains an example that compares the traditional ANOVA using PROC GLM with a robust ANOVA that uses PROC ROBUSTREG. The response variable is the survival time (Time) for 16 mice who were randomly assigned to different combinations of two successive treatments (T1, T2). (Higher times are better.) The data are shown below:

data recover;
input  T1 $ T2 $ Time @@;
datalines;
0 0 20.2  0 0 23.9  0 0 21.9  0 0 42.4
1 0 27.2  1 0 34.0  1 0 27.4  1 0 28.5
0 1 25.9  0 1 34.5  0 1 25.1  0 1 34.2
1 1 35.0  1 1 33.9  1 1 38.3  1 1 39.9
;

The response variable depends on the joint levels of the binary variables T1 and T2. A first attempt to visualize the data in SAS might be to create a box plot of the four combinations of T1 and T2. You can do this by assigning T1 to be the "category" variable and T2 to be a "group" variable in a clustered box plot, as follows:

title "Response for Two Groups";
title2 "Use VBOX Statement with Categories and Groups";
proc sgplot data=recover;
   vbox Time / category=T1 group=T2;
run;
Box plots for a binary 'category' variable and a binary 'group' variable

The graph shows the distribution of response for the four joint combinations of T1 and T2. The graph is a little hard to interpret because the category levels are 0/1. The two box plots on the left are for T1=0, which means "Did not receive the T1 treatment." The two box plots on the right are for mice who received the T1 treatment. Within those clusters, the blue boxes indicate the distribution of responses for the mice who did not receive the T2 treatment, whereas the red boxes indicate the response distribution for mice that did receive T2. Both treatments seem to increase the mean survival time for mice, and receiving both treatments seems to give the highest survival times.

Interpreting the graph took a little thought. Also, the colors seem somewhat arbitrary. I think the graph could be improved if the category labels indicate the joint levels. In other words, I'd prefer to see a box plot of the levels of interaction variable T1*T2. If possible, I'd also like to optionally plot the raw response values.

Method 1: Use the EFFECTPLOT statement

The LOGISTIC and GENMOD procedures in SAS/STAT support the EFFECTPLOT statement. Many other SAS regression procedures support the STORE statement, which enables you to save a regression model and then use the PLM procedure (which supports the EFFECTPLOT statement). The EFFECTPLOT statement can create a variety of plots for visualizing regression models, including a box plot of the joint levels for two categorical variables, as shown by the following statements:

/* Use the EFFECTPLOT statement in PROC GENMOD, or use the STORE statement and PROC PLM */
proc genmod data=recover;
   class T1 T2;
   model Time = T1 T2 T1*T2;
   effectplot box / cluster;
   effectplot interaction /  obs(jitter);  /* or use interaction plot to see raw data */
run;
Box plots of joint levels created by the EFFECTPLOT statement in SAS

The resulting graph uses box plots to show the schematic distribution of each of the joint levels of the two categorical variables. (The second EFFECTPLOT statement creates an "interaction plot" that shows the raw values and mean responses.) The means of each group are connected, which makes it easier to compare adjacent means. The labels indicate the levels of the T1*T2 interaction variable. I think this graph is an improvement over the previous multi-colored box plot, and I find it easier to read and interpret.

Although the EFFECTPLOT statement makes it easy to create this plot, the EFFECTPLOT statement does not support overlaying raw values on the box plots. (You can, however, see the raw values on the "interaction plot".) The next section shows an alternative way to create the box plots.
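If your regression procedure does not support the EFFECTPLOT statement directly, you can take the STORE route mentioned earlier. The following is a minimal sketch (the item store name RecoverModel is arbitrary) that fits the model in PROC GLM and re-creates the box plot with PROC PLM:

proc glm data=recover;
   class T1 T2;
   model Time = T1 T2 T1*T2;
   store RecoverModel;          /* save the fitted model as an item store */
quit;
 
proc plm restore=RecoverModel;
   effectplot box / cluster;    /* box plot of the joint levels, as before */
run;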

Method 2: Concatenate values to form joint levels of categories

You can explicitly form the interaction variable (T1*T2) by using the CATX function to concatenate the T1 and T2 variables, as shown in the following DATA step view. Because the levels are binary-encoded, the resulting levels are '0 0', '0 1', '1 0', and '1 1'. You can define a SAS format to make the joint levels more readable. You can then display the box plots for the interaction variable and, optionally, overlay the raw values:

data recover2 / view=recover2;
length Treatment $3;          /* specify length of concatenated variable */
set recover;
Treatment = catx(' ',T1,T2);  /* combine into one group */
run;
 
proc format;                  /* make the joint levels more readable */
  value $ TreatFmt '0 0' = 'Control'
                   '1 0' = 'T1 Only'
                   '0 1' = 'T2 Only'
                   '1 1' = 'T1 and T2';
run;
 
proc sgplot data=recover2 noautolegend;
   format Treatment $TreatFmt.;
   vbox Time / category=Treatment;
   scatter x=Treatment y=Time / jitter markerattrs=(symbol=CircleFilled size=10);
   xaxis discreteorder=data;
run;
Distribution of response variable in two-way ANOVA: box plots and raw data overlaid

By manually concatenating the two categorical variables to form a new interaction variable, you have complete control over the plot. You can also overlay the raw data, as shown. The raw data indicates that the "Control" group seems to contain an outlier: a mouse that lived longer than would be expected for its treatment. Using PROC ROBUSTREG to compute a robust ANOVA is one way to deal with extreme outliers in the ANOVA setting.
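As a follow-up sketch of that robust approach (based on the ROBUSTREG documentation example mentioned above; the procedure uses M estimation by default, which downweights extreme observations):

proc robustreg data=recover;
   class T1 T2;
   model Time = T1 T2 T1*T2 / diagnostics;   /* DIAGNOSTICS requests outlier diagnostics */
run;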

In summary, the EFFECTPLOT statement enables you to quickly create box plots that show the response distribution for joint levels of two categorical variables. However, sometimes you might want more control, such as the ability to format the labels or overlay the raw data. This article shows how to use the CATX function to manually create a new variable that contains the joint categories.

The post Visualize an ANOVA with two-way interactions appeared first on The DO Loop.