Statistical Graphics

5月 162018

In my article about how to construct calibration plots for logistic regression models in SAS, I mentioned that there are several popular variations of the calibration plot. The previous article showed how to construct a loess-based calibration curve. Austin and Steyerberg (2013) recommend the loess-based curve on the basis of an extensive simulation study. However, some practitioners prefer to use a decile calibration plot. This article shows how to construct a decile-based calibration curve in SAS.

The decile calibration plot

The decile calibration plot is a graphical analog of the Hosmer-Lemeshow goodness-of-fit test for logistic regression models. The subjects are divided into 10 groups by using the deciles of the predicted probability of the fitted logistic model. Within each group, you compute the mean predicted probability and the mean of the empirical binary response. A calibration plot is a scatter plot of these 10 ordered pairs, although most calibration plots also include the 95% confidence interval for the proportion of the binary responses within each group. Many calibration plots connect the 10 ordered pairs with piecewise line segments, others use a loess curve or a least squares line to smooth the points.

Create the decile calibration plot in SAS

The previous article simulated 500 observations from a logistic regression model logit(p) = b0 + b1*x + b2*x2 where x ~ U(-3, 3). The following call to PROC LOGISTIC fits a linear model to these simulated data. That is, the model is intentionally misspecified. A call to PROC RANK creates a new variable (Decile) that identifies the deciles of the predicted probabilities for the model. This variable is used to compute the means of the predicted probabilities and the empirical proportions (and 95% confidence intervals) for each decile:

/* Use PROC LOGISTIC and output the predicted probabilities.
   Intentionally MISSPECIFY the model as linear. */
proc logistic data=LogiSim noprint;
   model Y(event='1') = x;
   output out=LogiOut predicted=PredProb; /* save predicted probabilities in data set */
/* To construct the decile calibration plot, identify deciles of the predicted prob. */
proc rank data=LogiOut out=LogiDecile groups=10;
   var PredProb;
   ranks Decile;
/* Then compute the mean predicted prob and the empirical proportions (and CI) for each decile */
proc means data=LogiDecile noprint;
   class Decile;
   types Decile;
   var y PredProb;
   output out=LogiDecileOut mean=yMean PredProbMean
          lclm=yLower uclm=yUpper;
title "Calibration Plot for Misspecified Model";
title2 "True Model Is Quadratic; Fit Is Linear";
proc sgplot data=LogiDecileOut noautolegend aspect=1;
   lineparm x=0 y=0 slope=1 / lineattrs=(color=grey pattern=dash);
   *loess x=PredProbMean y=yMean;  /* if you want a smoother based on deciles */
   series x=PredProbMean y=yMean;  /* if you to connect the deciles */
   scatter x=PredProbMean y=yMean / yerrorlower=yLower yerrorupper=yUpper;
   yaxis label="Observed Probability of Outcome";
   xaxis label="Predicted Probability of Outcome";
Decile calibration curve for a misspecified logistic regression model

The diagonal line is the line of perfect calibration. In a well-calibrated model, the 10 markers should lie close to the diagonal line. For this example, the graph indicates that the linear model does not fit the data well. For the first decile of the predicted probability (the lowest predicted-risk group), the observed probability of the event is much higher than the mean predicted probability. For the fourth, sixth, and seventh deciles, the observed probability is much lower than the mean predicted probability. For the tenth decile (the highest predicted-risk group), the observed probability is higher than predicted. By the way, this kind of calibration is sometimes called internal calibration because the same observations are used to fit and assess the model.

The decile calibration plot for a correctly specified model

You can fit a quadratic model to the data to see how the calibration plot changes for a correctly specified model. The results are shown below. In this graph, all markers are close to the diagonal line, which indicates a very close agreement between the predicted and observed probabilities of the event.

Decile calibration curve for a correctly specified logistic regression model

Should you use the decile calibration curve?

The decile-based calibration plot is popular, perhaps because it is so simple that it can be constructed by hand. Nevertheless, Austin and Steyerberg (2013) suggest using the loess-based calibration plot instead of the decile-based plot. Reasons include the following:

  • The use of deciles results in estimates that "display greater variability than is evident in the loess-based method" (p. 524).
  • Several researchers have argued that the use of 10 deciles is arbitrary. Why not use five? Or 15? In fact, the results of the Hosmer-Lemeshow test "can depend markedly on the number of groups, and there's no theory to guide the choice of that number." (P. Allison, 2013. "Why I Don’t Trust the Hosmer-Lemeshow Test for Logistic Regression")

Many leading researchers in logistic regression do not recommend the Hosmer-Lemeshow test for these reasons. The decile-based calibration curve shares the same drawbacks. Since SAS can easily create the loess-based calibration curve (see the previous article), there seems to be little reason to prefer the decile-based version.

The post Decile calibration plots in SAS appeared first on The DO Loop.

5月 142018

A logistic regression model is a way to predict the probability of a binary response based on values of explanatory variables. It is important to be able to assess the accuracy of a predictive model. This article shows how to construct a calibration plot in SAS. A calibration plot is a goodness-of-fit diagnostic graph. It enables you to qualitatively compare a model's predicted probability of an event to the empirical probability.

Calibration curves

There are two standard ways to assess the accuracy of a predictive model for a binary response: discrimination and calibration. To assess discrimination, you can use the ROC curve. Discrimination involves counting the number of true positives, false positive, true negatives, and false negatives at various threshold values. In contrast, calibration curves compare the predicted probability of the response to the empirical probability. Calibration can help diagnose lack of fit. It also helps to rank subjects according to risk.

This article discusses internal calibration, which is the agreement on the sample used to fit the model. External calibration, which involves comparing the predicted and observed probabilities on a sample that was not used to fit the model, is not discussed.

In the literature, there are two types of calibration plots. One uses a smooth curve to compare the predicted and empirical probabilities. The curve is usually a loess curve, but sometimes a linear regression curve is used. The other (to be discussed in a future article) splits the data into deciles. An extensive simulation study by Austin and Steyerberg (2013) concluded that loess-based calibration curves have several advantages. This article discusses how to use a loess fit to construct a calibration curve.

A calibration curve for a misspecified model

Calibration curves help to diagnose lack of fit. This example simulates data according to a quadratic model but create a linear fit to that data. The calibration curve should indicate that the model is misspecified.

Step 1: Simulate data for a quadratic logistic regression model

Austin and Steyerberg (2013) use the following simulated data, which is based on the earlier work of Hosmer et al. (1997). Simulate X ~ U(-3, 3) and define η = b0 + b1*X + b2*X2) to be a linear predictor, where b0 = -2.337, b1 = 0.8569, b2 = 0.3011. The logistic model is logit(p) = η, where p = Pr(Y=1 | x) is the probability of the binary event Y=1. The following SAS DATA step simulates N=500 observations from this logistic regression model:

/* simulate sample of size N=500 that follows a quadratic model from Hosmer et al. (1997) */
%let N = 500;                              /* sample size */ 
data LogiSim(keep=Y x);
call streaminit(1234);                     /* set the random number seed */
do i = 1 to &N;                            /* for each observation in the sample */
   /* Hosmer et al. chose x ~ U(-3,3). Use rand("Uniform",-3,3) for modern versions of SAS */
   x = -3 + 6*rand("Uniform");
   eta = -2.337 + 0.8569*x + 0.3011*x**2;  /* quadratic model used by Hosmer et al. */
   mu = logistic(eta);
   Y = rand("Bernoulli", mu);

For this simulated data, the "true" model is known to be quadratic. Of course, in practice, the true underlying model is unknown.

Step 2: Fit a logistic model

The next step is to fit a logistic regression model and save the predicted probabilities. The following call to PROC LOGISTIC intentionally fits a linear model. The calibration plot will indicate that the model is misspecified.

/* Use PROC LOGISTIC and output the predicted probabilities.
   Intentionally MISSPECIFY the model as linear. */
proc logistic data=LogiSim plots=effect;
   model Y(event='1') = x;
   output out=LogiOut predicted=PredProb; /* save predicted probabilities in data set */
Plot of predicted probabilities for a logistic regression model in SAS

As shown above, PROC LOGISTIC can automatically create a fit plot for a regression that involves one continuous variable. The fit plot shows the observed responses, which are plotted at Y=0 (failure) or Y=1 (success). The predicted probabilities are shown as a sigmoidal curve. The predicted probabilities are in the range (0, 0.8). The next section creates a calibration plot, which is a graph of the predicted probability versus the observed response.

Create the calibration plot

The output from PROC LOGISTIC includes the predicted probabilities and the observed responses. Copas (1983), Harrell et al. (1984), and others have suggested assessing calibration by plotting a scatter plot smoother and overlaying a diagonal line that represents the line of perfect calibration. If the smoother lies close to the diagonal, the model is well calibrated. If there is a systematic deviation from the diagonal line, that indicates that the model might be misspecified.

The following statements sort the data by the predicted probabilities and create the calibration plot. (The sort is optional for the loess smoother, but is useful for other analyses.)

proc sort data=LogiOut;  by PredProb;  run;
/* let the data choose the smoothing parameter */
title "Calibration Plot for Misspecified Model";
title2 "True Model Is Quadratic; Fit Is Linear";
proc sgplot data=LogiOut noautolegend aspect=1;
   loess x=PredProb y=y / interpolation=cubic clm;   /* smoothing value by AICC (0.657) */
   lineparm x=0 y=0 slope=1 / lineattrs=(color=grey pattern=dash);
Calibration plot for a misspecified logistic model

The loess smoother does not lie close to the diagonal line, which indicates that the linear model is not a good fit for the quadratic data. The loess curve is systematically above the diagonal line for very small and very large values of the predicted probability. This indicates that the empirical probability (the proportion of events) is higher than predicted on these intervals. In contrast, when the predicted probability is in the range [0.1, 0.45], the empirical probability is lower than predicted.

Several variations of the calibration plot are possible:

  • Add the NOMARKERS option to the LOESS statement to suppress the observed responses.
  • Omit the CLM option if you do not want to see a confidence band.
  • Use the SMOOTH=0.75 option if you want to specify the value for the loess smoothing parameter. The value 0.75 is the value that Austin and Steyerberg used in their simulation study, in part because that is the default parameter value in the R programming language.
  • Use a REG statement instead of the LOESS statement if you want a polynomial smoother.

A calibration plot for a correctly specified model

In the previous section, the model was intentionally misspecified. The calibration curve for the example did not lie close to the diagonal "perfect calibration" line, which indicates a lack of fit. Let's see how the calibration curve looks for the same data if you include a quadratic term in the model. First, generate the predicted probabilities for the quadratic model, then create the calibration curve for the model:

proc logistic data=LogiSim noprint;
   model Y(event='1') = x x*x;  /* fit a quadratic model */
   output out=LogiOut2 predicted=PredProb;
proc sort data=LogiOut2;  by PredProb;  run;
title "Calibration Plot for a Correct Model";
proc sgplot data=LogiOut2 noautolegend aspect=1;
   loess x=PredProb y=y / interpolation=cubic clm nomarkers;
   lineparm x=0 y=0 slope=1 / lineattrs=(color=grey pattern=dash);
   yaxis grid; xaxis grid;
Calibration plot for a correctly specified logistic model

This calibration plot indicates that the quadratic model fits the data well. The calibration curve is close to the diagonal reference line, which is the line of perfect calibration. The curve indicates that the predicted and empirical probabilities are close to each other for low-risk, moderate-risk, and high-risk subjects.

It is acceptable for the calibration curve to have small dips, bends, and wiggles. The extent that the curve wiggles is determined by the (automatically chosen) smoothing parameter for the loess algorithm, The value of the smoothing parameter is not displayed by PROC SGPLOT, but you can use PROC LOESS to display the smoothing parameter and other statistics for the loess smoother.

According to Austin and Steyerberg (2013), the calibration curve can detect misspecified models when the magnitude of the omitted term (or terms) is moderate to strong. The calibration curve also behaves better for large samples and when the incidence of outcomes is close to 50% (as opposed to having rare outcomes). Lastly, there is a connection between discrimination and calibration: The calibration curve is most effective in models for which the discrimination (as measured by the C-statistic or area under the ROC curve) is good.

In summary, this article shows how to construct a loess-based calibration curve for logistic regression models in SAS. To create a calibration curve, use PROC LOGISTIC to output the predicted probabilities for the model and plot a loess curve that regresses the observed responses onto the predicted probabilities.

My next article shows how to construct a popular alternative to the loess-based calibration curve. In the alternative version, the subjects are ranked into deciles according to the predicted probability or risk.

Further reading

The post Calibration plots in SAS appeared first on The DO Loop.

5月 022018

Order matters. When you create a graph that has a categorical axis (such as a bar chart), it is important to consider the order in which the categories appear. Most software defaults to alphabetical order, which typically gives no insight into how the categories relate to each other.

Alphabetical order rarely adds any insight to the data visualization. For a one-dimensional graph (such as a bar chart) software often provides alternative orderings. For example, in SAS you can use the CATEGORYORDER= option to sort categories of a bar chart by frequency. PROC SGPLOT also supports the DISCRETEORDER= option on the XAXIS and YAXIS statements, which you can use to specify the order of the categories.

It is much harder to choose the order of variables for a two-dimensional display such as a heat map (for example, of a correlation matrix) or a matrix of scatter plots. Both of these displays are often improved by strategically choosing an order for the variables so that adjacent levels have similar properties. Catherine Hurley ("Clustering Visualizations of Multidimensional Data", JCGS, 2004) discusses several ways to choose a particular order for p variables from among the p! possible permutations. This article implements the single-link clustering technique for ordering variables. The same technique is applicable for ordering levels of a nominal categorical variable.

Order variables in two-dimensional displays

I first implemented the single-link clustering method in 2008, which was before I started writing this blog. Last week I remembered my decade-old SAS/IML program and decided to blog about it. This article discusses how to use the program; see Hurley's paper for details about how the program works.

To illustrate ordering a set of variables, the following program creates a heat map of the correlation matrix for variables from the Sashelp.Cars data set. The variables are characteristics of motor vehicles. The rows and columns of the matrix display the variables in alphabetical order. The (i,j)th cell of the heat map visualizes the correlation between the i_th and j_th variable:

proc iml;
/* create correlation matric for variables in alphabetical order */
varNames = {'Cylinders' 'EngineSize' 'Horsepower' 'Invoice' 'Length' 
            'MPG_City' 'MPG_Highway' 'MSRP' 'Weight' 'Wheelbase'};
use Sashelp.Cars;  read all var varNames into Y;  close;
corr = corr(Y);
call HeatmapCont(corr) xvalues=varNames yvalues=varNames 
            colorramp=colors range={-1.01 1.01} 
            title="Correlation Matrix: Variables in Alphabetical Order";
Heat map of correlation matrix; variables in alphabetical order

The heat map shows that the fuel economy variables (MPG_City and MPG_Highway) are negatively correlated with most other variables. The size of the vehicle and the power of the engine are positively correlated with each other. The correlations with the price variables (Invoice and MSRP) are small.

Although this 10 x 10 heat map visualizes all pairwise correlations, it is possible to permute the variables so that highly correlated variables are adjacent to each other. The single-link cluster method rearranges the variables by using a symmetric "merit matrix." The merit matrix can be a correlation matrix, a distance matrix, or some other matrix that indicates how the i_th category is related to the j_th category.

The following statements assume that the SingleLinkFunction is stored in a module library. The statements load and call the SingleLinkCluster function. The "merit matrix" is the correlation matrix so the algorithm tries to put strongly positively correlated variables next to each other. The function returns a permutation of the indices. The permutation induces a new order on the variables. The correlation matrix for the new order is then displayed:

/* load the module for single-link clustering */
load module=(SingleLinkCluster);
permutation = SingleLinkCluster( corr );  /* permutation of 1:p */
V = varNames[permutation];                /* new order for variables */
R = corr[permutation, permutation];       /* new order for matrix */
call HeatmapCont(R) xvalues=V yvalues=V 
            colorramp=colors range={-1.01 1.01} 
            title="Correlation: Variables in Clustered Order";
Heat map of correlation matrix; variables in clustered order

In this permuted matrix, the variables are ordered according to their pairwise correlation. For this example, the single-link cluster algorithm groups together the three "size" variables (Length, Wheelbase, and Weight), the three "power" variables (EngineSize, Cylinders, and Horsepower), the two "price" variables (MSRP and Invoice), and the two "fuel economy" variables (MPG_Highway and MPG_City). In this representation, strong positive correlations tend to be along the diagonal, as evidenced by the dark green colors near the matrix diagonal.

The same order is useful if you want to construct a huge scatter plot matrix for these variables. Pairs of variables that are "interesting" tend to appear near the diagonal. In this case, the definition of "interesting" is "highly correlated," but you could choose some other statistic to build the "merit matrix" that is used to cluster the variables. The scatter plot matrix is shown below. Click to enlarge.

Scatter plot matrix of 10 variables. The highly correlated variable tend to appear together so that diagonal scatter plot exhibit high correlation.

Other merit matrices

I remembered Hurley's article while I was working on an article about homophily in married couples. Recall that homophily is the tendency for similar individuals to associate, or in this case to marry partners that have similar academic interests. My heat map used the data order for the disciplines, but I wondered whether the order could be improved by using single-link clustering. The data in the heat map are not symmetric, presumably because women and men using different criteria to choose their partners. However, you can construct the "average homophily matrix" in a natural way. If A is any square matrix then (A` + A)/2 is a symmetric matrix that averages the lower- and upper-triangular elements of A. The following image shows the "average homophily" matrix with the college majors ordered by clusters:

The heat map shows the "averaged data," but in a real study you would probably want to display the real data. You can see that green cells and red cells tend to occur in blocks and that most of the dark green cells are close to the diagonal. In particular, the order of adjacent majors on an axis gives information about affinity. For example, here are a few of the adjacent categories:

  • Biology and Medical/Health
  • Engineering tech, computers, and engineering
  • Interdisciplinary studies, philosophy, and theology/religion
  • Math/statistics and physical sciences
  • Communications and business


In conclusion, when you create a graph that has a nominal categorical axis, you should consider whether the graph can be improved by ordering the categories. The default order, which is often alphabetical, makes it easy to locate a particular category, but there are other orders that use the data to reveal additional structure. This article uses the single-link cluster method, but Hurley (2004) mentions several other methods.

The post Order variables in a heat map or scatter plot matrix appeared first on The DO Loop.

4月 232018

You've probably heard about the "80-20 Rule," which describes many natural and manmade phenomena. This rule is sometimes called the "Pareto Principle" because it was discovered by Vilfredo Pareto (1848–1923) who used it to describe the unequal distribution of wealth. Specifically, in his study, 80% of the wealth was held by 20% of the population. The unequal distribution of effort, resources, and other quantities can often be described in terms of the Pareto distribution, which has a parameter that controls the inequity. Whereas some data seem to obey an 80-20 rule, other data are better described as following a "70-30 rule" or a "60-40 rule," and so on. (Although the two numbers do not need to add to 100, it makes the statement pleasingly symmetric.)

I thought about the Pareto distribution recently when I was looking at some statistics for the SAS blogs. I found myself wondering whether 80% of the traffic to a blog is generated by 20% of its posts. (Definition: A blog is a collection of articles. Each individual article is a post.) As you can imagine, some posts are more popular than others. Some posts are timeless. They appear in internet searches when someone searches for a particular statistical or programming technique, even though they were published long ago. Other posts (such as Christmas-themed posts) do not get much traffic from internet searches. They generate traffic for a week or two and then fade into oblivion.

To better understand how various blogs at SAS follow the Pareto Principle, I downloaded data for seven blogs during a particular time period. I then kept only the top 100 posts for each blog.

The following line plot shows the pageviews for one blog. The horizontal axis indicates the posts for this blog, ranked by popularity. (The most popular post is 1, the next most popular post is 2, and so forth.) This blog has four very popular posts that each generated more than 7,000 pageviews during the time period. Another group of four posts were slightly less popular (between 4,000 and 5,000 pageviews). After those eight "blockbuster" posts are the rank-and-file posts that were viewed less than 3,000 times during the time period.

Long-tailed distribution of pageviews for a blog

The pageviews for the other blogs look similar. This distribution is also common in book-sales data: most books sell only a few thousand copies whereas the best-sellers (think John Grisham) sell hundreds of millions of copies. Movie revenue is another example that follows this distribution.

The distribution is a long-tailed distribution, a fact that becomes evident if you graph a histogram of the underlying quantity. For the blog, the following histogram shows the distribution of the pageviews for the top 100 posts:

Long-tailed distribution of pageviews for a blog

Notice the "power law" nature of the distribution for the first few bars of the histogram. The height of each bar is about 0.4 of the previous height. About 57% of the blog posts had less than 1,000 pageviews. Another 22% had between 1,000 and 2,000 pageviews. The number of rank-and-file posts in each category decreases like a power law, but then the blockbuster posts start to appear. These popular posts give the distribution a very long tail.

Because some blogs (like the one pictured) attract thousands of readers whereas other blogs have fewer readers, you need to standardize the data if you want to compare the distributions for several blogs. Recall that the Pareto Principle is a statement about cumulative percentages. The following graph shows the cumulative percentages of pageviews for seven blogs at SAS (based on the Top 100 posts):

Cumulative percentages of pageviews for seven blogs

The graph shows a dashed line with slope –1 that is overlaid on the cumulative percentage curves. The places where the dashed line intersects a curve are the "Pareto locations" for which Y% of the pageviews are generated by the first (1-Y)% most popular blog posts. In general, these blogs appear to satisfy a "70-30 rule" because about 70% of the pageviews are generated by the 30 / 100 most popular posts. There is some variation between blogs, with the upper curve following a "72-28 rule" and the lower curve satisfying a "63-37 rule."

All this is approximate and is based on only the top 100 posts for each blog. For a more rigorous analysis, you could use PROC UNIVARIATE or PROC NLMIXED to fit the parameters of the Pareto distribution to the data for each blog. However, I am happy to stop here. In general, blogs at SAS satisfy an approximate "70-30 rule," where 70% of the traffic is generated by the top 30% of the posts.

The post The 80-20 rule for blogs appeared first on The DO Loop.

3月 262018
Zipper plot for normally distributed data

Simulation studies are used for many purposes, one of which is to examine how distributional assumptions affect the coverage probability of a confidence interval. This article describes the "zipper plot," which enables you to compare the coverage probability of a confidence interval when the data do or do not follow certain distributional assumptions.

I saw the zipper plot in a tutorial about how to design, implement, and report the results of simulation studies (Morris, White, Crowther, 2017). An example of a zipper plot is shown to the right. The main difference between the zipper plot and the usual plot of confidence intervals is that the zipper plot sorts the intervals by their associated p-values, which causes the graph to look like a garment's zipper. The zipper plot in this blog post is slightly different from the one that is presented in Morris et al. (2017, p. 22), as discussed at the end of this article.

Coverage probabilities for confidence intervals

A formulas for a confidence interval (CI) is based on an assumption about the distribution of the parameter estimates on a simple random sample. For example, the two-sided 95% confidence interval for the mean of normally distributed data has upper and lower limits given by the formula
x̄ ± t* s / sqrt(n)
where x̄ is the sample mean, s is the sample standard deviation, n is the sample size, and t* is the 97.5th percentile of the t distribution with n – 1 degrees of freedom. You can confirm that the formula is a 95% CI by simulating many N(0,1) samples and computing the CI for each sample. About 95% of the samples should contain the population mean, which is 0.

The central limit theorem indicates that the formula is asymptotically valid for sufficiently large samples nonnormal data. However, for small nonnormal samples, the true coverage probability of those intervals might be different from 95%. Again, you can estimate the coverage probability by simulating many samples, constructing the intervals, and counting how many time the mean of the distribution is inside the confidence interval. A previous article shows how to use simulation to estimate the coverage probability of this formula.

Zipper plots and simulation studies of confidence intervals

Zipper plot that compares the empirical coverage probability of confidence intervals that assume normality. The data samples are size N=50 drawn from the normal or exponential distribution.

The graph to the right shows the zipper plot applied to the results of the two simulations in the previous article. The zipper plot on the left visualizes the estimate of the coverage probability 10,000 for normal samples of size n = 50. (The graph at the top of this article provides more detail.) The Monte Carlo estimate of the coverage probability of the CIs is 0.949, which is extremely close to the true coverage of 0.95. The Monte Carlo 95% confidence interval, shown by a blue band, is [0.945, 0.953], which includes 0.95. You can see that the width of the confidence intervals does not seem to depend on the sample mean: the samples whose means are "too low" tend to produce intervals that are about the same length as the intervals for the samples whose means are "too high."

The zipper plot on the right is for 10,000 random samples of size n = 50 that are drawn from an exponential distribution with mean 0 and unit scale parameter. (The graph at this link provides more detail.) The Monte Carlo estimate of the coverage probability is 0.9360. The Monte Carlo 95% confidence interval, shown by a blue band, is [0.931, 0.941] and does not include 0.95. Notice that the left side of the "zipper" tends to contain intervals that are shorter than those on the right side. This indicates that samples whose means are "too low" tend to produce shorter confidence intervals than the samples whose means are "too high."

The zipper plot enables you to compare the results of two simulations that generate data from different distributions. The zipper plot enables you to visualize the fact that the nominal 95% CIs do, in fact, have empirical coverage of about 95% for normal data, whereas the intervals have about 93.6% empirical coverage for the exponential data.

If you want to experiment with the zipper plot or use it for your own simulation studies in SAS, you can download the SAS program that generates the graphs in this article.

Differences from the "zip plot" of Morris et al.

There are some minor differences between the zipper plot in this article and the graph in Morris et al. (2017, p. 22).

  • Morris et al., use the term "zip plot." However, statisticians use "ZIP" for the zero-inflated Poisson distribution, so I prefer the term "zipper plot."
  • Morris et al. bin the p-values into 100 centiles and graph the CIs against the centiles. This has the advantage of being able to plot the CI's from thousands or millions of simulations in a graph that uses only a few hundred vertical pixels. In contrast, I plot the CI's for each fractional rank, which is the rank divided by the number of simulations. In the SAS program, I indicate how to compute and use the centiles.
  • Morris et al. plot all the centiles. Consequently, the interesting portion of the graph occupies only about 5-10% of the total graph area. In contrast, I display only the CIs whose fractional rank is less than some cutoff value, which is 0.2 in this article. Thus my zipper plots are a "zoomed in" version of the ones that appear in Morris et al. In the SAS program, you can set the FractionMaxDisplay macro variable to 1.0 to show all the centiles.

The post A zipper plot for visualizing coverage probability in simulation studies appeared first on The DO Loop.

2月 122018

Did you know that SAS can combine or "merge" a symbol and a line pattern into a single legend item, as shown below? This kind of legend is useful when you are overlaying a group of curves onto a scatter plot. It enables the reader to quickly associate values of a categorical variable with colors, line patterns, and marker symbols in a plot.

Merged legend that combines symbols and line patterns in SAS

Legend items for marker/curve overlays

When you use PROC SGPLOT and the GROUP= option to create a graph, the SGPLOT procedure automatically displays the group attributes (such a symbol, color, and line pattern) in a legend. If you overlay multiple plot types (such as a series plot on a scatter plot) the default behavior is to create a legend for the first plot statement. You can use the KEYLEGEND statement to control which plots contribute to the legend. In the following example, the KEYLEGEND statement creates a legend that shows the attributes for the scatter plot (the marker shapes) and also the series plot (line patterns):

data ScatCurve;    /* example data: scatter plot and curves */
call streaminit(1);
do Group = 1 to 2;
   do x = 0 to 5 by 0.2;
      Pred = Group + (1/Group)*x - (0.2*x)**2;
      y = Pred + rand("Normal",0,0.5);
ods graphics / antialias=on subpixel=on;
title "Legend Not Merged";
proc sgplot data=ScatCurve;
   scatter x=x y=y / group=Group name="obs" markerattrs=(symbol=CircleFilled);
   series x=x y=Pred / group=Group name="curves"; 
   keylegend "obs" "curves" / title="Group"; /* separate items for markers and lines */
Legend that shows symbols and line patterns separately

The legend contains all the relevant information about symbols, colors, and line patterns, but it is longer than it needs to be. When you display curves and markers for the same groups, you can obtain a more compact representation by merging the symbols and line patterns into a single legend, as shown in the next sections.

The MERGEDLEGEND statement in GTL

If you are comfortable with the Graph Template Language (GTL) in SAS, then you can use the MERGEDLEGEND statement to create a merged legend. The following statements create a graph template that overlays a scatter plot and a series plot and creates a merged legend in which each item contains both a symbol and a line pattern:

proc template;
  define statgraph ScatCurveLegend;
  dynamic _X _Y _Curve _Group _Title;       /* dynamic variables */
  entrytitle _Title;  
   layout overlay; 
     scatterplot x=_X y=_Y / group=_Group name="obs" markerattrs=(symbol=CircleFilled);
     seriesplot x=_X y=_Curve / group=_Group name="curves";
     /* specify exactly two names for the MERGEDLEGEND statement */
     mergedlegend "obs" "curves" / border=true title=_Group;
proc sgrender data=ScatCurve template=ScatCurveLegend;
   dynamic _Title="Overlay Curves, Merged Legend"
           _X="x" _Y="y" _Curve="Pred" _Group="Group";
Merged legend that combines symbols and line patterns in SAS

The graph is the same as before, but the legend is different. The MERGEDLEGEND statement is used in many graphs that are created by SAS regression procedures. As you can see, the legend shows the symbol and line pattern for each group.


Unfortunately, the MERGEDLEGEND statement is not supported by PROC SGPLOT. However, SAS 9.4M5 supports the LEGENDITEM statement in PROC SGPLOT. The LEGENDITEM statement enables you to construct a custom legend and gives you complete control over every item in a legend. The following example constructs a legend that uses the symbols, line patterns, and group values that are present in the graph. Notice that you have to specify these attributes manually, as shown in the following example:

title "Merged Legend by Using LEGENDITEM in SGPLOT";
proc sgplot data=ScatCurve;
   scatter x=x y=y / group=Group markeratters=(symbol=CircleFilled);
   series x=x y=Pred / group=Group; 
   legenditem type=markerline name="I1" /
              label="1" lineattrs=GraphData1 markerattrs=GraphData1(symbol=CircleFilled);
   legenditem type=markerline name="I2" /
              label="2" lineattrs=GraphData2 markerattrs=GraphData2(symbol=CircleFilled);
   keylegend "I1" "I2" / title="Group"; 

The graph and legend are identical to the previous graph and are not shown.

The advantage of the LEGENDITEM statement is that you can layout the legend however you choose; the legend is not tied to the attributes in any previous graph component. However, this is also a disadvantage. If you change the marker attributes in the SCATTER statement, the legend will not reflect that change until you manually modify each LEGENDITEM statement. Although there is no denying the power of the LEGENDITEM statement, the MERGEDLEGEND statement in the GTL always faithfully and automatically reflects the attributes in the SCATTERPLOT and SERIESPLOT statements.

In summary, the SG procedures in SAS automatically create a legend. When you overlay multiple plots, you can use the KEYLEGEND statement to control which plots contribute to the legend. However, it is also possible to merge the symbols and line patterns into a single compact legend. In the GTL, you can use the MERGEDLEGEND statement. In SAS 9.4M5, PROC SGPLOT supports the LEGENDITEM statement to customize the items that appear in a legend.

The post Merged legends: Overlay a symbol and line in a legend item appeared first on The DO Loop.

1月 312018

This article shows how to construct a "stacked band plot" in SAS, as shown to the right. (Click to enlarge.) You are probably familiar with a stacked bar chart in which the cumulative amount of some quantity is displayed by stacking the contributions of several groups. A canonical example is a bar chart of revenue for a company in which the revenues for several regions (North America, Europe, Asia, and so forth) are stacked to show the total revenue. If you plot the revenue for several years, you get a stacked bar chart that shows how the total revenue changes over time, as well as the changes in the regional revenues. A stacked band plot is similar, but it is used when the horizontal variable is continuous rather than discrete.

Create a stacked bar chart in SAS

A few years ago Sanjay and I each wrote about ways to create a stacked bar chart in SAS by using the SGPLOT procedure in SAS. Recently I created a stacked bar chart that spanned about 20 years of data. When I looked at it, I realized that as the number of categories (years) grows, the bars become a nuisance. The graph becomes cluttered with vertical bars that are not necessary to show the underlying trends in the data. At that point, it is better to switch to a continuous scale and use a stacked band plot.

This section shows how to create a stacked bar chart by using PROC SGPLOT. The next section shows how to create a stacked band chart.

For data, I decided to use the amount of carbon dioxide (CO2, in million metric tons) that is produced by several sources from 1973–2016. These data were plotted as a stacked bar chart by my friend Robert Allison. The data are originally in "wide form," with five variables that contain the data. For a stacked bar chart, you should convert the data to "long form," as follows. You can download the complete SAS program that creates the graphs in this article.

/* 1. Original data is in "wide form": CO2 emission for five sources for each year. */
data CO2;
informat Transportation Commercial Residential Industrial Electric COMMA5.;
input Year Transportation Commercial Residential Industrial Electric;
2016	1,879	235	308	928	1,821
2015	1,845	240	321	942	1,913
2014	1,821	233	347	955	2,050
2013	1,803	223	333	952	2,050
   ... more lines ...   
/* 2. Convert from wide to long format */
proc sort data=CO2 out=Wide;
   by Year;                     /* sort X categories */
proc transpose data=Wide 
   out=Long(rename=(Col1=Value))/* name of new column that contains values */
   name=Source;                 /* name of new column that contains categories */
   by Year;                     /* original X var */
   var Transportation Commercial Residential Industrial Electric; /* original Y vars */

For data in the long form, you can use the VBAR statement and the GROUPDISPLAY=STACK option in PROC SGPLOT to create a stacked bar chart, as follows:

/* 3. create a stacked bar chart */
ods graphics / subpixel=on;
title "U.S. Carbon Dioxide Emissions from Energy Consumption";
title2 "Stacked Bar Chart";
proc sgplot data=Long;
   vbar Year / response=Value group=Source groupdisplay=stack grouporder=data;
   xaxis type=linear thresholdmin=0;   /* use TYPE=LINEAR so not every bar is labeled */ 
   yaxis grid values=(0 to 6000 by 1000);
   label Source = "Source of CO2" Value = "CO2 (mmt)";

The graph is not terrible, but it can be improved. The vertical lines are a distraction. With 43 years of data, the horizontal axis begins to lose its discrete nature. All those bars and the gaps between bars are cluttering the display. The next section shows how to create two new variables and use those variables to create a stacked band plot.

Create a stacked band plot

To create a stacked bar chart with PROC SGPLOT, you must create two new variables. The cumValue variable will contain the cumulative amount of CO2 per year as you accumulate the amount for each group (Transportation, Commercial, Residential, and so forth). This will serve as the upper boundary of each band. The Previous variable will contain the lower boundary of each band. You can use the LAG function in Base SAS to easily obtain the previous cumulative value: Previous = lag(cumValue). When you create these variables, you need to initialize the value for the first group (Transportation). Use the BY YEAR statement and the FIRST.YEAR indicator variable to detect the beginning of each new value of the Year variable, as follows:

/* 4. Accumulate the contributions of each group in the 'cumValue' variable.
      Add the lag(cumValue) variable and call it 'Previous'. 
      Create a band plot for each group with LOWER=Previous and UPPER=cumValue.
data Energy;   
   set Long;
   by Year;
   if first.Year then cumValue=0;   /* for each year, initialize cumulative amount to 0 */
   cumValue + Value;
   Previous = lag(cumValue);
   if first.Year then Previous=0;   /* for each year, initialize baseline value */
   label Source = "Source of CO2" cumValue = "CO2 (mmt)";
title2 "Stacked Band Plot";
proc sgplot data=Energy;
   band x=Year lower=Previous upper=cumValue / group=Source;
   xaxis display=(nolabel) thresholdmin=0; 
   yaxis grid;
   keylegend / position=right sortorder=reverseauto; /* SAS 9.4M5: reverse legend order */

The graph appears at the top of this article. The vertical lines are gone. The height of a band shows the amount of CO2 that is produced by the corresponding source. The top of the highest band (Electric) shows the cumulative amount of CO2 that is produced by all sources. This kind of display makes it easy to see the cumulative total. If you want to see the trends for each individual source, a time series plot would be a better choice.

The graph uses a new feature of SAS 9.4M5. The KEYLEGEND statement now supports the SORTORDER=REVERSEAUTO option, which reverses the order of the legend elements. This option makes the legend match the bottom-to-top progression of the stacked bar chart. I also used the POSTITION= option to move the legend to the right side so that the legend is next to the color bands.

Label the stacked bands directly

If you have very thin bands, a legend is probably the best way to associate colors with groups. However, for these data the bands are wide enough that you might want to display the name of each group on the bands. I tried several label positions (left, center, and right) and decided to display the group name near the left side of the graph. Since many people scan a graph from left to right, this causes the reader to see the labels early in the scanning process.

The following DATA step computes the positions for the labels. I use 1979 as the horizontal position and use the midpoint of the band as the vertical position. After you compute the positions for the labels, you can concatenate the data and the labels and use the TEXT statement to overlay the labels on the band plot, as follows:

data Labels;
   set Energy;
   where Year = 1979;  /* position labels at 1979 */
   Label = Source;
   XPos = Year;
   YPos = (cumValue + Previous) / 2;
   keep Label XPos YPos;
data EnergyLabels;
   set Energy Labels;
title2 "Stacked Band Plot";
proc sgplot data=EnergyLabels noautolegend;
band x=Year lower=Previous upper=cumValue / group=Source;
refline 1000 to 6000 by 1000 / axis=y lineattrs=GraphGridLines transparency=0.75;
text x=XPos y=Ypos text=Label;
xaxis display=(nolabel) values=(1973, 1980 to 2010 by 10, 2016)
      offsetmin=0 offsetmax=0;
yaxis grid values=(0 to 6000 by 1000) label="CO2 (mmt)"; 


In summary, you can use PROC SGPLOT to create a stacked band plot in SAS. A stacked band plot is similar to a stacked bar chart but presumes that the positions of the bars represent a continuous variable on a linear scale. For this example, the bars represented years. You can let PROC SGPLOT automatically place the legend along the bottom, but if you have SAS 9.4M5 you might want to move the legend to the right and use the SORTORDER=REVERSEAUTO option to reverse the legend order. Alternatively, you can display labels on the bands directly if there is sufficient room.

The post Create a stacked band plot in SAS appeared first on The DO Loop.

1月 172018

Money magazine (Jan/Feb 2018) contains an article about how much it costs to give birth in the US. The costs, which are based on insurance data, include prenatal care and hospital delivery but exclude infant care. The data are compiled for each state (including Washington, DC) and by type of delivery (vaginal versus cesarean section). The data includes the average and median costs for each state.

The online version of the article contains a map and a table of average costs, colored by the quintiles of the costs. Because I think that median costs are more relevant, I decided to create a visualization of the distribution of the median costs. Additionally, I want to visualize the incremental cost of a C-section over a vaginal delivery. According to the CDC, about 32% of deliveries are C-sections in the US. Cesarean delivery is major surgery and often requires an additional two days of hospital recovery in addition to operating-room charges.

With a little sleuthing, I was able to locate the data and download it into a SAS data set. You can download the data and SAS program that creates the graphs in this article.

Cost Distribution and incremental cost of a cesarean delivery

Median cost of vaginal and cesarean delivery by US state (2016-2017)

The adjacent bar chart (click to enlarge) shows the distribution of the median costs of childbirth in the US. Since the median cost of a cesarian delivery is always more than the median cost of a vaginal delivery, I overlaid the two graphs. The states are ordered by the median cost of a vaginal delivery. The data shows that the states of Alabama, Rhode Island, Nebraska, Louisiana, and Utah are the least expensive states for vaginal delivery. The median cost is about $5000 in those states. The most expensive states include Alaska, New Jersey, New York, Wisconsin, and Massachusetts. The median cost is more than $8000 for those states, with Alaska topping out at $14,500.

If you are more interested in the cost of a cesarean delivery, I created a similar graph sorted by the cost of a C-section. No matter how you sort it, the graph indicates that a C-section costs about $2500 to $3500 more than a vaginal delivery. In Washington, DC, the incremental cost is about $1100, which is relatively low. In Vermont and Alaska, the incremental cost is more than $4000, which is relatively high.

The Money magazine map of the data does not reveal any unexpected regional trends. Costs are high in Alaska and New England. Costs are low in some southern states.

Comparing delivery types by using a scatter plot

For a more sophisticated audience, you can use a scatter plot to plot the costs for vaginal and cesarean delivery in each state. A plot of the median costs is shown below. A regression line to these data has a slope of 1.2, which indicates that, on average, the median cost of a C-section is about 20% more than for a vaginal delivery. This visualization also enables you to see that Alaska is an extreme outlier for both types of delivery.

Median cost of vaginal and cesarean delivery by US state (2016-2017)

The Money article about these data points out two facts that cannot be seen in the data. First, it says that women "who have no insurance... are usually charged a higher amount than the negotiated rate." Second, US women "pay more to have a baby than residents of any other country. The highest prices in the U.S. were more than double those of the second-most expensive country, Switzerland" (emphasis added)." For a comparison of different countries, see Parents magazine (Jan 2017).

What interesting facts do you notice about these data? Leave a comment.

The post How much does it cost to give birth in the US? appeared first on The DO Loop.

1月 102018

Last week I wrote about the 10 most popular articles from The DO Loop in 2017. My most popular articles tend to be about elementary statistics or SAS programming tips. Less popular are the articles about advanced statistical and programming techniques. However, these technical articles fill an important niche. Not everyone needs to know how to interpret a diffogram that visually compares the differences between means between groups, but those who do often send me a note of thanks, which motivates me to research and write similar articles.

Statistical programmers might want to review the following technical articles from 2017. This ain't summertime, and the reading ain't easy, but I think these articles are worth the effort. I've broken them into three categories. Enjoy!

Statistical Concepts

Visualization of regression that uses a weight variable in SAS

Statistical Data Analysis and Visualization

Visualize multivariate regression model by slicing the continuous variables. Graph created by using the EFFECTPLOT SLICEFIT statement in SAS.

Advanced SAS Programming Techniques

If you are searching for a way to enhance your SAS knowledge in this New Year, I think these 10 articles from The DO Loop are worth a second look. Was there an article from 2017 that you found particularly useful? Leave a comment.

The post 10 posts from 2017 that deserve a second look appeared first on The DO Loop.

1月 082018
Label multiple regression lines in a graph in SAS by using PROC SGPLOT

A SAS programmer asked how to label multiple regression lines that are overlaid on a single scatter plot. Specifically, he asked to label the curves that are produced by using the REG statement with the GROUP= option in PROC SGPLOT. He wanted the labels to be the slope and intercept of a linear regression line, as shown to the right. (Click to enlarge.)

Initially I thought that you could use the CURVELABEL option on the REG statement to generate labels, as follows:

proc sgplot data=sashelp.iris noautolegend;
   reg x=SepalLength y=PetalLength / group=Species CURVELABEL; /* does NOT work */

However, the SAS log displays the following warning:

WARNING: CURVELABEL not supported for fit plots when a group variable is
         used.  The option will be ignored.

Fortunately, I thought of two other ways to create a graph that has a regression line for each group level, each with its own label. For linear regression, you can use the LINEPARM statement, as shown in the article "Add a diagonal line to a scatter plot." For general (possibly nonlinear) regression curves, you can find the location of the end of the curve and use the TEXT statement in PROC SGPLOT to add a label at that location.

Label the regression line for each group: The LINEPARM statement

Let's use Fisher's Iris data set for our example data. The Iris data contains 50 observations for each of three species of flowers: iris Setosa, iris Versicolor, and iris Virginica. The programmer wants to label the regression line for each species by using the slope and intercept of the line. The first step is to create a SAS data set that contains the intercept and slope for each curve. You can use the OUTEST= option in PROC REG to write the parameter estimates (intercept and slope) to a SAS data set. You can then use the CATX function in the DATA step to construct the labels, as follows:

proc sort data=sashelp.iris out=iris;
   by Species;
/* compute parameter estimates */
proc reg data=iris outest=PE noprint;
   by Species;
   model PetalLength = SepalLength;
/* construct labels from the parameter estimates */
data Labels;
   length Label $30;
   set PE(rename=(SepalLength=Slope));                   /* independent variable */
   Label = catx(" ", put(Intercept, BestD5.), '+',       /* separate by blank */
                     put(Slope, BestD5.), '* SepalLength');
   keep Label Species Intercept Slope;
proc print noobs; run;

The LABELS data set contains a label for the regression line in each group. You can use other labels if you prefer. The following DATA step combines the labels with the original data. The SCATTER statement in PROC SGPLOT displays the data. The LINEPARM statement draws the lines and adds labels to the end of each line.

data Plot;
   set iris Labels;
title "Regression Lines Labeled with Slope and Intercept";
proc sgplot data=Plot;
  scatter x=SepalLength y=PetalLength / group=Species;
  lineparm x=0 y=Intercept slope=Slope / group=Label curvelabel 
               curvelabelloc=outside clip;
Label multiple regression lines in a graph in SAS by using PROC SGPLOT

Success! The regression line for each group is labeled by the formula for the line. For more information about displaying the formula for a regression line, see the SAS/STAT example "Adding Equations and Special Characters to Fit Plots."

Label the regression line for each group: The TEXT statement

The preceding method uses the LINEPARM statement, so it only works for lines. However, the user actually wanted to use the REG statement. With a little work, you can label curves that are produced by the REG statement or other curve-fitting statements. The idea is to obtain the data coordinates for the end of the curve, which will become the location of the label.

You might be thinking, if the curve is produced by a regression statement in PROC SGPLOT, how can we get the data coordinates out of the plot and into a data set? The answer is simple: You can use the ODS OUTPUT statement to write a data set that contains the data in any ODS graph. You can apply this trick to any ODS graph, including graphs created by SGPLOT. The following call to PROC SGPLOT uses an ODS OUTPUT statement to create a SAS data set that contains the data in the regression plot:

/* use ODS OUTPUT to find data coordinates of end of lines */
proc sgplot data=iris;
   ods output sgplot=RegPlot;    /* name of ODS table is 'sgplot' */
   reg x=SepalLength y=PetalLength / group=Species;
proc contents short varnum; run; /* find names used by graph */

The variable names that are automatically manufactured by SAS procedures can be long and unwieldy, as shown by the call to PROC CONTENTS. I usually rename the long names to simpler names such as X, Y, GROUP, and so on. You should look at the structure of the REGPLOT data set so that the next DATA step makes sense. The DATA step saves only the last coordinates along the curve for each group (species).

data Coords;
set RegPlot(
    where=(x ^=.));
by Group;
if last.Group;
keep x y Group;
proc print noobs; run;

The COORDS data contains the location (in data coordinates) of the end of each regression line. You can overlay labels at these coordinates to label the curves. From the preceding section, the labels are in the LABELS data set, so you can merge the two data sets, as follows:

/* combine the positions and labels with original data */
data A;
merge Labels Coords(rename=(Group=Species));
by Species;
data Plot;
set iris A;
/* optional: pad label with blanks on the left (if length is long enough) */
Label = "   " || Label;  
proc sgplot data=Plot;
  reg x=SepalLength y=PetalLength / group=Species;
  text x=x y=y text=Label / position=right;

The graph is shown at the top of this article.

Label multiple regression curves

If you study the previous section, you will see that the code does not rely on the linearity of the regression model. The same code works for polynomial regression and nonparametric regression curves such as are created by the LOESS and PBSPLINE statements in PROC SGPLOT. The following graph shows a PBSPLINE fit to the IRIS data. Because the penalized B-spline curve is nonparametric, there is no equation to display as a label. Instead, I use the Species name as a label and suppress the legend at the bottom of the graph. You can download the SAS program that creates this and all the graphs in this article.

Label multiple regression curves in a graph in SAS by using PROC SGPLOT

The post Label multiple regression lines in SAS appeared first on The DO Loop.