Statistical Graphics

7月 022018
 
Who was the greatest US president? Presidental greatness scores through 2018

Which president of the United States is ranked the greatest by presidential historians? This article visualizes the results of the 2018 Presidential Greatness Survey, which was created and administered by B. Rottinghaus and J. Vaughn. They analyzed 166 responses from experts in political science who ranked the 44 US presidents on a 0–100 "greatness scale" where 0 represents abject failure, 50 is average, and 100 is great. The survey results are presented in their 2018 report.

Table 1 of the report (p. 2-3) provides the average greatness scores of the presidents among all respondents. Just as "beauty is in the eye of the beholder," so too "greatness" is a subjective notion that depends on the values and beliefs of the assessor. Table 2A (p. 6-7) provides the average greatness scores for respondents by the political party and political ideology of the respondents. Of the respondents, 57.2% were Democrats, 30.1% were Independent or Other, and 12.7% were Republicans. Similarly, 58.4% self-identified as liberal or somewhat liberal, 24.1% self-identified as moderate, and 17.5% reported being conservative or somewhat conservative. The relatively large proportion of liberals should be considered when interpreting the survey results.

I imported the survey results into SAS and combined them to produce a visualization of the data. The graph at the right (click to enlarge) shows the presidents ranked by their overall mean greatness score, which is shown by an X. The mean scores for the conservative, moderate, and liberal respondents are shown as filled circles. The party of the president is indicated by a colored bar that shows the range of scores.

A few results are apparent from the graph:

  • Lincoln, Washington, FDR, Teddy Roosevelt, and Jefferson are the five greatest presidents according to these experts.
  • Obama and Reagan are two recent presidents who are ranked highly.
  • Statistically speaking, there are large groups of presidents for which the relative ranking is uncertain. For example, the presidents ranked 12–20 (Madison through Monroe) have mean scores that are similar.
  • The current president, Donald Trump, is ranked last. In fairness, Trump had not yet completed his first year in office at the time of the survey (December 2017). The authors of the report dedicate several pages (p. 10–13) analyzing a survey question about who was the Most Polarizing president, a category that Trump won handily.
  • Other presidents who scored low include James Buchanan (the "Do-Nothing President"), William Henry Harrison (who died 31 days into his term), and Franklin Pierce (a weak leader in a contentious pre-Civil War era).

You can also graph the data by using the political parties of the experts rather than their ideologies. The graph is very similar.

Bias and disagreement in ranking presidents

One of the most interesting aspects of these data is the relative widths of the ranges of most 20th and 21st century presidents. These wide intervals are caused when experts of one political ideology judge a president much differently than experts from another ideology. Typically, the conservative-liberal disparity amounts to 10–15 points on the "greatness" scale, but for several presidents (Coolidge and Obama) the average conservative score is more than 20 points different from the average liberal score.

This disparity is evidence of the partisan state of politics in the US. Even these experts, who presumably use historical facts rather than opinions to guide their ranking, are not immune from seeing current events through the lens of their own personal values and biases.

To better visualize the range of disagreement among the experts, I graphed the difference between the conservative and liberal scores for each president. Again, the color of the line indicates the political party of the president. The 20th century began with McKinley's presidency. For almost every president since McKinley, the gap for Democratic presidents is negative, indicating that the conservative experts judged the presidency less favorably than the liberal experts. For Republican presidents, the direction of the gap is reversed: the conservative experts ranked the presidency more favorably than their liberal counterparts. The exception is Teddy Roosevelt, whose progressive agenda (the "Square Deal" and conservation measures such as national parks and forests) appeals to liberal experts.

Summary

Rottinghaus and Vaughn's 2018 Presidential Greatness Survey analyzes the opinions of 166 experts as to which US presidents are considered great, average, or a failure. The first graph in this blog post shows that although there is general agreement among experts for many 18th and 19th century presidents (Lincoln and Washington were great; Pierce, W. H. Harrison, and Buchanan were not.), there is less agreement for modern presidents. As seen in the second graph, the variation tends to correlate along ideological grounds, with conservatives and liberals having very different views about which presidents were great.

You can download the SAS program that creates the graphs in this article. For a different visualization of these data, see Robert Lawrence's article "All the U.S. Presidents, Ranked by 'Greatness'."

The post Ranking US presidents appeared first on The DO Loop.

6月 132018
 

If you use PROC SGPLOT to create ODS graphics, "ATTRS" are everywhere. ATTRS is an abbreviation of "attributes." Most options that change the attributes of a graphical element end with the ATTRS suffix. For example, the MARKERATTRS option modifies attributes of markers, the LINEATTRS option modifies attributes of lines, and the FILLATTRS option modifies attributes of the filled area of a bar or other region. These options are easy to remember and use.

However, there are three "ATTRS" that you might find more confusing: CYCLEATTRS, ATTRPRIORITY, and STYLEATTRS. These options determine the colors, line patterns, and marker symbols that are used to represent groups in your data. They interact with each other when you use a GROUP= option to specify groups and also use multiple statements to overlay several graph types such as scatter plots and series plots.

This article summarizes the ATTRPRIORITY, CYCLEATTRS, and STYLEATTRS keywords and provides an example that shows how they interact with each other. The end of this article presents a list of references for further reading.

What are ATTRPRIORITY, CYCLEATTRS, and STYLEATTRS?

In PROC SGPLOT, many statements support the GROUP= option. In order to distinguish one group from another, the groups are assigned different attributes such as colors, line patterns, marker symbols, and so on. The colors, patterns, and other attributes are defined by the current ODS style (or by the STYLEATTRS statement or the DATTRMAP= data set), but the ATTRPRIORITY and CYCLEATTRS keywords determine how the colors, patterns, and symbols combine for each group. The following describes the syntax and purpose of each keyword:

  • The ATTRPRIORITY option is specified on the ODS GRAPHICS statement. If you specify ATTRPRIORITY=COLOR, then groups are represented by changing the color of markers and lines, but not line patterns or marker symbols. If you specify ATTRPRIORITY=NONE, groups are represented by changing the color, line patterns, and marker symbols for each group. The ATTRPRIORITY=NONE is essential for "no color" styles such as the Journal style.
  • The CYCLEATTRS (or NOCYCLEATTERS) option affect whether attributes change when you overlay multiple graphs by using two or more statements. Notice that the word "CYCLE" begins with an "S" sound, as does the word "statements." Say the following sentence out loud, emphasizing the common "S" sounds: "CYCLEATTRS affect Statements."
  • The STYLEATTRS statement in PROC SGPLOT overrides the colors, patterns, and symbols in the current style. You can use the STYLEATTRS statement to set the combination of attributes that appear for each group.

Before looking at the interaction between these keywords, recall how the ATTRPRIORITY= option affects a graph that has only one statement. The following calls to PROC SGPLOT are identical. The only difference is the ATTRPRIORITY= option.

title "Iris Data: AttrPriority=Color";
ods graphics / AttrPriority=Color;
proc sgplot data=Sashelp.Iris;
   reg x=PetalWidth y=SepalLength / group=Species markerattrs=(size=10) lineattrs=(thickness=4);
run;
 
title "Iris Data: AttrPriority=None";
ods graphics / AttrPriority=None;
proc sgplot data=Sashelp.Iris;
   reg x=PetalWidth y=SepalLength / group=Species markerattrs=(size=10) lineattrs=(thickness=4);
run;
ATTRPRIORITY=COLOR and ATTRPRIORITY=NONE options in ODS graphics in SAS PROC SGPLOT

The plot on the left shows the result of using ATTRPRIORITY=COLOR with the HtmlBlue style. The three groups are represented by using the first three colors in the style (in this case, blue, red, and green), but the line patterns and marker symbols do not change between groups. In contrast, the plot on the left is the result of using ATTRPRIORITY=NONE. The three groups are represented by using the first three colors in the style and also the first three line patterns (solid, dashed, and dash-dot) and the first three marker symbols (circle, plus sign, and "X"). The results might depend on the current ODS style.

The interaction between ATTRPRIORITY and CYCLEATTRS

The ATTRPRIORITY and CYCLEATTRS options interact when you use two or more statements to overlay graphs. In Warren Kuhfeld's short course on "Advanced ODS Graphics Examples in SAS," he presents a slide that shows how these options interact. The following example creates a panel of graphs that is inspired by Warren's slide. Each graph in the example shows the same data (the closing stock price for two companies in 2001–2003) displayed for each of the four combinations of the ATTRPRIORITY and CYCLEATTRS options. Although the LOESS statement can display the markers, I used a separate SCATTER statement because I want to illustrate how the CYCLEATTRS statement affects the attributes.

data Stocks;
set Sashelp.Stocks;
where '01JAN2001'd <= Date <= '31OCT2003'd and Stock in ("IBM", "Microsoft");
run;
 
/* The code for each call is the same except for the options. Wrap in a macro for brevity. */
%macro GraphIt(priority, cycle);
title "AttrPriority=&priority.; &cycle.";
ods graphics / AttrPriority=&priority.;
proc sgplot data=Stocks &cycle.;
   loess   x=Date y=Close / group=Stock lineattrs=(thickness=4) NoMarkers smooth=0.75;
   scatter x=Date y=Close / group=Stock markerattrs=(size=10);
   yaxis label="Stock Price";
run;
%mend;
 
ods graphics / width=400px height=300px;
ods layout gridded columns=2 advance=table column_gutter=0 row_gutter=0;
%GraphIt(Color, NoCycleAttrs);
%GraphIt(None,  NoCycleAttrs);
%GraphIt(Color, CycleAttrs);
%GraphIt(None,  CycleAttrs);
ods layout end;
Interaction between the ATTRPRIORITY=  and CYCLEATTRS options in ODS graphics in SAS PROC SGPLOT

The panel for the example shows the interaction between the ATTRPRIORITY= and CYCLEATTRS options.

  • The top-left graph shows the graph for ATTRPRIORITY=COLOR and NOCYCLEATTRS. Only colors are used to distinguish the groups (IBM and Microsoft). The second statement (SCATTER) uses the same attributes as the first statement (LOESS).
  • The top-right graph shows the graph for ATTRPRIORITY=NONE and NOCYCLEATTRS. Colors, line patterns, and symbols are used to distinguish the groups. The second statement (SCATTER) uses the same attributes as the first statement (LOESS).
  • The bottom-left graph shows the graph for ATTRPRIORITY=COLOR and CYCLEATTRS. Only colors are used to distinguish the groups (IBM and Microsoft). The first statement (LOESS) uses the first two colors in the style (blue and red) and the second statement (SCATTER) uses the third and fourth colors in the style (green and brown). Because the markers are not filled, it might be difficult to see that the marker colors are green and brown, but you can click on the graph to enlarge it.
  • The bottom-right graph shows the graph for ATTRPRIORITY=NONE and CYCLEATTRS. Colors, line patterns, and symbols are used to distinguish the groups. The first statement (LOESS) uses the first two attributes in the style (blue-solid and red-dashed) whereas the second statement (SCATTER) uses the third and fourth attributes (green-X and brown-triangle).

Depending on your needs, either graph in the first row would be appropriate for color graphs. The lower-right plot is most suitable for ODS styles that create monochrome graphs.

The STYLEATTRS statement

After you understand the interaction between the ATTRPRIORITY= and CYCLEATTRS options, it is straightforward to use the STYLEATTRS statement. The statement merely overrides the group attributes for the current style. If you specify fewer attributes than the number of groups, the attributes are cyclically reused. For example, the following STYLEATTRS statement specifies that all lines should be solid but specifies different colors and symbols for the groups. Because two attributes are specified, the second statement reuses the same colors and symbols even though the NOCYCLEATTRS option is set:

title "AttrPriority=None; NoCycleAttrs; StyleAttrs";
ods graphics / AttrPriority=None;
proc sgplot data=Stocks NoCycleAttrs;
   styleattrs datalinepatterns=(solid)
              datacontrastcolors=(SteelBlue DarkGreen)
              datasymbols=(CircleFilled TriangleFilled);
   loess   x=Date y=Close / group=Stock lineattrs=(thickness=4) NoMarkers smooth=0.75;
   scatter x=Date y=Close / group=Stock markerattrs=(size=10);
   yaxis label="Stock Price";
run;

Reset the default attribute priority

Here's another trick I learned from Warren's course: You can use the RESET= option to reset the default value of the ATTRPRIORITY option for the current style:

ods graphics / reset=attrpriority; /* reset the default value for the current style */

Summary and further reading

This article provides an example and summarizes the behavior of the ATTRPRIORITY= option (on the ODS GRAPHICS statement) and the CYCLEATTRS option (on the PROC SGPLOT statement). These options interact when you have several statements in PROC SGPLOT that each use the GROUP= option. You can use the STYLEATTRS statement to override the default attributes for the current ODS style.

Much more can be said about these options and how they interact. In addition to the documentation links in the article, I recommend the following:

The post Attrs, attrs, everywhere: The interaction between ATTRPRIORITY, CYCLEATTRS, and STYLEATTRS in ODS graphics appeared first on The DO Loop.

5月 312018
 

A previous article showed how to use a calibration plot to visualize the goodness-of-fit for a logistic regression model. It is common to overlay a scatter plot of the binary response on a predicted probability plot (below, left) and on a calibration plot (below, right):

The SAS program that creates these plots is shown in the previous article. Notice that the markers at Y=0 and Y=1 are displayed by using a scatter plot. Although the sample size is only 500 observations, the scatter plot for the binary response suffers from overplotting. For larger data sets, the scatter plot might appear as two solid lines of markers, which does not provide any insight into the distribution of the horizontal variable. You can plot partially transparent markers, but that does not help the situation much. A better visualization is to eliminate the scatter plot and instead use a binary fringe plot (also called a butterfly fringe plot) to indicate the horizontal positions for each observed response.

A predicted probability plot with binary fringe

A predicted probability plot with a binary fringe plot for logistic regression

A predicted probability plot with a binary fringe plot is shown to the right. Rather than use the same graph area to display the predicted probabilities and the observed responses, a small "butterfly fringe plot" is shown in a panel below the predicted probabilities. The lower panel indicates the counts of the responses by using lines of various lengths. Lines that point down indicate the number of counts for Y=0 whereas lines that point up indicate the counts for Y=1. For these data, the X values less than 1 have many downward-pointing lines whereas the X values greater than 1 have many upward-pointing lines.

To create this plot in SAS, you can do the following:

  1. Use PROC LOGISTIC to output the predicted probabilities and confidence limits for a logistic regression of Y on a continuous explanatory variable X.
  2. Compute the min and max of the continuous explanatory variable.
  3. Use PROC UNIVARIATE to count the number of X values in each of 100 bins in the range [min, max] for Y=0 and Y=1.
  4. Merge the counts with the predicted probabilities.
  5. Define a GTL template to define a panel plot. The main panel shows the predicted probabilities and the lower panel shows the binary fringe plot.
  6. Use PROC SGRENDER to display the panel.

You can download the SAS program that defines the GTL template and creates the predicted probability plot.

A calibration plot with binary fringe

A calibration plot with a binary fringe plot for logistic regression

By using similar steps, you can create a calibration plot with a binary fringe plot as shown to the right. The main panel is used for the calibration plot and a small binary fringe plot is shown in a panel below it. The lower panel shows the counts of the responses at various positions. Note that the horizontal variable is the predicted probability from the model whereas the vertical variable is the empirical probability as estimated by the LOESS procedure. For these simulated data, the fringe plot shows that most of the predicted probabilities are less than 0.2 and these small values mostly correspond to Y=0. The observations for Y=1 mostly have predicted probabilities that are greater than 0.5. The fringe plot reveals that about 77% of the observed responses are Y=0, a fact that was not apparent in the original plots that used a scatter plot to visualize the response variable.

To create this plot in SAS, you can do the following:

  1. Use PROC LOGISTIC to output the predicted probabilities for any logistic regression.
  2. Use PROC LOESS to regress Y onto the predicted probability. This estimates the empirical probability for each value of the predicted probability.
  3. Use PROC UNIVARIATE to count the number of predicted probabilities for each of 100 bins in the range [0, 1] for Y=0 and Y=1.
  4. Merge the counts with the predicted probabilities.
  5. Define a GTL template to define a panel plot. The main panel shows the calibration plot and the lower panel shows the binary fringe plot.
  6. Use PROC SGRENDER to display the panel.

You can download the SAS program that defines the GTL template and creates the calibration plot.

For both plots, the frequencies of the responses are shown by using "needles," but you can make a small change to the GTL to make the fringe plot use thin bars so that it looks more like a butterfly plot of two histograms. See the program for details.

What do you think of this plot? Do you like the way that the binary fringe plot visualizes the response variable, or do you prefer the classic plot that uses a scatter plot to show the positions of Y=0 and Y=1? Leave a comment.

The post Use a fringe plot to visualize binary data in logistic models appeared first on The DO Loop.

5月 232018
 
Butterfly plot of cholesterol by gender in SAS

This article shows how to construct a butterfly plot in SAS. A butterfly plot (also called a butterfly chart) is a comparative bar chart or histogram that displays the distribution of a variable for two subpopulations. A butterfly plot for the cholesterol readings of 5,057 patients in a medical study is shown to the right, where the distribution for the males is shown on the left side of the plot and the distribution for females is displayed on the right. (Click to enlarge.) The main contribution of this article is showing how to bin a continuous variable in SAS to form a butterfly chart.

The butterfly plot is similar to a comparative histogram because both enable you to compare the distribution of a continuous variable for subpopulations. The comparative histogram uses a panel to visualize several subpopulations in a panel, where each row represents a level of a classification variable. In contrast, the butterfly plot is limited to two levels and displays the distributions back-to-back.

Bin a continuous variable for each classification level

In a previous blog post, I constructed a butterfly chart that compares voice versus text usage by decade of age for cell phones. Similarly, there is a SAS Sample that shows how to create a butterfly chart for types of cancers by gender. In both of these examples, the butterfly chart is a comparative bar chart because the distribution shown is for a discrete variable ("decade of age" or "type of cancer"). This section shows how to start with a continuous variable (cholesterol) and bin it into intervals. You can then use the previous techniques to visualize the counts in each interval for each gender.

You can bin a continuous variable by using the BIN and TABULATE functions in SAS/IML or by using the OUTHIST= option on the HISTOGRAM statement in PROC UNIVARIATE. The following statements create an output data set (OutHist) that contains the counts of males and females that have cholesterol reading within bins of width 20. The bin width and the centers of the intervals are chosen automatically by PROC UNIVARIATE, but you can use the MIDPOINTS= option (shown in the comments) to control the placement of the intervals.

proc univariate data=Sashelp.Heart;
   class Sex;
   var Cholesterol;
   histogram cholesterol / nrows=2 outhist=OutHist 
                          /* midpoints=(80 to 560 by 40) */ /* control bin widths and locations */
                           odstitle="Cholesterol by Gender";
   ods select histogram;
run;
Comparative histogram in SAS: Cholesterol by Gender

Create a butterfly plot in SAS

The OutHist data set is in "long form." You need to convert it to "wide form" in order to construct a butterfly plot. The SAS code performs the following tasks:

  • Use a DATA step and WHERE clauses to convert the data from long to wide format.
  • Multiply the counts for the males by -1. The negative counts will be plotted on the left side of the butterfly plot.
  • Define a format that will display the absolute values of the counts. The axis for the formatted variables contains zero in the middle and increases in both directions.
  • Use the HBAR statements to plot the back-to-back bar charts for males and females.
/* convert data from long format to wide format */
data Butterfly;
   keep Cholesterol Males Females;
   label Males= Females= Cholesterol=; /* remove labels */
   merge OutHist(where=(sex="Female") rename=(_COUNT_=Females _MIDPT_=Cholesterol))
         OutHist(where=(sex="Male")   rename=(_COUNT_=Males   _MIDPT_=Cholesterol));
   by Cholesterol;
   Males = -Males;                     /* trick: reverse the direction of male counts */
run;
 
/* define format that displays the absolute value of a number */
proc format;
   picture positive low-<0="000,000"
   0<-high="000,000";
run;
 
ods graphics / reset;
title "Butterfly Plot of Cholesterol Counts By Gender";
proc sgplot data=Butterfly;
   format Males Females positive.;
   hbar Cholesterol / response=Males   legendlabel="Males";
   hbar Cholesterol / response=Females legendlabel="Females";
   xaxis label="Count" grid 
         min=-520 max=520 values=(-500 to 500 by 100) valueshint;
   yaxis label="Cholesterol" discreteorder=data;
run;

The graph is shown at the top of this article. You can see that the mode of the distribution is higher for males, and the distribution for males also has a longer tail.

A butterfly fringe plot

The butterfly plot is usually displayed with horizontal bars, as shown. However, you could use the VBAR statement to get a rotated version of the butterfly plot. As mentioned earlier, you can also use the MIDPOINTS= option on the HISTOGRAM statement to change the width of the histogram bins.

One useful variation in the butterfly plot is to use a very small bin width and replace the bar chart with a high-low plot. This creates a graph that I call a butterfly fringe plot. Recall that the usual fringe plot (also called a "rug plot") places a tick mark on an axis to show the distribution of data values. A fringe plot can suffer from overplotting when more than one observation has the same value. With a butterfly fringe plot, some ticks are higher than others (to represent repeated values) and bars that point up represent one binary value whereas bars that point down represent the other. The butterfly fringe plot provides a more complete visualization of the distribution of data for two levels of a response or classification variable.

ods graphics / width=640px height=180;
title "Butterfly Fringe Plot: Cholesterol Counts By Gender";
proc sgplot data=Butterfly;
   format Males Females positive.;
   highlow x=Cholesterol low=Males high=Females;
   refline 0 / axis=y;
   inset "Males"   / position=BottomLeft;
   inset "Females" / position=TopLeft;
   yaxis label="Count" grid;
   xaxis label="Cholesterol" discreteorder=data;
run;
Butterfly fringe plot of cholesterol by gender in SAS

The butterfly fringe plot was created by using a bin width of 5 for the cholesterol variable. That is, midpoints=(50 to 560 by 5).

In summary, this article shows how to create a butterfly plot for a continuous variable and a binary classification variable. When you bin the continuous variable, you obtain counts for each interval. You can then graph the counts back-to-back to form the butterfly chart. An interesting variation is the butterfly fringe plot, which combines a butterfly chart and a fringe plot.

The post A butterfly plot for comparing distributions appeared first on The DO Loop.

5月 162018
 

In my article about how to construct calibration plots for logistic regression models in SAS, I mentioned that there are several popular variations of the calibration plot. The previous article showed how to construct a loess-based calibration curve. Austin and Steyerberg (2013) recommend the loess-based curve on the basis of an extensive simulation study. However, some practitioners prefer to use a decile calibration plot. This article shows how to construct a decile-based calibration curve in SAS.

The decile calibration plot

The decile calibration plot is a graphical analog of the Hosmer-Lemeshow goodness-of-fit test for logistic regression models. The subjects are divided into 10 groups by using the deciles of the predicted probability of the fitted logistic model. Within each group, you compute the mean predicted probability and the mean of the empirical binary response. A calibration plot is a scatter plot of these 10 ordered pairs, although most calibration plots also include the 95% confidence interval for the proportion of the binary responses within each group. Many calibration plots connect the 10 ordered pairs with piecewise line segments, others use a loess curve or a least squares line to smooth the points.

Create the decile calibration plot in SAS

The previous article simulated 500 observations from a logistic regression model logit(p) = b0 + b1*x + b2*x2 where x ~ U(-3, 3). The following call to PROC LOGISTIC fits a linear model to these simulated data. That is, the model is intentionally misspecified. A call to PROC RANK creates a new variable (Decile) that identifies the deciles of the predicted probabilities for the model. This variable is used to compute the means of the predicted probabilities and the empirical proportions (and 95% confidence intervals) for each decile:

/* Use PROC LOGISTIC and output the predicted probabilities.
   Intentionally MISSPECIFY the model as linear. */
proc logistic data=LogiSim noprint;
   model Y(event='1') = x;
   output out=LogiOut predicted=PredProb; /* save predicted probabilities in data set */
run;
 
/* To construct the decile calibration plot, identify deciles of the predicted prob. */
proc rank data=LogiOut out=LogiDecile groups=10;
   var PredProb;
   ranks Decile;
run;
 
/* Then compute the mean predicted prob and the empirical proportions (and CI) for each decile */
proc means data=LogiDecile noprint;
   class Decile;
   types Decile;
   var y PredProb;
   output out=LogiDecileOut mean=yMean PredProbMean
          lclm=yLower uclm=yUpper;
run;
 
title "Calibration Plot for Misspecified Model";
title2 "True Model Is Quadratic; Fit Is Linear";
proc sgplot data=LogiDecileOut noautolegend aspect=1;
   lineparm x=0 y=0 slope=1 / lineattrs=(color=grey pattern=dash);
   *loess x=PredProbMean y=yMean;  /* if you want a smoother based on deciles */
   series x=PredProbMean y=yMean;  /* if you to connect the deciles */
   scatter x=PredProbMean y=yMean / yerrorlower=yLower yerrorupper=yUpper;
   yaxis label="Observed Probability of Outcome";
   xaxis label="Predicted Probability of Outcome";
run;
Decile calibration curve for a misspecified logistic regression model

The diagonal line is the line of perfect calibration. In a well-calibrated model, the 10 markers should lie close to the diagonal line. For this example, the graph indicates that the linear model does not fit the data well. For the first decile of the predicted probability (the lowest predicted-risk group), the observed probability of the event is much higher than the mean predicted probability. For the fourth, sixth, and seventh deciles, the observed probability is much lower than the mean predicted probability. For the tenth decile (the highest predicted-risk group), the observed probability is higher than predicted. By the way, this kind of calibration is sometimes called internal calibration because the same observations are used to fit and assess the model.

The decile calibration plot for a correctly specified model

You can fit a quadratic model to the data to see how the calibration plot changes for a correctly specified model. The results are shown below. In this graph, all markers are close to the diagonal line, which indicates a very close agreement between the predicted and observed probabilities of the event.

Decile calibration curve for a correctly specified logistic regression model

Should you use the decile calibration curve?

The decile-based calibration plot is popular, perhaps because it is so simple that it can be constructed by hand. Nevertheless, Austin and Steyerberg (2013) suggest using the loess-based calibration plot instead of the decile-based plot. Reasons include the following:

  • The use of deciles results in estimates that "display greater variability than is evident in the loess-based method" (p. 524).
  • Several researchers have argued that the use of 10 deciles is arbitrary. Why not use five? Or 15? In fact, the results of the Hosmer-Lemeshow test "can depend markedly on the number of groups, and there's no theory to guide the choice of that number." (P. Allison, 2013. "Why I Don’t Trust the Hosmer-Lemeshow Test for Logistic Regression")

Many leading researchers in logistic regression do not recommend the Hosmer-Lemeshow test for these reasons. The decile-based calibration curve shares the same drawbacks. Since SAS can easily create the loess-based calibration curve (see the previous article), there seems to be little reason to prefer the decile-based version.

The post Decile calibration plots in SAS appeared first on The DO Loop.

5月 142018
 

A logistic regression model is a way to predict the probability of a binary response based on values of explanatory variables. It is important to be able to assess the accuracy of a predictive model. This article shows how to construct a calibration plot in SAS. A calibration plot is a goodness-of-fit diagnostic graph. It enables you to qualitatively compare a model's predicted probability of an event to the empirical probability.

Calibration curves

There are two standard ways to assess the accuracy of a predictive model for a binary response: discrimination and calibration. To assess discrimination, you can use the ROC curve. Discrimination involves counting the number of true positives, false positive, true negatives, and false negatives at various threshold values. In contrast, calibration curves compare the predicted probability of the response to the empirical probability. Calibration can help diagnose lack of fit. It also helps to rank subjects according to risk.

This article discusses internal calibration, which is the agreement on the sample used to fit the model. External calibration, which involves comparing the predicted and observed probabilities on a sample that was not used to fit the model, is not discussed.

In the literature, there are two types of calibration plots. One uses a smooth curve to compare the predicted and empirical probabilities. The curve is usually a loess curve, but sometimes a linear regression curve is used. The other (to be discussed in a future article) splits the data into deciles. An extensive simulation study by Austin and Steyerberg (2013) concluded that loess-based calibration curves have several advantages. This article discusses how to use a loess fit to construct a calibration curve.

A calibration curve for a misspecified model

Calibration curves help to diagnose lack of fit. This example simulates data according to a quadratic model but create a linear fit to that data. The calibration curve should indicate that the model is misspecified.

Step 1: Simulate data for a quadratic logistic regression model

Austin and Steyerberg (2013) use the following simulated data, which is based on the earlier work of Hosmer et al. (1997). Simulate X ~ U(-3, 3) and define η = b0 + b1*X + b2*X2) to be a linear predictor, where b0 = -2.337, b1 = 0.8569, b2 = 0.3011. The logistic model is logit(p) = η, where p = Pr(Y=1 | x) is the probability of the binary event Y=1. The following SAS DATA step simulates N=500 observations from this logistic regression model:

/* simulate sample of size N=500 that follows a quadratic model from Hosmer et al. (1997) */
%let N = 500;                              /* sample size */ 
data LogiSim(keep=Y x);
call streaminit(1234);                     /* set the random number seed */
do i = 1 to &N;                            /* for each observation in the sample */
   /* Hosmer et al. chose x ~ U(-3,3). Use rand("Uniform",-3,3) for modern versions of SAS */
   x = -3 + 6*rand("Uniform");
   eta = -2.337 + 0.8569*x + 0.3011*x**2;  /* quadratic model used by Hosmer et al. */
   mu = logistic(eta);
   Y = rand("Bernoulli", mu);
   output;
end;
run;

For this simulated data, the "true" model is known to be quadratic. Of course, in practice, the true underlying model is unknown.

Step 2: Fit a logistic model

The next step is to fit a logistic regression model and save the predicted probabilities. The following call to PROC LOGISTIC intentionally fits a linear model. The calibration plot will indicate that the model is misspecified.

/* Use PROC LOGISTIC and output the predicted probabilities.
   Intentionally MISSPECIFY the model as linear. */
proc logistic data=LogiSim plots=effect;
   model Y(event='1') = x;
   output out=LogiOut predicted=PredProb; /* save predicted probabilities in data set */
run;
Plot of predicted probabilities for a logistic regression model in SAS

As shown above, PROC LOGISTIC can automatically create a fit plot for a regression that involves one continuous variable. The fit plot shows the observed responses, which are plotted at Y=0 (failure) or Y=1 (success). The predicted probabilities are shown as a sigmoidal curve. The predicted probabilities are in the range (0, 0.8). The next section creates a calibration plot, which is a graph of the predicted probability versus the observed response.

Create the calibration plot

The output from PROC LOGISTIC includes the predicted probabilities and the observed responses. Copas (1983), Harrell et al. (1984), and others have suggested assessing calibration by plotting a scatter plot smoother and overlaying a diagonal line that represents the line of perfect calibration. If the smoother lies close to the diagonal, the model is well calibrated. If there is a systematic deviation from the diagonal line, that indicates that the model might be misspecified.

The following statements sort the data by the predicted probabilities and create the calibration plot. (The sort is optional for the loess smoother, but is useful for other analyses.)

proc sort data=LogiOut;  by PredProb;  run;
 
/* let the data choose the smoothing parameter */
title "Calibration Plot for Misspecified Model";
title2 "True Model Is Quadratic; Fit Is Linear";
proc sgplot data=LogiOut noautolegend aspect=1;
   loess x=PredProb y=y / interpolation=cubic clm;   /* smoothing value by AICC (0.657) */
   lineparm x=0 y=0 slope=1 / lineattrs=(color=grey pattern=dash);
run;
Calibration plot for a misspecified logistic model

The loess smoother does not lie close to the diagonal line, which indicates that the linear model is not a good fit for the quadratic data. The loess curve is systematically above the diagonal line for very small and very large values of the predicted probability. This indicates that the empirical probability (the proportion of events) is higher than predicted on these intervals. In contrast, when the predicted probability is in the range [0.1, 0.45], the empirical probability is lower than predicted.

Several variations of the calibration plot are possible:

  • Add the NOMARKERS option to the LOESS statement to suppress the observed responses.
  • Omit the CLM option if you do not want to see a confidence band.
  • Use the SMOOTH=0.75 option if you want to specify the value for the loess smoothing parameter. The value 0.75 is the value that Austin and Steyerberg used in their simulation study, in part because that is the default parameter value in the R programming language.
  • Use a REG statement instead of the LOESS statement if you want a polynomial smoother.

A calibration plot for a correctly specified model

In the previous section, the model was intentionally misspecified. The calibration curve for the example did not lie close to the diagonal "perfect calibration" line, which indicates a lack of fit. Let's see how the calibration curve looks for the same data if you include a quadratic term in the model. First, generate the predicted probabilities for the quadratic model, then create the calibration curve for the model:

proc logistic data=LogiSim noprint;
   model Y(event='1') = x x*x;  /* fit a quadratic model */
   output out=LogiOut2 predicted=PredProb;
run;
 
proc sort data=LogiOut2;  by PredProb;  run;
 
title "Calibration Plot for a Correct Model";
proc sgplot data=LogiOut2 noautolegend aspect=1;
   loess x=PredProb y=y / interpolation=cubic clm nomarkers;
   lineparm x=0 y=0 slope=1 / lineattrs=(color=grey pattern=dash);
   yaxis grid; xaxis grid;
run;
Calibration plot for a correctly specified logistic model

This calibration plot indicates that the quadratic model fits the data well. The calibration curve is close to the diagonal reference line, which is the line of perfect calibration. The curve indicates that the predicted and empirical probabilities are close to each other for low-risk, moderate-risk, and high-risk subjects.

It is acceptable for the calibration curve to have small dips, bends, and wiggles. The extent that the curve wiggles is determined by the (automatically chosen) smoothing parameter for the loess algorithm, The value of the smoothing parameter is not displayed by PROC SGPLOT, but you can use PROC LOESS to display the smoothing parameter and other statistics for the loess smoother.

According to Austin and Steyerberg (2013), the calibration curve can detect misspecified models when the magnitude of the omitted term (or terms) is moderate to strong. The calibration curve also behaves better for large samples and when the incidence of outcomes is close to 50% (as opposed to having rare outcomes). Lastly, there is a connection between discrimination and calibration: The calibration curve is most effective in models for which the discrimination (as measured by the C-statistic or area under the ROC curve) is good.

In summary, this article shows how to construct a loess-based calibration curve for logistic regression models in SAS. To create a calibration curve, use PROC LOGISTIC to output the predicted probabilities for the model and plot a loess curve that regresses the observed responses onto the predicted probabilities.

My next article shows how to construct a popular alternative to the loess-based calibration curve. In the alternative version, the subjects are ranked into deciles according to the predicted probability or risk.

Further reading

The post Calibration plots in SAS appeared first on The DO Loop.

5月 022018
 

Order matters. When you create a graph that has a categorical axis (such as a bar chart), it is important to consider the order in which the categories appear. Most software defaults to alphabetical order, which typically gives no insight into how the categories relate to each other.

Alphabetical order rarely adds any insight to the data visualization. For a one-dimensional graph (such as a bar chart) software often provides alternative orderings. For example, in SAS you can use the CATEGORYORDER= option to sort categories of a bar chart by frequency. PROC SGPLOT also supports the DISCRETEORDER= option on the XAXIS and YAXIS statements, which you can use to specify the order of the categories.

It is much harder to choose the order of variables for a two-dimensional display such as a heat map (for example, of a correlation matrix) or a matrix of scatter plots. Both of these displays are often improved by strategically choosing an order for the variables so that adjacent levels have similar properties. Catherine Hurley ("Clustering Visualizations of Multidimensional Data", JCGS, 2004) discusses several ways to choose a particular order for p variables from among the p! possible permutations. This article implements the single-link clustering technique for ordering variables. The same technique is applicable for ordering levels of a nominal categorical variable.

Order variables in two-dimensional displays

I first implemented the single-link clustering method in 2008, which was before I started writing this blog. Last week I remembered my decade-old SAS/IML program and decided to blog about it. This article discusses how to use the program; see Hurley's paper for details about how the program works.

To illustrate ordering a set of variables, the following program creates a heat map of the correlation matrix for variables from the Sashelp.Cars data set. The variables are characteristics of motor vehicles. The rows and columns of the matrix display the variables in alphabetical order. The (i,j)th cell of the heat map visualizes the correlation between the i_th and j_th variable:

proc iml;
/* create correlation matric for variables in alphabetical order */
varNames = {'Cylinders' 'EngineSize' 'Horsepower' 'Invoice' 'Length' 
            'MPG_City' 'MPG_Highway' 'MSRP' 'Weight' 'Wheelbase'};
use Sashelp.Cars;  read all var varNames into Y;  close;
corr = corr(Y);
 
call HeatmapCont(corr) xvalues=varNames yvalues=varNames 
            colorramp=colors range={-1.01 1.01} 
            title="Correlation Matrix: Variables in Alphabetical Order";
Heat map of correlation matrix; variables in alphabetical order

The heat map shows that the fuel economy variables (MPG_City and MPG_Highway) are negatively correlated with most other variables. The size of the vehicle and the power of the engine are positively correlated with each other. The correlations with the price variables (Invoice and MSRP) are small.

Although this 10 x 10 heat map visualizes all pairwise correlations, it is possible to permute the variables so that highly correlated variables are adjacent to each other. The single-link cluster method rearranges the variables by using a symmetric "merit matrix." The merit matrix can be a correlation matrix, a distance matrix, or some other matrix that indicates how the i_th category is related to the j_th category.

The following statements assume that the SingleLinkFunction is stored in a module library. The statements load and call the SingleLinkCluster function. The "merit matrix" is the correlation matrix so the algorithm tries to put strongly positively correlated variables next to each other. The function returns a permutation of the indices. The permutation induces a new order on the variables. The correlation matrix for the new order is then displayed:

/* load the module for single-link clustering */
load module=(SingleLinkCluster);
permutation = SingleLinkCluster( corr );  /* permutation of 1:p */
V = varNames[permutation];                /* new order for variables */
R = corr[permutation, permutation];       /* new order for matrix */
call HeatmapCont(R) xvalues=V yvalues=V 
            colorramp=colors range={-1.01 1.01} 
            title="Correlation: Variables in Clustered Order";
Heat map of correlation matrix; variables in clustered order

In this permuted matrix, the variables are ordered according to their pairwise correlation. For this example, the single-link cluster algorithm groups together the three "size" variables (Length, Wheelbase, and Weight), the three "power" variables (EngineSize, Cylinders, and Horsepower), the two "price" variables (MSRP and Invoice), and the two "fuel economy" variables (MPG_Highway and MPG_City). In this representation, strong positive correlations tend to be along the diagonal, as evidenced by the dark green colors near the matrix diagonal.

The same order is useful if you want to construct a huge scatter plot matrix for these variables. Pairs of variables that are "interesting" tend to appear near the diagonal. In this case, the definition of "interesting" is "highly correlated," but you could choose some other statistic to build the "merit matrix" that is used to cluster the variables. The scatter plot matrix is shown below. Click to enlarge.

Scatter plot matrix of 10 variables. The highly correlated variable tend to appear together so that diagonal scatter plot exhibit high correlation.

Other merit matrices

I remembered Hurley's article while I was working on an article about homophily in married couples. Recall that homophily is the tendency for similar individuals to associate, or in this case to marry partners that have similar academic interests. My heat map used the data order for the disciplines, but I wondered whether the order could be improved by using single-link clustering. The data in the heat map are not symmetric, presumably because women and men using different criteria to choose their partners. However, you can construct the "average homophily matrix" in a natural way. If A is any square matrix then (A` + A)/2 is a symmetric matrix that averages the lower- and upper-triangular elements of A. The following image shows the "average homophily" matrix with the college majors ordered by clusters:

The heat map shows the "averaged data," but in a real study you would probably want to display the real data. You can see that green cells and red cells tend to occur in blocks and that most of the dark green cells are close to the diagonal. In particular, the order of adjacent majors on an axis gives information about affinity. For example, here are a few of the adjacent categories:

  • Biology and Medical/Health
  • Engineering tech, computers, and engineering
  • Interdisciplinary studies, philosophy, and theology/religion
  • Math/statistics and physical sciences
  • Communications and business

Summary

In conclusion, when you create a graph that has a nominal categorical axis, you should consider whether the graph can be improved by ordering the categories. The default order, which is often alphabetical, makes it easy to locate a particular category, but there are other orders that use the data to reveal additional structure. This article uses the single-link cluster method, but Hurley (2004) mentions several other methods.

The post Order variables in a heat map or scatter plot matrix appeared first on The DO Loop.

4月 232018
 

You've probably heard about the "80-20 Rule," which describes many natural and manmade phenomena. This rule is sometimes called the "Pareto Principle" because it was discovered by Vilfredo Pareto (1848–1923) who used it to describe the unequal distribution of wealth. Specifically, in his study, 80% of the wealth was held by 20% of the population. The unequal distribution of effort, resources, and other quantities can often be described in terms of the Pareto distribution, which has a parameter that controls the inequity. Whereas some data seem to obey an 80-20 rule, other data are better described as following a "70-30 rule" or a "60-40 rule," and so on. (Although the two numbers do not need to add to 100, it makes the statement pleasingly symmetric.)

I thought about the Pareto distribution recently when I was looking at some statistics for the SAS blogs. I found myself wondering whether 80% of the traffic to a blog is generated by 20% of its posts. (Definition: A blog is a collection of articles. Each individual article is a post.) As you can imagine, some posts are more popular than others. Some posts are timeless. They appear in internet searches when someone searches for a particular statistical or programming technique, even though they were published long ago. Other posts (such as Christmas-themed posts) do not get much traffic from internet searches. They generate traffic for a week or two and then fade into oblivion.

To better understand how various blogs at SAS follow the Pareto Principle, I downloaded data for seven blogs during a particular time period. I then kept only the top 100 posts for each blog.

The following line plot shows the pageviews for one blog. The horizontal axis indicates the posts for this blog, ranked by popularity. (The most popular post is 1, the next most popular post is 2, and so forth.) This blog has four very popular posts that each generated more than 7,000 pageviews during the time period. Another group of four posts were slightly less popular (between 4,000 and 5,000 pageviews). After those eight "blockbuster" posts are the rank-and-file posts that were viewed less than 3,000 times during the time period.

Long-tailed distribution of pageviews for a blog

The pageviews for the other blogs look similar. This distribution is also common in book-sales data: most books sell only a few thousand copies whereas the best-sellers (think John Grisham) sell hundreds of millions of copies. Movie revenue is another example that follows this distribution.

The distribution is a long-tailed distribution, a fact that becomes evident if you graph a histogram of the underlying quantity. For the blog, the following histogram shows the distribution of the pageviews for the top 100 posts:

Long-tailed distribution of pageviews for a blog

Notice the "power law" nature of the distribution for the first few bars of the histogram. The height of each bar is about 0.4 of the previous height. About 57% of the blog posts had less than 1,000 pageviews. Another 22% had between 1,000 and 2,000 pageviews. The number of rank-and-file posts in each category decreases like a power law, but then the blockbuster posts start to appear. These popular posts give the distribution a very long tail.

Because some blogs (like the one pictured) attract thousands of readers whereas other blogs have fewer readers, you need to standardize the data if you want to compare the distributions for several blogs. Recall that the Pareto Principle is a statement about cumulative percentages. The following graph shows the cumulative percentages of pageviews for seven blogs at SAS (based on the Top 100 posts):

Cumulative percentages of pageviews for seven blogs

The graph shows a dashed line with slope –1 that is overlaid on the cumulative percentage curves. The places where the dashed line intersects a curve are the "Pareto locations" for which Y% of the pageviews are generated by the first (1-Y)% most popular blog posts. In general, these blogs appear to satisfy a "70-30 rule" because about 70% of the pageviews are generated by the 30 / 100 most popular posts. There is some variation between blogs, with the upper curve following a "72-28 rule" and the lower curve satisfying a "63-37 rule."

All this is approximate and is based on only the top 100 posts for each blog. For a more rigorous analysis, you could use PROC UNIVARIATE or PROC NLMIXED to fit the parameters of the Pareto distribution to the data for each blog. However, I am happy to stop here. In general, blogs at SAS satisfy an approximate "70-30 rule," where 70% of the traffic is generated by the top 30% of the posts.

The post The 80-20 rule for blogs appeared first on The DO Loop.

3月 262018
 
Zipper plot for normally distributed data

Simulation studies are used for many purposes, one of which is to examine how distributional assumptions affect the coverage probability of a confidence interval. This article describes the "zipper plot," which enables you to compare the coverage probability of a confidence interval when the data do or do not follow certain distributional assumptions.

I saw the zipper plot in a tutorial about how to design, implement, and report the results of simulation studies (Morris, White, Crowther, 2017). An example of a zipper plot is shown to the right. The main difference between the zipper plot and the usual plot of confidence intervals is that the zipper plot sorts the intervals by their associated p-values, which causes the graph to look like a garment's zipper. The zipper plot in this blog post is slightly different from the one that is presented in Morris et al. (2017, p. 22), as discussed at the end of this article.

Coverage probabilities for confidence intervals

A formulas for a confidence interval (CI) is based on an assumption about the distribution of the parameter estimates on a simple random sample. For example, the two-sided 95% confidence interval for the mean of normally distributed data has upper and lower limits given by the formula
x̄ ± t* s / sqrt(n)
where x̄ is the sample mean, s is the sample standard deviation, n is the sample size, and t* is the 97.5th percentile of the t distribution with n – 1 degrees of freedom. You can confirm that the formula is a 95% CI by simulating many N(0,1) samples and computing the CI for each sample. About 95% of the samples should contain the population mean, which is 0.

The central limit theorem indicates that the formula is asymptotically valid for sufficiently large samples nonnormal data. However, for small nonnormal samples, the true coverage probability of those intervals might be different from 95%. Again, you can estimate the coverage probability by simulating many samples, constructing the intervals, and counting how many time the mean of the distribution is inside the confidence interval. A previous article shows how to use simulation to estimate the coverage probability of this formula.

Zipper plots and simulation studies of confidence intervals

Zipper plot that compares the empirical coverage probability of confidence intervals that assume normality. The data samples are size N=50 drawn from the normal or exponential distribution.

The graph to the right shows the zipper plot applied to the results of the two simulations in the previous article. The zipper plot on the left visualizes the estimate of the coverage probability 10,000 for normal samples of size n = 50. (The graph at the top of this article provides more detail.) The Monte Carlo estimate of the coverage probability of the CIs is 0.949, which is extremely close to the true coverage of 0.95. The Monte Carlo 95% confidence interval, shown by a blue band, is [0.945, 0.953], which includes 0.95. You can see that the width of the confidence intervals does not seem to depend on the sample mean: the samples whose means are "too low" tend to produce intervals that are about the same length as the intervals for the samples whose means are "too high."

The zipper plot on the right is for 10,000 random samples of size n = 50 that are drawn from an exponential distribution with mean 0 and unit scale parameter. (The graph at this link provides more detail.) The Monte Carlo estimate of the coverage probability is 0.9360. The Monte Carlo 95% confidence interval, shown by a blue band, is [0.931, 0.941] and does not include 0.95. Notice that the left side of the "zipper" tends to contain intervals that are shorter than those on the right side. This indicates that samples whose means are "too low" tend to produce shorter confidence intervals than the samples whose means are "too high."

The zipper plot enables you to compare the results of two simulations that generate data from different distributions. The zipper plot enables you to visualize the fact that the nominal 95% CIs do, in fact, have empirical coverage of about 95% for normal data, whereas the intervals have about 93.6% empirical coverage for the exponential data.

If you want to experiment with the zipper plot or use it for your own simulation studies in SAS, you can download the SAS program that generates the graphs in this article.

Differences from the "zip plot" of Morris et al.

There are some minor differences between the zipper plot in this article and the graph in Morris et al. (2017, p. 22).

  • Morris et al., use the term "zip plot." However, statisticians use "ZIP" for the zero-inflated Poisson distribution, so I prefer the term "zipper plot."
  • Morris et al. bin the p-values into 100 centiles and graph the CIs against the centiles. This has the advantage of being able to plot the CI's from thousands or millions of simulations in a graph that uses only a few hundred vertical pixels. In contrast, I plot the CI's for each fractional rank, which is the rank divided by the number of simulations. In the SAS program, I indicate how to compute and use the centiles.
  • Morris et al. plot all the centiles. Consequently, the interesting portion of the graph occupies only about 5-10% of the total graph area. In contrast, I display only the CIs whose fractional rank is less than some cutoff value, which is 0.2 in this article. Thus my zipper plots are a "zoomed in" version of the ones that appear in Morris et al. In the SAS program, you can set the FractionMaxDisplay macro variable to 1.0 to show all the centiles.

The post A zipper plot for visualizing coverage probability in simulation studies appeared first on The DO Loop.

2月 122018
 

Did you know that SAS can combine or "merge" a symbol and a line pattern into a single legend item, as shown below? This kind of legend is useful when you are overlaying a group of curves onto a scatter plot. It enables the reader to quickly associate values of a categorical variable with colors, line patterns, and marker symbols in a plot.

Merged legend that combines symbols and line patterns in SAS

Legend items for marker/curve overlays

When you use PROC SGPLOT and the GROUP= option to create a graph, the SGPLOT procedure automatically displays the group attributes (such a symbol, color, and line pattern) in a legend. If you overlay multiple plot types (such as a series plot on a scatter plot) the default behavior is to create a legend for the first plot statement. You can use the KEYLEGEND statement to control which plots contribute to the legend. In the following example, the KEYLEGEND statement creates a legend that shows the attributes for the scatter plot (the marker shapes) and also the series plot (line patterns):

data ScatCurve;    /* example data: scatter plot and curves */
call streaminit(1);
do Group = 1 to 2;
   do x = 0 to 5 by 0.2;
      Pred = Group + (1/Group)*x - (0.2*x)**2;
      y = Pred + rand("Normal",0,0.5);
      output;
   end;
end;
run;
 
ods graphics / antialias=on subpixel=on;
title "Legend Not Merged";
proc sgplot data=ScatCurve;
   scatter x=x y=y / group=Group name="obs" markerattrs=(symbol=CircleFilled);
   series x=x y=Pred / group=Group name="curves"; 
   keylegend "obs" "curves" / title="Group"; /* separate items for markers and lines */
run;
Legend that shows symbols and line patterns separately

The legend contains all the relevant information about symbols, colors, and line patterns, but it is longer than it needs to be. When you display curves and markers for the same groups, you can obtain a more compact representation by merging the symbols and line patterns into a single legend, as shown in the next sections.

The MERGEDLEGEND statement in GTL

If you are comfortable with the Graph Template Language (GTL) in SAS, then you can use the MERGEDLEGEND statement to create a merged legend. The following statements create a graph template that overlays a scatter plot and a series plot and creates a merged legend in which each item contains both a symbol and a line pattern:

proc template;
  define statgraph ScatCurveLegend;
  dynamic _X _Y _Curve _Group _Title;       /* dynamic variables */
  begingraph;
  entrytitle _Title;  
   layout overlay; 
     scatterplot x=_X y=_Y / group=_Group name="obs" markerattrs=(symbol=CircleFilled);
     seriesplot x=_X y=_Curve / group=_Group name="curves";
     /* specify exactly two names for the MERGEDLEGEND statement */
     mergedlegend "obs" "curves" / border=true title=_Group;
   endlayout;
  endgraph;
  end;
run;
 
proc sgrender data=ScatCurve template=ScatCurveLegend;
   dynamic _Title="Overlay Curves, Merged Legend"
           _X="x" _Y="y" _Curve="Pred" _Group="Group";
run;
Merged legend that combines symbols and line patterns in SAS

The graph is the same as before, but the legend is different. The MERGEDLEGEND statement is used in many graphs that are created by SAS regression procedures. As you can see, the legend shows the symbol and line pattern for each group.

The LEGENDITEM statement in PROC SGPLOT

Unfortunately, the MERGEDLEGEND statement is not supported by PROC SGPLOT. However, SAS 9.4M5 supports the LEGENDITEM statement in PROC SGPLOT. The LEGENDITEM statement enables you to construct a custom legend and gives you complete control over every item in a legend. The following example constructs a legend that uses the symbols, line patterns, and group values that are present in the graph. Notice that you have to specify these attributes manually, as shown in the following example:

title "Merged Legend by Using LEGENDITEM in SGPLOT";
proc sgplot data=ScatCurve;
   scatter x=x y=y / group=Group markeratters=(symbol=CircleFilled);
   series x=x y=Pred / group=Group; 
   legenditem type=markerline name="I1" /
              label="1" lineattrs=GraphData1 markerattrs=GraphData1(symbol=CircleFilled);
   legenditem type=markerline name="I2" /
              label="2" lineattrs=GraphData2 markerattrs=GraphData2(symbol=CircleFilled);
   keylegend "I1" "I2" / title="Group"; 
run;

The graph and legend are identical to the previous graph and are not shown.

The advantage of the LEGENDITEM statement is that you can layout the legend however you choose; the legend is not tied to the attributes in any previous graph component. However, this is also a disadvantage. If you change the marker attributes in the SCATTER statement, the legend will not reflect that change until you manually modify each LEGENDITEM statement. Although there is no denying the power of the LEGENDITEM statement, the MERGEDLEGEND statement in the GTL always faithfully and automatically reflects the attributes in the SCATTERPLOT and SERIESPLOT statements.

In summary, the SG procedures in SAS automatically create a legend. When you overlay multiple plots, you can use the KEYLEGEND statement to control which plots contribute to the legend. However, it is also possible to merge the symbols and line patterns into a single compact legend. In the GTL, you can use the MERGEDLEGEND statement. In SAS 9.4M5, PROC SGPLOT supports the LEGENDITEM statement to customize the items that appear in a legend.

The post Merged legends: Overlay a symbol and line in a legend item appeared first on The DO Loop.