Statistical Graphics

9月 102018
 

Given a rectangular grid with unit spacing, what is the expected distance between two random vertices, where distance is measured in the L1 metric? (Here "random" means "uniformly at random.") I recently needed this answer for some small grids, such as the one to the right, which is a 7 x 6 grid. The graph shows that the L1 distance between the points (2,6) and (5,2) is 7, the length of the shortest path that connects the two points. The L1 metric is sometimes called the "city block" or "taxicab" metric because it measures the distance along the grid instead of "as the bird flies," which is the Euclidean distance.

The answer to the analogous question for the continuous case (a solid rectangle) is difficult to compute. The main result is that the expected distance is less than half of the diameter of the rectangle. In particular, among all rectangles of a given area, a square is the rectangle that minimizes the expected distance between random points.

Although I don't know a formula for the expected distance on a discrete regular grid, the grids in my application were fairly small so this article shows how to compute all pairwise distances and explicitly find the average (expected) value. The DISTANCE function in SAS/IML makes the computation simple because it supports the L1 metric. It is also simple to perform computer experiments to show that among all grids that have N*M vertices, the grid that is closest to being square minimizes the expected L1 distance.

L1 distance on a regular grid

An N x M grid contains NM vertices. Therefore the matrix of pairwise distances is NM x NM. Without loss of generality, assume that the vertices have X coordinates between 1 and N and Y coordinates between 1 and M. Then the following SAS/IML function defines the distance matrix for vertices on an N x M grid. To demonstrate the computation, the distance matrix for a 4 x 4 grid is shown, along with the average distance between the 16 vertices:

proc iml;
start DistMat(rows, cols);
   s = expandgrid(1:rows, 1:cols);   /* create (x, y) ordered pairs */
   return distance(s, "CityBlock");  /* pairwise L1 distance matrix */
finish;
 
/* test function on a 4 x 4 grid */
D = DistMat(4, 4);
AvgDist = mean(colvec(D));
print D[L="L1 Distance Matrix for 4x4 Grid"];
print AvgDist[L="Avg Distance on 4x4 Grid"];
L1 Distance matrix for vertices on a regular 4 x 4 grid Average L1 distance on a 4 x 4 grid

For an N x M grid, the L1 diameter of the grid is the L1 distance between opposite corners. That distance is always (N-1)+(M-1), which equates to 6 units for a 4 x 4 grid. As for the continuous case, the expected L1 distance is less than half the diameter. In this case, E(L1 distance) = 2.5.

A comparison of distances for grids of different aspect ratios

As indicated previously, the expected distance between two random vertices on a grid depends on the aspect ratio of the grid. A grid that is nearly square has a smaller expected distance than a short-and-wide grid that contains the same number of vertices. You can illustrate this fact by computing the distance matrices for grids that each contain 36 vertices. The following computation computes the distances for five grids: a 1 x 36 grid, a 2 x 18 grid, a 3 x 12 grid, a 4 x 9 grid, and a 6 x 6 grid.

/* average L1 distance on 36 x 36 grid in several configurations */
N=36;
rows = {1, 2, 3, 4, 6};
cols = N / rows;
AvgDist = j(nrow(rows), 1, .);
do i = 1 to nrow(rows);
   D = DistMat(rows[i], cols[i]);
   AvgDist[i] = mean(colvec(D));
end;
/* show average distance as a decimal and as a fraction */
numer = AvgDist*108; AvgDistFract = char(numer) + " / 108"; 
print rows cols AvgDist AvgDistFract;

The table shows that short-and-wide tables have an average distance that is much greater than a nearly square grid that contains the same number of vertices. When the points are arranged in a 6 x 6 grid, the distance matrix naturally decomposes into a block matrix of 6 x 6 symmetric blocks, where each block corresponds to a row. When the points are arranged in a 3 x 12 grid, the distance matrix decomposes into a block matrix of 12 x 12 blocks The following heat maps visualize the patterns in the distance matrices:

D = DistMat(6, 6);
call heatmapcont(D) title="L1 Distance Between Vertices in a 6 x 6 Grid";
D = DistMat(3, 12);
call heatmapcont(D) title="L1 Distance Between Vertices in a 3 x 12 Grid";
Visualization of L1 distance matrix for items arranged on a 6 x 6 grid Visualization of L1 distance matrix for items arranged on a 3 x 12 grid

Expected (average) distance versus number of rows

You can demonstrate how the average distance depends on the number of rows by choosing the number of vertices to be a highly composite number such as 5,040. A highly composite number has many factors. The following computation computes the average distance between points on 28 grids that each contain 5,040 vertices. A line chart then displays the average distance as a function of the number of rows in the grid:

N = 5040;               /* 5040 is a highly composite number */
rows = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 15, 16, 18, 20, 
       21, 24, 28, 30, 35, 36, 40, 42, 45, 48, 56, 60, 63, 70};
cols = N / rows;
AvgDist = j(nrow(rows), 1, .);
do i = 1 to nrow(rows);
   D = DistMat(rows[i], cols[i]);
   AvgDist[i] = mean(colvec(D));
end;
 
title "Average L1 Distance Between 5040 Objects in a Regular Grid";
call series(rows, AvgDist) grid={x y} other="yaxis offsetmin=0 grid;";
Expected L1 distance between vertices on a regular grid with 5,040 points. The curve demonstraes that nearly square grids have a much lower average distance than short and wide grids.

Each point on the graph represents the average distance for an N x M grid where NM = 5,040. The horizontal axis displays the value of N (rows). The graph shows that nearly square grids (rows greater than 60) have a much lower average distance than very short and wide grids (rows less than 10). The scale of the graph makes it seem like there is very little difference between the average distance in a grid with 40 versus 70 rows, but that is somewhat of an illusion. A 40 x 126 grid (aspect ratio = 3.15) has an average distance of 55.3; a 70 x 72 grid (aspect ratio = 1.03) has an average distance of 47.3.

Applications and conclusions

In summary, you can use the DISTANCE function in SAS/IML to explicitly compute the expected L1 distance (the "city block" distance) between random points on a regular grid. You can minimize the average pairwise distance by making the grid as square as possible.

City planning provides a real-world application of the L1 distance. If you are tasked with designing a city with N buildings along a grid, then the average distance between buildings is smallest when the grid is square. Of course, in practice, some buildings have a higher probability of being visited than others (such as a school or a grocery). You should position those buildings in the center of town to shorten the average distance that people travel.

The post Distances on rectangular grids appeared first on The DO Loop.

8月 082018
 
Use the GROUP= option in PROC SGPLOT to plot lines for TWO categorical variables

The SGPLOT procedure in SAS makes it easy to create graphs that overlay various groups in the data. Many statements support the GROUP= option, which specifies that the graph should overlay group information. For example, you can create side-by-side bar charts and box plots, and you can overlay multiple scatter plots and series plots in the same graph. However, the GROUP= option takes only a single grouping variable! What can you do if you need to visualize combinations of two (or even three!) categorical variables? This article shows how to construct a new group variable that combines the levels of two or more existing categorical variables.

For concreteness, I will show how to overlay multiple series plots (line plots), as shown to the right. (Click to enlarge.) By using this technique, you can overlay curves that describe combinations of gender and race. Or you can plot response curves for control-vs-experimental groups and the severity of a disease. You can use this technique for other plot types, such as creating box plots that visualize a two-way ANOVA.

What is the problem? Why can't you use the GROUP= option?

To motivate the discussion, let's first see why the GROUP= option in the SERIES statement does not work for overlaying two categorical variables. The following DATA step creates two categorical variables. The G1 variable has the values 1, 2, and 3. The G2 variable has the values 'A' and 'B'. For each of the six combinations of (G1, G2), the DATA step creates a curve (actually a line) of (X, Y) values. Let's see what happens if you try to plot the lines by using a SERIES statement with the GROUP=G1 option:

data TwoGroups;
do G1=1 to 3;
   do G2='A', 'B';
      do X = 1 to 10;
         Y = 10 + G1 + 0.5*(G2='A') + G1*X/20;
         output;
      end;
   end;
end;
run;
 
title "First Attempt: Does Not Work";
proc sgplot data=TwoGroups;
   series x=x y=y / group=G1 curvelabel;
run;

The attempt is a failure. The graph does not show six individual curves. Instead, it shows three curves (the number of categories in the G1 variables) and each "curve" is Z-shaped because the graph traces the curve for G2='A' on the range [1, 10] and then draws the curve for G2='B' without "picking up the pen." This happens because the data are sorted by G1, then by G2, then by X.

There are two ways to handle this situation. One way is to try to convert the data from "long form" to "wide form." For these data, which have identical X values for every curve, you can create two new variables Y_A and Y_B that contain the coordinates of the three curves for G2='A' and G2='B', respectively. You can then use two SERIES statements, each using the GROUP=G1 option. Each statement will draw three curves for a total of six. You can use the NOCYCLEATTRS option to make sure that each statement uses the same line colors and patterns. However, this approach becomes complicated if each curve is evaluated at a different set of X values. In that case, it is better to keep the data in long form.

Forming a new group variable by concatenation

The problem would be solved if there were one categorical variable that had six levels instead of two categorical variables that have six joint levels, so let's write some SAS code to make that happen. First sort the data by the categorical variables and then by the X variable. (For the current example, the data are already sorted correctly.) Then write a DATA step that does either of the following options:

  • Option 1: Use the CATT (or CATX) function to concatenate the values of the existing group variables. The new categorical variable will have values that derived from the original variables.
  • Option 2: Use the FIRST.variable syntax to create a new group variable that has the values 1–6. Then use PROC FORMAT to assign a meaningful value to each level of the new categorical variable.

The first option (the CATT function) is automated and less prone to error, but for the sake of completeness both options are shown below:

/* create a new group variable by concatenating the two existing variables */
proc sort data=TwoGroups;
   by G1 G2 X;
run;
 
data Make2Groups;
set TwoGroups;
by G1 G2;
/* Option 1: automatically create joint levels from original levels */
Label = catt("G1=", G1) || "; " || catt("G2=", G2);
/* Option 2: create new categorical variable and use PROC FORMAT to assign values */
if first.G2 then 
   GroupID + 1;        /* GroupID = 1, 2, 3, ... numGroups */
run;
 
/* For Option 2: use PROC FORMAT to form labels that encode the two groups. See
https://blogs.sas.com/content/iml/2017/04/24/two-way-anova-viz.html
*/
proc format;                 
  value GroupFmt 1 = "G1=1; G2='A'"   2 = "G1=1; G2='B'"   3 = "G1=2; G2='A'"
                 4 = "G1=2; G2='B'"   5 = "G1=3; G2='A'"   6 = "G1=3; G2='B'";
run;

The following call to PROC SGPLOT creates a series plot for the Label variable, which corresponds to the joint levels of the original grouping variables. The GROUPLC= option (supported in SAS 9.4M2) colors the lines according to the value of the G2 variable.

title "Two Groups, One SERIES Statement";
proc sgplot data=Make2Groups;
/* format GroupID GroupFmt.; */            /* for Option 2 */
series x=x y=y / group=Label grouplc=G2    /* or group=GroupID for Option 2 */
                 lineattrs=(pattern=solid) curvelabel;
run;

The graph is shown at the top of this article. The new categorical variable has values that correspond to joint levels of the original two variables. The GROUP= option creates six curves when you specify the Label variable. The GROUPLC= option sets the colors of the lines according to the values of the G2 variable.

This technique generalizes to three categorical variables, but I will leave the details to the reader. You might want to use the GROUPLP= option to set the line patterns according to the value of a third categorical variable. Beyond three variables the display will begin to resemble a spaghetti plot. For many categorical variables, you might want to use panels and BY groups to visualize the curves.

This technique also generalizes to other plot types, such as box plots and scatter plots.

The post Plot curves for levels of two categorical variables in SAS appeared first on The DO Loop.

8月 062018
 
Graph of quantile regression curves in SAS

This article shows how to score (evaluate) a quantile regression model on new data. SAS supports several procedures for quantile regression, including the QUANTREG, QUANTSELECT, and HPQUANTSELECT procedures. The first two procedures do not support any of the modern methods for scoring regression models, so you must use the "missing value trick" to score the model. (HPQUANTSELECT supports the CODE statement for scoring.) You can use this technique to construct a "sliced fit plot" that visualizes the model, as shown to the right. (Click to enlarge.)

The easy way to create a fit plot

The following DATA step creates the example data as a subset of the Sashelp.BWeight data set, which contains information about the weights of live births in the US in 1997 and information about the mother during pregnancy. The following call to PROC QUANTREG models the conditional quantiles of the baby's weight as a function of the mother's weight gain. The weight gain is centered according to the formula MomWtGain = "Actual Gain" – 30 pounds. Because the quantiles might depend nonlinearly on the mother's weight gain, the EFFECT statement generates a spline basis for the independent variable. The resulting model can flexibly fit a wide range of shapes.

Although this article shows how to create a fit plot, you can also get a fit plot directly from PROC QUANTREG. As shown below, the PLOT=FITPLOT option creates a fit plot when the model contains one continuous independent variable.

data Orig;   /* restrict to 5000 births; exclude extreme weight gains */
set Sashelp.BWeight(obs=5000 where=(MomWtGain<=40));
run;
 
proc quantreg data=Orig
              algorithm=IPM            /* use IPM algorithm for splines and binned data */
              ci=none plot(maxpoints=none)=fitplot; /* or fitplot(nodata) */
   effect MomWtSpline = spline( MomWtGain / knotmethod = equal(9) );  /* 9 knots, equally spaced */
   model Weight = MomWtSpline / quantile = 0.1 0.25 0.5 0.75 0.90;
run;
Fit plot of quantile regression curves from PROC QUANTREG in SAS

The graph enables you to visualize curves that predict the 10th, 25th, 50th, 75th, and 90th percentiles of the baby's weight based on the mother's weight gain during pregnancy. Because the data contains 5,000 observations, the fit plot suffers from overplotting and the curves are hard to see. You can use the PLOT=FITPLOT(NODATA) option to exclude the data from the plot, thus showing the quantile curves more clearly.

Score a SAS procedure by using the missing value trick

Although PROC QUANTREG can produce a fit plot when there is one continuous regressor, it does not support the EFFECTPLOT statement so you have to create more complicated graphs manually. To create a graph that shows the predicted values, you need to score the model on a new set of independent values. To use the missing value trick, do the following:

  1. Create a SAS data set (the scoring data) that contains the values of the independent variables at which you want to evaluate the model. Set the response variable to missing for each observation.
  2. Append the scoring data to the original data that are used to fit the model. Include a binary indicator variable that has the value 0 for the original data and the value 1 for the scoring data.
  3. Run the regression procedure on the combined data set. Use the OUTPUT statement to output the predicted values for the scoring data. Of course, you can also output residuals and other observation-wise statistics, if necessary.

This general technique is implemented by using the following SAS statements. The scoring data consists of evenly spaced values of the MomWtGain variable. The binary indicator variable is named ScoreData. The result of these computations is a data set named QRegOut that contains a variable named Pred that contains the predicted values for each observation in the scoring data.

/* 1. Create score data set */
data Score;
/* Optionally define additional covariates here. See the example
   "Create a sliced fit plot manually by using the missing value trick"
   https://blogs.sas.com/content/iml/2017/12/20/create-sliced-fit-plot-sas.html
*/
Weight = .;               /* set response (Y) variable to missing */
do MomWtGain = -30 to 40; /* uniform spacing in the independent (X) variable */
   output;
end;
run;
 
/* 2. Append the score data to the original data. Use a binary variable to 
      indicate which observations are the scoring data */
data Combined;
set Orig                     /* original data */
    Score(in=_ScoreData);    /* scoring data  */
ScoreData = _ScoreData;      /* binary indicator variable. ScoreData=1 for scoring data */
run;
 
/* 3. Run the procedure on the combined (original + scoring) data */
ods select ModelInfo NObs;
proc quantreg data=Combined algorithm=IPM ci=none;
   effect MomWtSpline = spline( MomWtGain / knotmethod = equal(9) );
   model Weight = MomWtSpline / quantile = 0.1 0.25 0.5 0.75 0.90;
   output out=QRegOut(where=(ScoreData=1)) /* output predicted values for the scoring data */
              P=Pred / columnwise;         /* COLUMWISE option supports multiple quantiles */
run;

This technique can be used for any SAS regression procedure. In this case, the COLUMNWISE option specifies that the output data set should be written in "long form": A QUANTILE variable specifies the quantile and the variable PRED contains the predicted values for each quantile. If you omit the COLUMNWISE option, the output data is in "wide form": The predicted values for the five quantiles are contained in the variables Pred1, Pred2, ..., Pred5.

Using predicted values to create a sliced fit plot

You can use the predicted values of the scoring data to construct a fit plot. You merely need to sort the data by any categorical variables and by the X variable (in this case, MomWtGain). You can then plot the predicted curves. If desired, you can also append the original data and the predicted values and create a graph that overlays the data and predicted curves. You can use transparency to address the overplotting issue and also modify other features of the fit plot, such as the title, axes labels, tick positions, and so forth:

/* 4. If you want a fit plot, sort by the independent variable for each curve.
      Put QUANTILE and other covariates first, then the X variable. */
proc sort data=QRegOut out=ScoreData;
   by Quantile MomWtGain;
run;
 
/* 5. (optional) If you want to overlay the plots, it's easiest to define separate variables
      for original data and scoring data */
data All;
set Orig                                      /* original data */
    ScoreData(rename=(MomWtGain=X Pred=Y));   /* scoring data  */
run;
 
title "Quantile Regression Curves";
footnote J=C "Gain is centered: MomWtGain = Actual_Gain - 30";
proc sgplot data=All;
   scatter x=MomWtGain y=Weight / markerattrs=(symbol=CircleFilled) transparency=0.92;
   series x=X y=Y / group=Quantile lineattrs=(thickness=2) nomissinggroup name="p";
   keylegend "p" / position=right sortorder=reverseauto title="Quantile";
   xaxis values=(-20 to 40 by 10) valueshint grid  label="Mother's Relative Weight Gain (lbs)";;
   yaxis values=(1500 to 4500 by 500) valueshint grid label="Predicted Weight of Child (g)";
run;

The fit plot is shown at the top of this article.

In summary, this article shows how to use the missing value trick to evaluate a regression model in SAS. You can use this technique for any regression procedure, although newer procedures often support syntax that makes it easier to score a model.

As shown in this example, if you score the model on an equally spaced set of points for one of the continuous variables in the model, you can create a sliced fit plot. For a more complicated example, see the article "How to create a sliced fit plot in SAS."

The post How to score and graph a quantile regression model in SAS appeared first on The DO Loop.

8月 062018
 
Graph of quantile regression curves in SAS

This article shows how to score (evaluate) a quantile regression model on new data. SAS supports several procedures for quantile regression, including the QUANTREG, QUANTSELECT, and HPQUANTSELECT procedures. The first two procedures do not support any of the modern methods for scoring regression models, so you must use the "missing value trick" to score the model. (HPQUANTSELECT supports the CODE statement for scoring.) You can use this technique to construct a "sliced fit plot" that visualizes the model, as shown to the right. (Click to enlarge.)

The easy way to create a fit plot

The following DATA step creates the example data as a subset of the Sashelp.BWeight data set, which contains information about the weights of live births in the US in 1997 and information about the mother during pregnancy. The following call to PROC QUANTREG models the conditional quantiles of the baby's weight as a function of the mother's weight gain. The weight gain is centered according to the formula MomWtGain = "Actual Gain" – 30 pounds. Because the quantiles might depend nonlinearly on the mother's weight gain, the EFFECT statement generates a spline basis for the independent variable. The resulting model can flexibly fit a wide range of shapes.

Although this article shows how to create a fit plot, you can also get a fit plot directly from PROC QUANTREG. As shown below, the PLOT=FITPLOT option creates a fit plot when the model contains one continuous independent variable.

data Orig;   /* restrict to 5000 births; exclude extreme weight gains */
set Sashelp.BWeight(obs=5000 where=(MomWtGain<=40));
run;
 
proc quantreg data=Orig
              algorithm=IPM            /* use IPM algorithm for splines and binned data */
              ci=none plot(maxpoints=none)=fitplot; /* or fitplot(nodata) */
   effect MomWtSpline = spline( MomWtGain / knotmethod = equal(9) );  /* 9 knots, equally spaced */
   model Weight = MomWtSpline / quantile = 0.1 0.25 0.5 0.75 0.90;
run;
Fit plot of quantile regression curves from PROC QUANTREG in SAS

The graph enables you to visualize curves that predict the 10th, 25th, 50th, 75th, and 90th percentiles of the baby's weight based on the mother's weight gain during pregnancy. Because the data contains 5,000 observations, the fit plot suffers from overplotting and the curves are hard to see. You can use the PLOT=FITPLOT(NODATA) option to exclude the data from the plot, thus showing the quantile curves more clearly.

Score a SAS procedure by using the missing value trick

Although PROC QUANTREG can produce a fit plot when there is one continuous regressor, it does not support the EFFECTPLOT statement so you have to create more complicated graphs manually. To create a graph that shows the predicted values, you need to score the model on a new set of independent values. To use the missing value trick, do the following:

  1. Create a SAS data set (the scoring data) that contains the values of the independent variables at which you want to evaluate the model. Set the response variable to missing for each observation.
  2. Append the scoring data to the original data that are used to fit the model. Include a binary indicator variable that has the value 0 for the original data and the value 1 for the scoring data.
  3. Run the regression procedure on the combined data set. Use the OUTPUT statement to output the predicted values for the scoring data. Of course, you can also output residuals and other observation-wise statistics, if necessary.

This general technique is implemented by using the following SAS statements. The scoring data consists of evenly spaced values of the MomWtGain variable. The binary indicator variable is named ScoreData. The result of these computations is a data set named QRegOut that contains a variable named Pred that contains the predicted values for each observation in the scoring data.

/* 1. Create score data set */
data Score;
/* Optionally define additional covariates here. See the example
   "Create a sliced fit plot manually by using the missing value trick"
   https://blogs.sas.com/content/iml/2017/12/20/create-sliced-fit-plot-sas.html
*/
Weight = .;               /* set response (Y) variable to missing */
do MomWtGain = -30 to 40; /* uniform spacing in the independent (X) variable */
   output;
end;
run;
 
/* 2. Append the score data to the original data. Use a binary variable to 
      indicate which observations are the scoring data */
data Combined;
set Orig                     /* original data */
    Score(in=_ScoreData);    /* scoring data  */
ScoreData = _ScoreData;      /* binary indicator variable. ScoreData=1 for scoring data */
run;
 
/* 3. Run the procedure on the combined (original + scoring) data */
ods select ModelInfo NObs;
proc quantreg data=Combined algorithm=IPM ci=none;
   effect MomWtSpline = spline( MomWtGain / knotmethod = equal(9) );
   model Weight = MomWtSpline / quantile = 0.1 0.25 0.5 0.75 0.90;
   output out=QRegOut(where=(ScoreData=1)) /* output predicted values for the scoring data */
              P=Pred / columnwise;         /* COLUMWISE option supports multiple quantiles */
run;

This technique can be used for any SAS regression procedure. In this case, the COLUMNWISE option specifies that the output data set should be written in "long form": A QUANTILE variable specifies the quantile and the variable PRED contains the predicted values for each quantile. If you omit the COLUMNWISE option, the output data is in "wide form": The predicted values for the five quantiles are contained in the variables Pred1, Pred2, ..., Pred5.

Using predicted values to create a sliced fit plot

You can use the predicted values of the scoring data to construct a fit plot. You merely need to sort the data by any categorical variables and by the X variable (in this case, MomWtGain). You can then plot the predicted curves. If desired, you can also append the original data and the predicted values and create a graph that overlays the data and predicted curves. You can use transparency to address the overplotting issue and also modify other features of the fit plot, such as the title, axes labels, tick positions, and so forth:

/* 4. If you want a fit plot, sort by the independent variable for each curve.
      Put QUANTILE and other covariates first, then the X variable. */
proc sort data=QRegOut out=ScoreData;
   by Quantile MomWtGain;
run;
 
/* 5. (optional) If you want to overlay the plots, it's easiest to define separate variables
      for original data and scoring data */
data All;
set Orig                                      /* original data */
    ScoreData(rename=(MomWtGain=X Pred=Y));   /* scoring data  */
run;
 
title "Quantile Regression Curves";
footnote J=C "Gain is centered: MomWtGain = Actual_Gain - 30";
proc sgplot data=All;
   scatter x=MomWtGain y=Weight / markerattrs=(symbol=CircleFilled) transparency=0.92;
   series x=X y=Y / group=Quantile lineattrs=(thickness=2) nomissinggroup name="p";
   keylegend "p" / position=right sortorder=reverseauto title="Quantile";
   xaxis values=(-20 to 40 by 10) valueshint grid  label="Mother's Relative Weight Gain (lbs)";;
   yaxis values=(1500 to 4500 by 500) valueshint grid label="Predicted Weight of Child (g)";
run;

The fit plot is shown at the top of this article.

In summary, this article shows how to use the missing value trick to evaluate a regression model in SAS. You can use this technique for any regression procedure, although newer procedures often support syntax that makes it easier to score a model.

As shown in this example, if you score the model on an equally spaced set of points for one of the continuous variables in the model, you can create a sliced fit plot. For a more complicated example, see the article "How to create a sliced fit plot in SAS."

The post How to score and graph a quantile regression model in SAS appeared first on The DO Loop.

7月 022018
 
Who was the greatest US president? Presidental greatness scores through 2018

Which president of the United States is ranked the greatest by presidential historians? This article visualizes the results of the 2018 Presidential Greatness Survey, which was created and administered by B. Rottinghaus and J. Vaughn. They analyzed 166 responses from experts in political science who ranked the 44 US presidents on a 0–100 "greatness scale" where 0 represents abject failure, 50 is average, and 100 is great. The survey results are presented in their 2018 report.

Table 1 of the report (p. 2-3) provides the average greatness scores of the presidents among all respondents. Just as "beauty is in the eye of the beholder," so too "greatness" is a subjective notion that depends on the values and beliefs of the assessor. Table 2A (p. 6-7) provides the average greatness scores for respondents by the political party and political ideology of the respondents. Of the respondents, 57.2% were Democrats, 30.1% were Independent or Other, and 12.7% were Republicans. Similarly, 58.4% self-identified as liberal or somewhat liberal, 24.1% self-identified as moderate, and 17.5% reported being conservative or somewhat conservative. The relatively large proportion of liberals should be considered when interpreting the survey results.

I imported the survey results into SAS and combined them to produce a visualization of the data. The graph at the right (click to enlarge) shows the presidents ranked by their overall mean greatness score, which is shown by an X. The mean scores for the conservative, moderate, and liberal respondents are shown as filled circles. The party of the president is indicated by a colored bar that shows the range of scores.

A few results are apparent from the graph:

  • Lincoln, Washington, FDR, Teddy Roosevelt, and Jefferson are the five greatest presidents according to these experts.
  • Obama and Reagan are two recent presidents who are ranked highly.
  • Statistically speaking, there are large groups of presidents for which the relative ranking is uncertain. For example, the presidents ranked 12–20 (Madison through Monroe) have mean scores that are similar.
  • The current president, Donald Trump, is ranked last. In fairness, Trump had not yet completed his first year in office at the time of the survey (December 2017). The authors of the report dedicate several pages (p. 10–13) analyzing a survey question about who was the Most Polarizing president, a category that Trump won handily.
  • Other presidents who scored low include James Buchanan (the "Do-Nothing President"), William Henry Harrison (who died 31 days into his term), and Franklin Pierce (a weak leader in a contentious pre-Civil War era).

You can also graph the data by using the political parties of the experts rather than their ideologies. The graph is very similar.

Bias and disagreement in ranking presidents

One of the most interesting aspects of these data is the relative widths of the ranges of most 20th and 21st century presidents. These wide intervals are caused when experts of one political ideology judge a president much differently than experts from another ideology. Typically, the conservative-liberal disparity amounts to 10–15 points on the "greatness" scale, but for several presidents (Coolidge and Obama) the average conservative score is more than 20 points different from the average liberal score.

This disparity is evidence of the partisan state of politics in the US. Even these experts, who presumably use historical facts rather than opinions to guide their ranking, are not immune from seeing current events through the lens of their own personal values and biases.

To better visualize the range of disagreement among the experts, I graphed the difference between the conservative and liberal scores for each president. Again, the color of the line indicates the political party of the president. The 20th century began with McKinley's presidency. For almost every president since McKinley, the gap for Democratic presidents is negative, indicating that the conservative experts judged the presidency less favorably than the liberal experts. For Republican presidents, the direction of the gap is reversed: the conservative experts ranked the presidency more favorably than their liberal counterparts. The exception is Teddy Roosevelt, whose progressive agenda (the "Square Deal" and conservation measures such as national parks and forests) appeals to liberal experts.

Summary

Rottinghaus and Vaughn's 2018 Presidential Greatness Survey analyzes the opinions of 166 experts as to which US presidents are considered great, average, or a failure. The first graph in this blog post shows that although there is general agreement among experts for many 18th and 19th century presidents (Lincoln and Washington were great; Pierce, W. H. Harrison, and Buchanan were not.), there is less agreement for modern presidents. As seen in the second graph, the variation tends to correlate along ideological grounds, with conservatives and liberals having very different views about which presidents were great.

You can download the SAS program that creates the graphs in this article. For a different visualization of these data, see Robert Lawrence's article "All the U.S. Presidents, Ranked by 'Greatness'."

The post Ranking US presidents appeared first on The DO Loop.

6月 132018
 

If you use PROC SGPLOT to create ODS graphics, "ATTRS" are everywhere. ATTRS is an abbreviation of "attributes." Most options that change the attributes of a graphical element end with the ATTRS suffix. For example, the MARKERATTRS option modifies attributes of markers, the LINEATTRS option modifies attributes of lines, and the FILLATTRS option modifies attributes of the filled area of a bar or other region. These options are easy to remember and use.

However, there are three "ATTRS" that you might find more confusing: CYCLEATTRS, ATTRPRIORITY, and STYLEATTRS. These options determine the colors, line patterns, and marker symbols that are used to represent groups in your data. They interact with each other when you use a GROUP= option to specify groups and also use multiple statements to overlay several graph types such as scatter plots and series plots.

This article summarizes the ATTRPRIORITY, CYCLEATTRS, and STYLEATTRS keywords and provides an example that shows how they interact with each other. The end of this article presents a list of references for further reading.

What are ATTRPRIORITY, CYCLEATTRS, and STYLEATTRS?

In PROC SGPLOT, many statements support the GROUP= option. In order to distinguish one group from another, the groups are assigned different attributes such as colors, line patterns, marker symbols, and so on. The colors, patterns, and other attributes are defined by the current ODS style (or by the STYLEATTRS statement or the DATTRMAP= data set), but the ATTRPRIORITY and CYCLEATTRS keywords determine how the colors, patterns, and symbols combine for each group. The following describes the syntax and purpose of each keyword:

  • The ATTRPRIORITY option is specified on the ODS GRAPHICS statement. If you specify ATTRPRIORITY=COLOR, then groups are represented by changing the color of markers and lines, but not line patterns or marker symbols. If you specify ATTRPRIORITY=NONE, groups are represented by changing the color, line patterns, and marker symbols for each group. The ATTRPRIORITY=NONE is essential for "no color" styles such as the Journal style.
  • The CYCLEATTRS (or NOCYCLEATTERS) option affect whether attributes change when you overlay multiple graphs by using two or more statements. Notice that the word "CYCLE" begins with an "S" sound, as does the word "statements." Say the following sentence out loud, emphasizing the common "S" sounds: "CYCLEATTRS affect Statements."
  • The STYLEATTRS statement in PROC SGPLOT overrides the colors, patterns, and symbols in the current style. You can use the STYLEATTRS statement to set the combination of attributes that appear for each group.

Before looking at the interaction between these keywords, recall how the ATTRPRIORITY= option affects a graph that has only one statement. The following calls to PROC SGPLOT are identical. The only difference is the ATTRPRIORITY= option.

title "Iris Data: AttrPriority=Color";
ods graphics / AttrPriority=Color;
proc sgplot data=Sashelp.Iris;
   reg x=PetalWidth y=SepalLength / group=Species markerattrs=(size=10) lineattrs=(thickness=4);
run;
 
title "Iris Data: AttrPriority=None";
ods graphics / AttrPriority=None;
proc sgplot data=Sashelp.Iris;
   reg x=PetalWidth y=SepalLength / group=Species markerattrs=(size=10) lineattrs=(thickness=4);
run;
ATTRPRIORITY=COLOR and ATTRPRIORITY=NONE options in ODS graphics in SAS PROC SGPLOT

The plot on the left shows the result of using ATTRPRIORITY=COLOR with the HtmlBlue style. The three groups are represented by using the first three colors in the style (in this case, blue, red, and green), but the line patterns and marker symbols do not change between groups. In contrast, the plot on the left is the result of using ATTRPRIORITY=NONE. The three groups are represented by using the first three colors in the style and also the first three line patterns (solid, dashed, and dash-dot) and the first three marker symbols (circle, plus sign, and "X"). The results might depend on the current ODS style.

The interaction between ATTRPRIORITY and CYCLEATTRS

The ATTRPRIORITY and CYCLEATTRS options interact when you use two or more statements to overlay graphs. In Warren Kuhfeld's short course on "Advanced ODS Graphics Examples in SAS," he presents a slide that shows how these options interact. The following example creates a panel of graphs that is inspired by Warren's slide. Each graph in the example shows the same data (the closing stock price for two companies in 2001–2003) displayed for each of the four combinations of the ATTRPRIORITY and CYCLEATTRS options. Although the LOESS statement can display the markers, I used a separate SCATTER statement because I want to illustrate how the CYCLEATTRS statement affects the attributes.

data Stocks;
set Sashelp.Stocks;
where '01JAN2001'd <= Date <= '31OCT2003'd and Stock in ("IBM", "Microsoft");
run;
 
/* The code for each call is the same except for the options. Wrap in a macro for brevity. */
%macro GraphIt(priority, cycle);
title "AttrPriority=&priority.; &cycle.";
ods graphics / AttrPriority=&priority.;
proc sgplot data=Stocks &cycle.;
   loess   x=Date y=Close / group=Stock lineattrs=(thickness=4) NoMarkers smooth=0.75;
   scatter x=Date y=Close / group=Stock markerattrs=(size=10);
   yaxis label="Stock Price";
run;
%mend;
 
ods graphics / width=400px height=300px;
ods layout gridded columns=2 advance=table column_gutter=0 row_gutter=0;
%GraphIt(Color, NoCycleAttrs);
%GraphIt(None,  NoCycleAttrs);
%GraphIt(Color, CycleAttrs);
%GraphIt(None,  CycleAttrs);
ods layout end;
Interaction between the ATTRPRIORITY=  and CYCLEATTRS options in ODS graphics in SAS PROC SGPLOT

The panel for the example shows the interaction between the ATTRPRIORITY= and CYCLEATTRS options.

  • The top-left graph shows the graph for ATTRPRIORITY=COLOR and NOCYCLEATTRS. Only colors are used to distinguish the groups (IBM and Microsoft). The second statement (SCATTER) uses the same attributes as the first statement (LOESS).
  • The top-right graph shows the graph for ATTRPRIORITY=NONE and NOCYCLEATTRS. Colors, line patterns, and symbols are used to distinguish the groups. The second statement (SCATTER) uses the same attributes as the first statement (LOESS).
  • The bottom-left graph shows the graph for ATTRPRIORITY=COLOR and CYCLEATTRS. Only colors are used to distinguish the groups (IBM and Microsoft). The first statement (LOESS) uses the first two colors in the style (blue and red) and the second statement (SCATTER) uses the third and fourth colors in the style (green and brown). Because the markers are not filled, it might be difficult to see that the marker colors are green and brown, but you can click on the graph to enlarge it.
  • The bottom-right graph shows the graph for ATTRPRIORITY=NONE and CYCLEATTRS. Colors, line patterns, and symbols are used to distinguish the groups. The first statement (LOESS) uses the first two attributes in the style (blue-solid and red-dashed) whereas the second statement (SCATTER) uses the third and fourth attributes (green-X and brown-triangle).

Depending on your needs, either graph in the first row would be appropriate for color graphs. The lower-right plot is most suitable for ODS styles that create monochrome graphs.

The STYLEATTRS statement

After you understand the interaction between the ATTRPRIORITY= and CYCLEATTRS options, it is straightforward to use the STYLEATTRS statement. The statement merely overrides the group attributes for the current style. If you specify fewer attributes than the number of groups, the attributes are cyclically reused. For example, the following STYLEATTRS statement specifies that all lines should be solid but specifies different colors and symbols for the groups. Because two attributes are specified, the second statement reuses the same colors and symbols even though the NOCYCLEATTRS option is set:

title "AttrPriority=None; NoCycleAttrs; StyleAttrs";
ods graphics / AttrPriority=None;
proc sgplot data=Stocks NoCycleAttrs;
   styleattrs datalinepatterns=(solid)
              datacontrastcolors=(SteelBlue DarkGreen)
              datasymbols=(CircleFilled TriangleFilled);
   loess   x=Date y=Close / group=Stock lineattrs=(thickness=4) NoMarkers smooth=0.75;
   scatter x=Date y=Close / group=Stock markerattrs=(size=10);
   yaxis label="Stock Price";
run;

Reset the default attribute priority

Here's another trick I learned from Warren's course: You can use the RESET= option to reset the default value of the ATTRPRIORITY option for the current style:

ods graphics / reset=attrpriority; /* reset the default value for the current style */

Summary and further reading

This article provides an example and summarizes the behavior of the ATTRPRIORITY= option (on the ODS GRAPHICS statement) and the CYCLEATTRS option (on the PROC SGPLOT statement). These options interact when you have several statements in PROC SGPLOT that each use the GROUP= option. You can use the STYLEATTRS statement to override the default attributes for the current ODS style.

Much more can be said about these options and how they interact. In addition to the documentation links in the article, I recommend the following:

The post Attrs, attrs, everywhere: The interaction between ATTRPRIORITY, CYCLEATTRS, and STYLEATTRS in ODS graphics appeared first on The DO Loop.

5月 312018
 

A previous article showed how to use a calibration plot to visualize the goodness-of-fit for a logistic regression model. It is common to overlay a scatter plot of the binary response on a predicted probability plot (below, left) and on a calibration plot (below, right):

The SAS program that creates these plots is shown in the previous article. Notice that the markers at Y=0 and Y=1 are displayed by using a scatter plot. Although the sample size is only 500 observations, the scatter plot for the binary response suffers from overplotting. For larger data sets, the scatter plot might appear as two solid lines of markers, which does not provide any insight into the distribution of the horizontal variable. You can plot partially transparent markers, but that does not help the situation much. A better visualization is to eliminate the scatter plot and instead use a binary fringe plot (also called a butterfly fringe plot) to indicate the horizontal positions for each observed response.

A predicted probability plot with binary fringe

A predicted probability plot with a binary fringe plot for logistic regression

A predicted probability plot with a binary fringe plot is shown to the right. Rather than use the same graph area to display the predicted probabilities and the observed responses, a small "butterfly fringe plot" is shown in a panel below the predicted probabilities. The lower panel indicates the counts of the responses by using lines of various lengths. Lines that point down indicate the number of counts for Y=0 whereas lines that point up indicate the counts for Y=1. For these data, the X values less than 1 have many downward-pointing lines whereas the X values greater than 1 have many upward-pointing lines.

To create this plot in SAS, you can do the following:

  1. Use PROC LOGISTIC to output the predicted probabilities and confidence limits for a logistic regression of Y on a continuous explanatory variable X.
  2. Compute the min and max of the continuous explanatory variable.
  3. Use PROC UNIVARIATE to count the number of X values in each of 100 bins in the range [min, max] for Y=0 and Y=1.
  4. Merge the counts with the predicted probabilities.
  5. Define a GTL template to define a panel plot. The main panel shows the predicted probabilities and the lower panel shows the binary fringe plot.
  6. Use PROC SGRENDER to display the panel.

You can download the SAS program that defines the GTL template and creates the predicted probability plot.

A calibration plot with binary fringe

A calibration plot with a binary fringe plot for logistic regression

By using similar steps, you can create a calibration plot with a binary fringe plot as shown to the right. The main panel is used for the calibration plot and a small binary fringe plot is shown in a panel below it. The lower panel shows the counts of the responses at various positions. Note that the horizontal variable is the predicted probability from the model whereas the vertical variable is the empirical probability as estimated by the LOESS procedure. For these simulated data, the fringe plot shows that most of the predicted probabilities are less than 0.2 and these small values mostly correspond to Y=0. The observations for Y=1 mostly have predicted probabilities that are greater than 0.5. The fringe plot reveals that about 77% of the observed responses are Y=0, a fact that was not apparent in the original plots that used a scatter plot to visualize the response variable.

To create this plot in SAS, you can do the following:

  1. Use PROC LOGISTIC to output the predicted probabilities for any logistic regression.
  2. Use PROC LOESS to regress Y onto the predicted probability. This estimates the empirical probability for each value of the predicted probability.
  3. Use PROC UNIVARIATE to count the number of predicted probabilities for each of 100 bins in the range [0, 1] for Y=0 and Y=1.
  4. Merge the counts with the predicted probabilities.
  5. Define a GTL template to define a panel plot. The main panel shows the calibration plot and the lower panel shows the binary fringe plot.
  6. Use PROC SGRENDER to display the panel.

You can download the SAS program that defines the GTL template and creates the calibration plot.

For both plots, the frequencies of the responses are shown by using "needles," but you can make a small change to the GTL to make the fringe plot use thin bars so that it looks more like a butterfly plot of two histograms. See the program for details.

What do you think of this plot? Do you like the way that the binary fringe plot visualizes the response variable, or do you prefer the classic plot that uses a scatter plot to show the positions of Y=0 and Y=1? Leave a comment.

The post Use a fringe plot to visualize binary data in logistic models appeared first on The DO Loop.

5月 232018
 
Butterfly plot of cholesterol by gender in SAS

This article shows how to construct a butterfly plot in SAS. A butterfly plot (also called a butterfly chart) is a comparative bar chart or histogram that displays the distribution of a variable for two subpopulations. A butterfly plot for the cholesterol readings of 5,057 patients in a medical study is shown to the right, where the distribution for the males is shown on the left side of the plot and the distribution for females is displayed on the right. (Click to enlarge.) The main contribution of this article is showing how to bin a continuous variable in SAS to form a butterfly chart.

The butterfly plot is similar to a comparative histogram because both enable you to compare the distribution of a continuous variable for subpopulations. The comparative histogram uses a panel to visualize several subpopulations in a panel, where each row represents a level of a classification variable. In contrast, the butterfly plot is limited to two levels and displays the distributions back-to-back.

Bin a continuous variable for each classification level

In a previous blog post, I constructed a butterfly chart that compares voice versus text usage by decade of age for cell phones. Similarly, there is a SAS Sample that shows how to create a butterfly chart for types of cancers by gender. In both of these examples, the butterfly chart is a comparative bar chart because the distribution shown is for a discrete variable ("decade of age" or "type of cancer"). This section shows how to start with a continuous variable (cholesterol) and bin it into intervals. You can then use the previous techniques to visualize the counts in each interval for each gender.

You can bin a continuous variable by using the BIN and TABULATE functions in SAS/IML or by using the OUTHIST= option on the HISTOGRAM statement in PROC UNIVARIATE. The following statements create an output data set (OutHist) that contains the counts of males and females that have cholesterol reading within bins of width 20. The bin width and the centers of the intervals are chosen automatically by PROC UNIVARIATE, but you can use the MIDPOINTS= option (shown in the comments) to control the placement of the intervals.

proc univariate data=Sashelp.Heart;
   class Sex;
   var Cholesterol;
   histogram cholesterol / nrows=2 outhist=OutHist 
                          /* midpoints=(80 to 560 by 40) */ /* control bin widths and locations */
                           odstitle="Cholesterol by Gender";
   ods select histogram;
run;
Comparative histogram in SAS: Cholesterol by Gender

Create a butterfly plot in SAS

The OutHist data set is in "long form." You need to convert it to "wide form" in order to construct a butterfly plot. The SAS code performs the following tasks:

  • Use a DATA step and WHERE clauses to convert the data from long to wide format.
  • Multiply the counts for the males by -1. The negative counts will be plotted on the left side of the butterfly plot.
  • Define a format that will display the absolute values of the counts. The axis for the formatted variables contains zero in the middle and increases in both directions.
  • Use the HBAR statements to plot the back-to-back bar charts for males and females.
/* convert data from long format to wide format */
data Butterfly;
   keep Cholesterol Males Females;
   label Males= Females= Cholesterol=; /* remove labels */
   merge OutHist(where=(sex="Female") rename=(_COUNT_=Females _MIDPT_=Cholesterol))
         OutHist(where=(sex="Male")   rename=(_COUNT_=Males   _MIDPT_=Cholesterol));
   by Cholesterol;
   Males = -Males;                     /* trick: reverse the direction of male counts */
run;
 
/* define format that displays the absolute value of a number */
proc format;
   picture positive low-<0="000,000"
   0<-high="000,000";
run;
 
ods graphics / reset;
title "Butterfly Plot of Cholesterol Counts By Gender";
proc sgplot data=Butterfly;
   format Males Females positive.;
   hbar Cholesterol / response=Males   legendlabel="Males";
   hbar Cholesterol / response=Females legendlabel="Females";
   xaxis label="Count" grid 
         min=-520 max=520 values=(-500 to 500 by 100) valueshint;
   yaxis label="Cholesterol" discreteorder=data;
run;

The graph is shown at the top of this article. You can see that the mode of the distribution is higher for males, and the distribution for males also has a longer tail.

A butterfly fringe plot

The butterfly plot is usually displayed with horizontal bars, as shown. However, you could use the VBAR statement to get a rotated version of the butterfly plot. As mentioned earlier, you can also use the MIDPOINTS= option on the HISTOGRAM statement to change the width of the histogram bins.

One useful variation in the butterfly plot is to use a very small bin width and replace the bar chart with a high-low plot. This creates a graph that I call a butterfly fringe plot. Recall that the usual fringe plot (also called a "rug plot") places a tick mark on an axis to show the distribution of data values. A fringe plot can suffer from overplotting when more than one observation has the same value. With a butterfly fringe plot, some ticks are higher than others (to represent repeated values) and bars that point up represent one binary value whereas bars that point down represent the other. The butterfly fringe plot provides a more complete visualization of the distribution of data for two levels of a response or classification variable.

ods graphics / width=640px height=180;
title "Butterfly Fringe Plot: Cholesterol Counts By Gender";
proc sgplot data=Butterfly;
   format Males Females positive.;
   highlow x=Cholesterol low=Males high=Females;
   refline 0 / axis=y;
   inset "Males"   / position=BottomLeft;
   inset "Females" / position=TopLeft;
   yaxis label="Count" grid;
   xaxis label="Cholesterol" discreteorder=data;
run;
Butterfly fringe plot of cholesterol by gender in SAS

The butterfly fringe plot was created by using a bin width of 5 for the cholesterol variable. That is, midpoints=(50 to 560 by 5).

In summary, this article shows how to create a butterfly plot for a continuous variable and a binary classification variable. When you bin the continuous variable, you obtain counts for each interval. You can then graph the counts back-to-back to form the butterfly chart. An interesting variation is the butterfly fringe plot, which combines a butterfly chart and a fringe plot.

The post A butterfly plot for comparing distributions appeared first on The DO Loop.

5月 162018
 

In my article about how to construct calibration plots for logistic regression models in SAS, I mentioned that there are several popular variations of the calibration plot. The previous article showed how to construct a loess-based calibration curve. Austin and Steyerberg (2013) recommend the loess-based curve on the basis of an extensive simulation study. However, some practitioners prefer to use a decile calibration plot. This article shows how to construct a decile-based calibration curve in SAS.

The decile calibration plot

The decile calibration plot is a graphical analog of the Hosmer-Lemeshow goodness-of-fit test for logistic regression models. The subjects are divided into 10 groups by using the deciles of the predicted probability of the fitted logistic model. Within each group, you compute the mean predicted probability and the mean of the empirical binary response. A calibration plot is a scatter plot of these 10 ordered pairs, although most calibration plots also include the 95% confidence interval for the proportion of the binary responses within each group. Many calibration plots connect the 10 ordered pairs with piecewise line segments, others use a loess curve or a least squares line to smooth the points.

Create the decile calibration plot in SAS

The previous article simulated 500 observations from a logistic regression model logit(p) = b0 + b1*x + b2*x2 where x ~ U(-3, 3). The following call to PROC LOGISTIC fits a linear model to these simulated data. That is, the model is intentionally misspecified. A call to PROC RANK creates a new variable (Decile) that identifies the deciles of the predicted probabilities for the model. This variable is used to compute the means of the predicted probabilities and the empirical proportions (and 95% confidence intervals) for each decile:

/* Use PROC LOGISTIC and output the predicted probabilities.
   Intentionally MISSPECIFY the model as linear. */
proc logistic data=LogiSim noprint;
   model Y(event='1') = x;
   output out=LogiOut predicted=PredProb; /* save predicted probabilities in data set */
run;
 
/* To construct the decile calibration plot, identify deciles of the predicted prob. */
proc rank data=LogiOut out=LogiDecile groups=10;
   var PredProb;
   ranks Decile;
run;
 
/* Then compute the mean predicted prob and the empirical proportions (and CI) for each decile */
proc means data=LogiDecile noprint;
   class Decile;
   types Decile;
   var y PredProb;
   output out=LogiDecileOut mean=yMean PredProbMean
          lclm=yLower uclm=yUpper;
run;
 
title "Calibration Plot for Misspecified Model";
title2 "True Model Is Quadratic; Fit Is Linear";
proc sgplot data=LogiDecileOut noautolegend aspect=1;
   lineparm x=0 y=0 slope=1 / lineattrs=(color=grey pattern=dash);
   *loess x=PredProbMean y=yMean;  /* if you want a smoother based on deciles */
   series x=PredProbMean y=yMean;  /* if you to connect the deciles */
   scatter x=PredProbMean y=yMean / yerrorlower=yLower yerrorupper=yUpper;
   yaxis label="Observed Probability of Outcome";
   xaxis label="Predicted Probability of Outcome";
run;
Decile calibration curve for a misspecified logistic regression model

The diagonal line is the line of perfect calibration. In a well-calibrated model, the 10 markers should lie close to the diagonal line. For this example, the graph indicates that the linear model does not fit the data well. For the first decile of the predicted probability (the lowest predicted-risk group), the observed probability of the event is much higher than the mean predicted probability. For the fourth, sixth, and seventh deciles, the observed probability is much lower than the mean predicted probability. For the tenth decile (the highest predicted-risk group), the observed probability is higher than predicted. By the way, this kind of calibration is sometimes called internal calibration because the same observations are used to fit and assess the model.

The decile calibration plot for a correctly specified model

You can fit a quadratic model to the data to see how the calibration plot changes for a correctly specified model. The results are shown below. In this graph, all markers are close to the diagonal line, which indicates a very close agreement between the predicted and observed probabilities of the event.

Decile calibration curve for a correctly specified logistic regression model

Should you use the decile calibration curve?

The decile-based calibration plot is popular, perhaps because it is so simple that it can be constructed by hand. Nevertheless, Austin and Steyerberg (2013) suggest using the loess-based calibration plot instead of the decile-based plot. Reasons include the following:

  • The use of deciles results in estimates that "display greater variability than is evident in the loess-based method" (p. 524).
  • Several researchers have argued that the use of 10 deciles is arbitrary. Why not use five? Or 15? In fact, the results of the Hosmer-Lemeshow test "can depend markedly on the number of groups, and there's no theory to guide the choice of that number." (P. Allison, 2013. "Why I Don’t Trust the Hosmer-Lemeshow Test for Logistic Regression")

Many leading researchers in logistic regression do not recommend the Hosmer-Lemeshow test for these reasons. The decile-based calibration curve shares the same drawbacks. Since SAS can easily create the loess-based calibration curve (see the previous article), there seems to be little reason to prefer the decile-based version.

The post Decile calibration plots in SAS appeared first on The DO Loop.

5月 142018
 

A logistic regression model is a way to predict the probability of a binary response based on values of explanatory variables. It is important to be able to assess the accuracy of a predictive model. This article shows how to construct a calibration plot in SAS. A calibration plot is a goodness-of-fit diagnostic graph. It enables you to qualitatively compare a model's predicted probability of an event to the empirical probability.

Calibration curves

There are two standard ways to assess the accuracy of a predictive model for a binary response: discrimination and calibration. To assess discrimination, you can use the ROC curve. Discrimination involves counting the number of true positives, false positive, true negatives, and false negatives at various threshold values. In contrast, calibration curves compare the predicted probability of the response to the empirical probability. Calibration can help diagnose lack of fit. It also helps to rank subjects according to risk.

This article discusses internal calibration, which is the agreement on the sample used to fit the model. External calibration, which involves comparing the predicted and observed probabilities on a sample that was not used to fit the model, is not discussed.

In the literature, there are two types of calibration plots. One uses a smooth curve to compare the predicted and empirical probabilities. The curve is usually a loess curve, but sometimes a linear regression curve is used. The other (to be discussed in a future article) splits the data into deciles. An extensive simulation study by Austin and Steyerberg (2013) concluded that loess-based calibration curves have several advantages. This article discusses how to use a loess fit to construct a calibration curve.

A calibration curve for a misspecified model

Calibration curves help to diagnose lack of fit. This example simulates data according to a quadratic model but create a linear fit to that data. The calibration curve should indicate that the model is misspecified.

Step 1: Simulate data for a quadratic logistic regression model

Austin and Steyerberg (2013) use the following simulated data, which is based on the earlier work of Hosmer et al. (1997). Simulate X ~ U(-3, 3) and define η = b0 + b1*X + b2*X2) to be a linear predictor, where b0 = -2.337, b1 = 0.8569, b2 = 0.3011. The logistic model is logit(p) = η, where p = Pr(Y=1 | x) is the probability of the binary event Y=1. The following SAS DATA step simulates N=500 observations from this logistic regression model:

/* simulate sample of size N=500 that follows a quadratic model from Hosmer et al. (1997) */
%let N = 500;                              /* sample size */ 
data LogiSim(keep=Y x);
call streaminit(1234);                     /* set the random number seed */
do i = 1 to &N;                            /* for each observation in the sample */
   /* Hosmer et al. chose x ~ U(-3,3). Use rand("Uniform",-3,3) for modern versions of SAS */
   x = -3 + 6*rand("Uniform");
   eta = -2.337 + 0.8569*x + 0.3011*x**2;  /* quadratic model used by Hosmer et al. */
   mu = logistic(eta);
   Y = rand("Bernoulli", mu);
   output;
end;
run;

For this simulated data, the "true" model is known to be quadratic. Of course, in practice, the true underlying model is unknown.

Step 2: Fit a logistic model

The next step is to fit a logistic regression model and save the predicted probabilities. The following call to PROC LOGISTIC intentionally fits a linear model. The calibration plot will indicate that the model is misspecified.

/* Use PROC LOGISTIC and output the predicted probabilities.
   Intentionally MISSPECIFY the model as linear. */
proc logistic data=LogiSim plots=effect;
   model Y(event='1') = x;
   output out=LogiOut predicted=PredProb; /* save predicted probabilities in data set */
run;
Plot of predicted probabilities for a logistic regression model in SAS

As shown above, PROC LOGISTIC can automatically create a fit plot for a regression that involves one continuous variable. The fit plot shows the observed responses, which are plotted at Y=0 (failure) or Y=1 (success). The predicted probabilities are shown as a sigmoidal curve. The predicted probabilities are in the range (0, 0.8). The next section creates a calibration plot, which is a graph of the predicted probability versus the observed response.

Create the calibration plot

The output from PROC LOGISTIC includes the predicted probabilities and the observed responses. Copas (1983), Harrell et al. (1984), and others have suggested assessing calibration by plotting a scatter plot smoother and overlaying a diagonal line that represents the line of perfect calibration. If the smoother lies close to the diagonal, the model is well calibrated. If there is a systematic deviation from the diagonal line, that indicates that the model might be misspecified.

The following statements sort the data by the predicted probabilities and create the calibration plot. (The sort is optional for the loess smoother, but is useful for other analyses.)

proc sort data=LogiOut;  by PredProb;  run;
 
/* let the data choose the smoothing parameter */
title "Calibration Plot for Misspecified Model";
title2 "True Model Is Quadratic; Fit Is Linear";
proc sgplot data=LogiOut noautolegend aspect=1;
   loess x=PredProb y=y / interpolation=cubic clm;   /* smoothing value by AICC (0.657) */
   lineparm x=0 y=0 slope=1 / lineattrs=(color=grey pattern=dash);
run;
Calibration plot for a misspecified logistic model

The loess smoother does not lie close to the diagonal line, which indicates that the linear model is not a good fit for the quadratic data. The loess curve is systematically above the diagonal line for very small and very large values of the predicted probability. This indicates that the empirical probability (the proportion of events) is higher than predicted on these intervals. In contrast, when the predicted probability is in the range [0.1, 0.45], the empirical probability is lower than predicted.

Several variations of the calibration plot are possible:

  • Add the NOMARKERS option to the LOESS statement to suppress the observed responses.
  • Omit the CLM option if you do not want to see a confidence band.
  • Use the SMOOTH=0.75 option if you want to specify the value for the loess smoothing parameter. The value 0.75 is the value that Austin and Steyerberg used in their simulation study, in part because that is the default parameter value in the R programming language.
  • Use a REG statement instead of the LOESS statement if you want a polynomial smoother.

A calibration plot for a correctly specified model

In the previous section, the model was intentionally misspecified. The calibration curve for the example did not lie close to the diagonal "perfect calibration" line, which indicates a lack of fit. Let's see how the calibration curve looks for the same data if you include a quadratic term in the model. First, generate the predicted probabilities for the quadratic model, then create the calibration curve for the model:

proc logistic data=LogiSim noprint;
   model Y(event='1') = x x*x;  /* fit a quadratic model */
   output out=LogiOut2 predicted=PredProb;
run;
 
proc sort data=LogiOut2;  by PredProb;  run;
 
title "Calibration Plot for a Correct Model";
proc sgplot data=LogiOut2 noautolegend aspect=1;
   loess x=PredProb y=y / interpolation=cubic clm nomarkers;
   lineparm x=0 y=0 slope=1 / lineattrs=(color=grey pattern=dash);
   yaxis grid; xaxis grid;
run;
Calibration plot for a correctly specified logistic model

This calibration plot indicates that the quadratic model fits the data well. The calibration curve is close to the diagonal reference line, which is the line of perfect calibration. The curve indicates that the predicted and empirical probabilities are close to each other for low-risk, moderate-risk, and high-risk subjects.

It is acceptable for the calibration curve to have small dips, bends, and wiggles. The extent that the curve wiggles is determined by the (automatically chosen) smoothing parameter for the loess algorithm, The value of the smoothing parameter is not displayed by PROC SGPLOT, but you can use PROC LOESS to display the smoothing parameter and other statistics for the loess smoother.

According to Austin and Steyerberg (2013), the calibration curve can detect misspecified models when the magnitude of the omitted term (or terms) is moderate to strong. The calibration curve also behaves better for large samples and when the incidence of outcomes is close to 50% (as opposed to having rare outcomes). Lastly, there is a connection between discrimination and calibration: The calibration curve is most effective in models for which the discrimination (as measured by the C-statistic or area under the ROC curve) is good.

In summary, this article shows how to construct a loess-based calibration curve for logistic regression models in SAS. To create a calibration curve, use PROC LOGISTIC to output the predicted probabilities for the model and plot a loess curve that regresses the observed responses onto the predicted probabilities.

My next article shows how to construct a popular alternative to the loess-based calibration curve. In the alternative version, the subjects are ranked into deciles according to the predicted probability or risk.

Further reading

The post Calibration plots in SAS appeared first on The DO Loop.