Rick Wicklin

December 17, 2018
 

Many data analysts use a quantile-quantile plot (Q-Q plot) to graphically assess whether data can be modeled by a probability distribution such as the normal, lognormal, or gamma distribution. You can use the QQPLOT statement in PROC UNIVARIATE to create a Q-Q plot for about a dozen common distributions. However, it can be useful to use a variant of the Q-Q plot called the probability plot, which enables you to graphically assess how well a model fits the tails of the data. A probability plot can also be created in PROC UNIVARIATE. It is essentially a Q-Q plot in which the X axis is labeled nonlinearly by using percentiles of the model distribution. This article describes how to create and interpret a probability plot in SAS.

Use a probability plot to compare empirical and theoretical percentiles

When fitting a distribution by using maximum likelihood estimation or some other method, you might notice that the model fits well in the center of the data but not in the tails. This may or may not be a problem. If you want to model your "typical" customers or patients, the model might be useful even if it does not fit perfectly in the tails.

Goodness-of-fit (GOF) tests indicate how well the model fits the data everywhere. Deviations in the tail can cause a GOF test to reject the hypothesis that the model fits well. You can use a probability plot to determine the percentiles at which the model begins to deviate from the data.

The following example is from the PROC UNIVARIATE documentation. The data are the thickness of the copper plating of 100 circuit boards. To illustrate that a model might fail to fit the tails of the data, I have artificially created four fake outliers and appended them to the end of the data. The example fits a normal distribution to the data and creates two Q-Q plots and a probability plot:

data Trans;
   input Thick @@;
   label Thick = 'Plating Thickness (mils)';
   if _N_ <= 100 then Group="Real Data";
   else Group = "Fake Data";   /* The last four observations are FAKE outliers */
   datalines;
3.468 3.428 3.509 3.516 3.461 3.492 3.478 3.556 3.482 3.512
3.490 3.467 3.498 3.519 3.504 3.469 3.497 3.495 3.518 3.523
3.458 3.478 3.443 3.500 3.449 3.525 3.461 3.489 3.514 3.470
3.561 3.506 3.444 3.479 3.524 3.531 3.501 3.495 3.443 3.458
3.481 3.497 3.461 3.513 3.528 3.496 3.533 3.450 3.516 3.476
3.512 3.550 3.441 3.541 3.569 3.531 3.468 3.564 3.522 3.520
3.505 3.523 3.475 3.470 3.457 3.536 3.528 3.477 3.536 3.491
3.510 3.461 3.431 3.502 3.491 3.506 3.439 3.513 3.496 3.539
3.469 3.481 3.515 3.535 3.460 3.575 3.488 3.515 3.484 3.482
3.517 3.483 3.467 3.467 3.502 3.471 3.516 3.474 3.500 3.466
3.624 3.367 3.625 3.366
;
 
title 'Analysis of Plating Thickness';
proc univariate data=Trans noprint;
   qqplot   Thick / normal grid odstitle="Q-Q Plot";
   qqplot   Thick / normal PCTLAXIS(grid) odstitle="Q-Q Plot with PCTLAXIS Option";
   probplot Thick / normal grid odstitle="Probability Plot";
run;
SAS quantile-quantile plot (Q-Q plot) that compares quantiles of the data with quantiles of the normal distribution

The first Q-Q plot indicates whether the model (in this case, the normal distribution) fits the data. If the bulk of the data falls along a straight line, then there is evidence that the model fits. In this case, the model fits most of the data but not the four (artificial) points that I added.

If these were all real data, you might wrestle with whether you should accept the normal model or choose an alternative model. The Q-Q plot indicates that the normal model seems to fit the central portion of the data.

It would be useful if the plot displayed the percentiles of the normal distribution. The next image shows the Q-Q plot with the PCTLAXIS option (in the background) and a probability plot (in the foreground). Notice that the positions of the markers are the same as for the Q-Q plot; only the labels on the axes have changed. The PCTLAXIS option for the second QQPLOT statement creates a graph that displays percentiles of the normal distribution at the top of the plot. The probability plot eliminates the quantiles altogether and displays only the normal percentiles.

A probability plot is a variation of a Q-Q plot in which the quantiles of the model are replaced with probabilities

By using the probability plot, you can estimate that the model fits well for the central 95% of the data. Only the lower and upper 2.5% tails of the model do not fit the data. If your goal is to model only the central 95% of the data, it might be fine to ignore the extreme data.

Create a custom probability plot

I have previously shown how to construct a Q-Q plot from first principles. If you want to create a probability plot, you first create a Q-Q plot but then use the QUANTILE function to find the quantiles that correspond to the probabilities that you want to display on the axis. For example, the following SAS/IML statements print the quantiles for a set of typical probabilities:

proc iml;
p = {0.001 0.01 0.05 0.10 0.25 0.5 0.75 0.9 0.95 0.99 0.999};
qntl = quantile("Normal", p);         /* compute tick locations */
print (qntl // 100*p)[r={"Quantiles" "Percentiles"} F=Best4.];
quit;
Standard normal quantiles and the associated probabilities

You can use the VALUES= and VALUESDISPLAY= options on the XAXIS statement in PROC SGPLOT to display the probability values at the locations of the corresponding quantiles. The following DATA step creates the coordinates for a Q-Q plot; the PROC SGPLOT call then uses the previous table of quantile values to specify the tick locations and labels on the X axis. This approach is useful when you want to customize the probability plot, such as specifying the axis labels, grid lines, and attributes of the markers.

proc sort data=Trans; by Thick; run;
 
data ProbPlot;
set Trans nobs=nobs;
y = Thick;                           /* for convenience, call variable Y */
v = (_N_ - 0.375) / (nobs + 0.25);   /* Blom (1958) */
q = quantile("Normal", v);           
run;
 
title "Custom Probability Plot";
proc sgplot data=ProbPlot;
scatter x=q y=y;
yaxis grid label="Plating Thickness (mils)";
xaxis values=(-3.1 -2.3 -1.6 -1.3 -.67 0.00 0.67 1.28 1.64 2.33 3.09)
      valuesdisplay=('0.1' '1' '5' '10' '25' '50' '75' '90' '95' '99' '99.9')
      grid label="Normal Percentiles" fitpolicy=none;
run;
A custom probability plot in SAS

The VALUES= and VALUESDISPLAY= options are very useful. I use them regularly to customize the location of tick marks and the values that are displayed at each tick.

In summary, you can use the QQPLOT statement (with the PCTLAXIS option) or the PROBPLOT statement in PROC UNIVARIATE to create a probability plot. A probability plot is essentially the same as a Q-Q plot except that the X axis displays the percentiles of a model distribution instead of quantiles. If you want additional customization (or want to examine a model that is not supported by PROC UNIVARIATE), then you can create the Q-Q plot manually. You can use the VALUES= and VALUESDISPLAY= options on the XAXIS statement in PROC SGPLOT to display the percentiles of the model distribution. For more about how to interpret a probability plot (especially for a non-normal reference distribution), see the PROC UNIVARIATE documentation.

The post Create a probability plot in SAS appeared first on The DO Loop.

December 12, 2018
 

This article describes best practices and techniques that every data analyst should know before bootstrapping in SAS. The bootstrap method is a powerful statistical technique, but it can be a challenge to implement it efficiently. An inefficient bootstrap program can take hours to run, whereas a well-written program can give you an answer in an instant. If you prefer "instants" to "hours," this article is for you! I’ve compiled dozens of resources that explain how to compute bootstrap statistics in SAS.

Overview: What is the bootstrap method?

Recall that a bootstrap analysis enables you to investigate the sampling variability of a statistic without making any distributional assumptions about the population. For example, if you compute the skewness of a univariate sample, you get an estimate for the skewness of the population. You might want to know the range of skewness values that you might observe from a second sample (of the same size) from the population. If the range is large, the original estimate is imprecise. If the range is small, the original estimate is precise. Bootstrapping enables you to estimate the range by using only the observed data.

In general, the basic bootstrap method consists of four steps:

  1. Compute a statistic for the original data.
  2. Use the DATA step or PROC SURVEYSELECT to resample (with replacement) B times from the data. The resampling process should respect the null hypothesis or reflect the original sampling scheme. For efficiency, you should put all B random bootstrap samples into a single data set.
  3. Use BY-group processing to compute the statistic of interest on each bootstrap sample. The BY-group approach is much faster than using macro loops. The collection of B statistics is the bootstrap distribution, which approximates the sampling distribution of the statistic under the null hypothesis. Don't forget to turn off ODS when you run BY-group processing!
  4. Use the bootstrap distribution to obtain estimates for the bias and standard error of the statistic and confidence intervals for parameters.

The links in the previous list provide examples of best practices for bootstrapping in SAS. In particular, do not fall into the trap of using a macro loop to "resample, analyze, and append." You will eventually get the correct bootstrap estimates, but you might wait a long time to get them!
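To make these four steps concrete, the following program is a minimal sketch that bootstraps the skewness statistic. The data set Sample and the variable x are placeholders for your own data; the seed and the number of resamples are arbitrary.

/* Step 2: generate B=1000 bootstrap samples in a single data set */
proc surveyselect data=Sample out=BootSamples noprint seed=12345
     method=urs samprate=1 reps=1000 outhits;
run;
 
/* Step 3: BY-group processing computes the statistic for each resample */
proc means data=BootSamples noprint;
   by Replicate;
   var x;
   output out=BootDist skew=Skew;   /* bootstrap distribution of skewness */
run;
 
/* Step 4: percentiles of the bootstrap distribution give a 95% CI */
proc univariate data=BootDist noprint;
   var Skew;
   output out=CI pctlpts=2.5 97.5 pctlpre=CI_;
run;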

The remainder of this article is organized by the three ways to perform bootstrapping in SAS:

  • Programming: You can write a SAS DATA step program or a SAS/IML program that resamples from the data and analyzes each (re)sample. The programming approach gives you complete control over all aspects of the bootstrap analysis.
  • Macros: You can use the %BOOT and %BOOTCI macros that are supplied by SAS. The macros handle a wide variety of common bootstrap analyses.
  • Procedures: You can use bootstrap options that are built into several SAS procedures. The procedure internally implements the bootstrap method for a particular set of statistics.

Programming the basic bootstrap in SAS

The articles in this section describe how to program the bootstrap method in SAS for basic univariate analyses, for regression analyses, and for related resampling techniques such as the jackknife and permutation tests. This section also links to articles that describe how to generate bootstrap samples in SAS.

Examples of basic bootstrap analyses in SAS

  • The basic bootstrap in SAS: SAS enables you to resample the data by using PROC SURVEYSELECT. When coupled with BY-group processing, you can perform a very efficient bootstrap analysis in SAS, including the estimate of standard errors and percentile-based confidence intervals.
  • The basic bootstrap in SAS/IML: The SAS/IML language provides a compact language for bootstrapping, as shown in this basic bootstrap example.
  • The smooth bootstrap: As originally conceived, a bootstrap sample contains replicates of the data. However, there are situations when "jittering" the data provides a better approximation of the sampling distribution.
  • Bias-corrected and adjusted (BCa) confidence interval: For highly skewed data, the percentile-based confidence intervals are less efficient than the BCa confidence interval.
  • Bootstrap the difference of means between two groups: This example shows how to bootstrap a statistic in a two-sample t test.

Examples of bootstrapping for regression statistics

When you bootstrap regression statistics, you have two choices for generating the bootstrap samples:

  • Case resampling: You can resample the observations (cases) to obtain bootstrap samples of the responses and the explanatory variables.
  • Residual resampling: Alternatively, you can bootstrap regression parameters by fitting a model and resampling from the residuals to obtain new responses.
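The following program sketches residual resampling for a simple linear model. The data set Orig and the variables Y and X are placeholders; the program constructs one bootstrap sample of responses, and in practice you would repeat the construction B times.

/* fit the model once and save the predicted values and residuals */
proc reg data=Orig noprint;
   model Y = X;
   output out=Fit predicted=Pred residual=Resid;
quit;
 
/* form new responses: prediction + randomly chosen residual */
data BootResid;
   call streaminit(54321);
   set Fit nobs=nobs;
   k = rand("Integer", 1, nobs);   /* random residual index; requires SAS 9.4M5 */
   set Fit(keep=Resid rename=(Resid=RandResid)) point=k;
   YBoot = Pred + RandResid;       /* bootstrap response */
run;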

Jackknife and permutation tests in SAS

  • The jackknife method: The jackknife is an alternative nonparametric method for obtaining standard errors for statistics. It is deterministic because it uses leave-one-out samples rather than random samples.
  • Permutation tests: A permutation test is a resampling technique that is closely related to the bootstrap. You permute the observations between two groups to test whether the groups are significantly different.

Generate bootstrap samples

An important part of a bootstrap analysis is generating multiple bootstrap samples from the data. In SAS, there are many ways to obtain the bootstrap samples:

  • Sample with replacement: The most common resampling technique is to randomly sample with replacement from the data. You can use the SAS DATA step (see the sketch after this list), the SURVEYSELECT procedure, or the SAMPLE function in SAS/IML.
  • Samples in random order: It is sometimes useful to generate random samples in which the order of the observations is randomly permuted.
  • Balanced bootstrap resampling: Instead of random samples, some experts advocate a resampling algorithm in which each observation appears exactly B times in the union of the B bootstrap samples.
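The following DATA step is a sketch of the first technique. It assumes a data set named Sample and uses the POINT= option to read randomly chosen observations; the RAND("Integer") call requires SAS 9.4M5.

data BootDS;
   call streaminit(12345);
   do Replicate = 1 to 1000;            /* B = 1000 bootstrap samples */
      do i = 1 to nobs;                 /* each sample has n observations */
         k = rand("Integer", 1, nobs);  /* random row, sampled with replacement */
         set Sample point=k nobs=nobs;
         output;
      end;
   end;
   stop;                                /* STOP is required when you use POINT= */
   drop i k;
run;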

Bootstrap macros in SAS

The SAS-supplied macros %BOOT, %JACK, and %BOOTCI can perform basic bootstrap analyses and jackknife analyses. However, they require a familiarity with writing and using SAS macros. If you are interested, I wrote an example that shows how to use the %BOOT and %BOOTCI macros for bootstrapping. The documentation also provides several examples.

SAS procedures that support bootstrapping

Many SAS procedures not only compute statistics but also provide standard errors or confidence intervals that enable you to infer whether an estimate is precise. Many confidence intervals are based on distributional assumptions about the population. ("If the errors are normally distributed, then....") However, the following SAS procedures provide an easy way to obtain a distribution-free confidence interval by using the bootstrap. See the SAS/STAT documentation for the syntax for each procedure.

  • PROC CAUSALMED introduced the BOOTSTRAP statement in SAS/STAT 14.3 (SAS 9.4M5). The statement enables you to compute bootstrap estimates of standard errors and confidence intervals for various effects and percentages of total effects.
  • PROC MULTTEST supports the BOOTSTRAP and PERMUTATION options, which enable you to compute estimates of p-values that make no distributional assumptions.
  • PROC NLIN supports the BOOTSTRAP statement, which computes bootstrap confidence intervals for parameters and bootstrap estimates of the covariance of the parameter estimates.
  • PROC QUANTREG supports the CI=RESAMPLING option to construct confidence intervals for regression quantiles.
  • The SURVEYMEANS, SURVEYREG, SURVEYLOGISTIC, SURVEYPHREG, SURVEYIMPUTE, and SURVEYFREQ procedures introduced the VARMETHOD=BOOTSTRAP option in SAS 9.4M5. The option enables you to compute bootstrap estimates of variance. With the exception of SURVEYIMPUTE, these procedures also support jackknife estimates. The jackknife is similar to the bootstrap but uses a leave-one-out deterministic scheme rather than random resampling.
  • PROC TTEST introduced the BOOTSTRAP statement in SAS/STAT 14.3. The statement enables you to compute bootstrap standard error, bias estimates, and confidence limits for means and standard deviations in t tests. In SAS/STAT 15.1 (SAS 9.4M6), the TTEST procedure provides extensive graphics that visualize the bootstrap distribution.
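As an illustration of the syntax, the following call is a hedged sketch of the BOOTSTRAP statement in PROC TTEST. The options shown are a minimal subset; consult the documentation for the complete syntax.

proc ttest data=Sashelp.Cars;
   var MPG_City;
   bootstrap / nsamples=10000 seed=12345;  /* bootstrap standard errors, bias, and CIs */
run;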

Summary

Resampling techniques such as bootstrap methods and permutation tests are widely used by modern data analysts. But how you implement these techniques can make a huge difference between getting the results in a few seconds versus a few hours. This article summarizes and consolidates many previous articles that demonstrate how to perform an efficient bootstrap analysis in SAS. Bootstrapping enables you to investigate the sampling variability of a statistic without making any distributional assumptions. In particular, the bootstrap is often used to estimate standard errors and confidence intervals for parameters.

The post The essential guide to bootstrapping in SAS appeared first on The DO Loop.

December 10, 2018
 
The best way to spread Christmas cheer
is singing loud for all to hear!
-Buddy in Elf

In the Christmas movie Elf (2003), Jovie (played by Zooey Deschanel) must "spread Christmas cheer" to help Santa. She chooses to sing "Santa Claus is coming to town," and soon all of New York City is singing along.

The best sing-along songs are short and have lyrics that repeat. Jovie's choice, "Santa Claus is coming to town," satisfies both criteria. The musical structure of the song is simple:

  • Verse 1: You better watch out / You better not cry / Better not pout / I'm telling you why
  • Tag line: Santa Claus is coming to town
  • Verse 2: He's making a list / And checking it twice; / Gonna find out / Who's naughty and nice
  • Tag line repeats
  • Bridge: He sees you when you're sleeping / He knows when you're awake / He knows if you've been bad or good / So be good for goodness sake! / O!
  • Verse 1 repeats
  • Partial tags and final tag: Santa Claus is coming / Santa Claus is coming / Santa Claus is coming to town

There is a fun way to visualize repetition in song lyrics. For a song that has N words, you can define the repetition matrix to be the N x N matrix where the (i,j)th cell has the value 1 if the i-th word is the same as the j-th word. Otherwise, the (i,j)th cell equals 0. You can visualize the matrix by using a two-color heat map. Colin Morris has a web site devoted to these visualizations.
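The following SAS/IML program is a small sketch that builds and displays a repetition matrix for a toy lyric. It assumes that the lyrics have already been lowercased and split into a vector of words; the HEATMAPDISC subroutine displays the two-color heat map.

proc iml;
words = {"santa" "claus" "is" "coming" "to" "town"
         "santa" "claus" "is" "coming" "to" "town"};
n = ncol(words);
R = j(n, n, 0);                       /* N x N matrix of zeros */
do i = 1 to n;
   do j = 1 to n;
      R[i,j] = (words[i] = words[j]); /* 1 if the i-th word equals the j-th word */
   end;
end;
call heatmapdisc(R);                  /* two-color heat map of the repetition */
quit;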

The following image visualizes the lyrics of "Santa Claus is coming to town." I have added some vertical and horizontal lines to divide the lyrics into seven sections: the verses (V1 and V2), the tag line (S), and the bridge (B).

The image shows the structure of the repetition in the song lyrics:

  • The first verse contains the repetition of the words 'you', 'better', and 'not'.
  • The second verse repeats only the word 'out' from Verse 1.
  • The bridge repeats the word 'you', which appeared three times in Verse 1. It also repeats several words ('when', 'knows', 'good', ...) within the bridge.
  • The tag line "Santa Claus is coming [to town]" is repeated a total of five times.

Now that you understand what a repetition matrix looks like and how to interpret it, let's visualize a few other classic Christmas songs that contain repetitive lyrics! To help "spread Christmas cheer," I'll use shades of red and green to visualize the lyrics, rather than the boring white and black colors.

The Twelve Days of Christmas

If you make a list of Christmas songs that have repetition, chances are "The Twelve Days of Christmas" will be at the top of the list. The song is formulaic: each new verse adds a few new words before repeating the words from the previous verse. As a result, the repetition matrix is almost boring in its regularity. Here is the visualization of the classic song (click to enlarge):

Little Drummer Boy

Another highly repetitive Christmas song is "The Little Drummer Boy," which features an onomatopoeic phrase (Pa rum pum pum pum) that alternates with the other lyrics. A visualization of the classic song is shown below:

Silver Bells

In addition to repeating the title, "Silver Bells" repeats several phrases. Most notably, the phrase "Soon it will be Christmas Day" is repeated multiple times at the end of the song. Because only certain phrases are repeated, the visualization has a pleasing structure that complements the song's lyrical qualities:

Silent Night

To contrast the hustle, bustle, and commercialism of Christmas, I enjoy hearing songs that are musically simple. One of my favorites is "Silent Night." Each verse is distinct, yet each begins with "Silent night, holy night!" and ends by repeating a phrase. The resulting visualization is devoid of clutter. It is visually empty and matches the lyrical imagery, "all is calm, all is bright."

Your turn!

You can download the SAS program that creates these images. The program also computes visualizations of some contemporary songs such as "Last Christmas" by Wham!, "Someday at Christmas" (Stevie Wonder version), "Rockin' Around the Christmas Tree" (Brenda Lee version), and "Happy XMas (War Is Over)" by John Lennon and Yoko Ono. If you have access to SAS, you can even add your own favorite lyrics to the program! If you don't have access to SAS, Colin Morris's website enables you to paste in the lyrics and see the visualization.

In a little-known "deleted scene" from Elf, Buddy says that the second-best way to spread Christmas cheer is posting images for all to share! So post a comment and share your favorite visualization of a Christmas song!

Happy holidays to all my readers. I am grateful for you. Merry Christmas to all, and to all a good night!

The post Visualize Christmas songs appeared first on The DO Loop.

December 5, 2018
 

Recently a SAS programmer wanted to obtain a table of counts that was based on a histogram. I showed him how you can use the OUTHIST= option on the HISTOGRAM statement in PROC UNIVARIATE to obtain that information. For example, the following call to PROC UNIVARIATE creates a histogram for the MPG_City variable in the Sashelp.Cars data set. The histogram has 11 bins. The OUTHIST= option writes the counts for each bin to a SAS data set:

proc univariate data=Sashelp.Cars noprint;
   var MPG_City;
   histogram MPG_City / barlabel=count outhist=MidPtOut;
run;
 
proc print data=MidPtOut label;
   label _MIDPT_ = "Midpoint" _COUNT_="Frequency";
   var _MIDPT_ _COUNT_;
run;

Endpoints versus midpoints

As I've previously discussed, PROC UNIVARIATE supports two options for specifying the locations of bins. The MIDPOINTS option specifies that "nice" numbers (for example, multiples of 2, 5, or 10) are used for the midpoints of the bins; the ENDPOINTS option specifies that nice numbers are used for the endpoints of the bins. By default, midpoints are used, as shown in the previous section. The following call to PROC UNIVARIATE uses the ENDPOINTS option and writes the new bin counts to a data set. The histogram is not shown.

proc univariate data=Sashelp.Cars noprint;
   var MPG_City;
   histogram MPG_City / barlabel=count endpoints outhist=EndPtOut;
run;
 
proc print data=EndPtOut label;
   label _MINPT_ = "Left Endpoint" _COUNT_="Frequency";
   var _MINPT_ _COUNT_;
run;

Tabulating counts in the SAS/IML language

If you want to "manually" count the number of observations in each bin, you have a few choices. If you already know the bin width and anchor position for the bins, then you can use a DATA step array to accumulate the counts. You can also use PROC FORMAT to define a format to bin the observations and use PROC FREQ to tabulate the counts.
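For example, the following sketch bins the MPG_City variable by using a user-defined format. It assumes that you have already decided that [10,15), [15,20), and so forth, are the intervals you want; the format name is illustrative.

proc format;
   value MPGFmt  10 -< 15 = "[10,15)"   15 -< 20 = "[15,20)"
                 20 -< 25 = "[20,25)"   25 -< 30 = "[25,30)"
                 30 - high = "30+";
run;
 
proc freq data=Sashelp.Cars;
   format MPG_City MPGFmt.;           /* bin the values */
   tables MPG_City / nocum;           /* tabulate the binned counts */
run;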

The harder problem is when you do not have a prior set of "nice" values to use as the endpoints of bins. It is usually not satisfactory to use the minimum and maximum data values as endpoints of the binning intervals because that might result in intervals whose endpoints are long decimal values such as [3.4546667 4.0108333].

Fortunately, the SAS/IML language provides the GSCALE subroutine, which computes "nice" values from a vector of data and the number of bins. The GSCALE routine returns a three-element vector. The first element is the minimum value of the leftmost interval, the second element is the maximum value of the rightmost interval, and the third element is the bin width. For example, the following SAS/IML statements compute nice intervals for the data in the MPG_City variable:

proc iml;
use Sashelp.Cars;
   read all var "MPG_City" into X;
close;
 
/* GSCALE subroutine computes "nice" tick values: s[1]<=min(x); s[2]>=max(x) */
call gscale(s, x, 10);  /* ask for about 10 intervals */
print s[rowname={"Start" "Stop" "Increment"}];

The output from the GSCALE subroutine suggests that a good set of intervals to use for binning the data are [10, 15), [15, 20), ..., [55, 60]. These are the same endpoints that are generated by using the ENDPOINTS option in PROC UNIVARIATE. (Actually, the procedure uses half-open intervals for all bins, so it adds the extra interval [60, 65) to the histogram.)

I've previously shown how to use the BIN and TABULATE functions in SAS/IML to count the observations in a set of bins. The following statements use the values from the GSCALE routine to form evenly spaced cutpoints for the binning:

cutPoints = do(s[1], s[2], s[3]);    /* use "nice" cutpoints from GSCALE */
*cutPoints = do(s[1], s[2]+s[3], s[3]);  /* ALTERNATIVE: add additional cutpoint to match UNIVARIATE */
b = bin(x, cutPoints);               /* find bin for each obs */
call tabulate(bins, freq, b);        /* count how many obs in each bin */
binLabels = char(cutPoints[bins]);   /* use left endpoint as labels for bins */
print freq[colname = binLabels label="Count"];

Except for the last interval, the counts are the same as for the ENDPOINTS option in PROC UNIVARIATE. It is a matter of personal preference whether you want to treat the last interval as a closed interval or whether you want all intervals to be half open. If you want to exactly match PROC UNIVARIATE, you can modify the definition of the cutPoints variable, as indicated in the program comments.

Notice that the TABULATE routine only reports the bins that have nonzero counts. If you prefer to obtain counts for ALL bins—even bins with zero counts—you can use the TabulateLevels module, which I described in a previous blog post.

In summary, you can use PROC UNIVARIATE or SAS/IML to create a tabular representation of a histogram. Both procedures provide a way to obtain "nice" values for the bin endpoints. If you already know the endpoints for the bins, you can use other techniques in SAS to produce the table.

The post When is a histogram not a histogram? When it's a table! appeared first on The DO Loop.

December 3, 2018
 

When a graph includes several markers or line styles, it is often useful to create a legend that explains the relationship between the data and the symbols, colors, and line styles in the graph. The SGPLOT procedure does a good job of automatically creating and placing a legend for most graphs. However, sometimes it is useful to override the procedure's default choices. This article describes five tips that you can use to customize the content and placement of legends. The tips are:

  1. Suppress the legend by using the NOAUTOLEGEND option.
  2. Choose which components of the graph appear in the legend by using a KEYLEGEND statement and the NAME= option.
  3. Position the legend by using the LOCATION= and POSITION= option on the KEYLEGEND statement.
  4. Exclude one or more items from a legend by using the EXCLUDE= option on the KEYLEGEND statement (requires SAS 9.4M3).
  5. Consolidate one or more items by using the LEGENDITEM statement (requires SAS 9.4M5).

1. Suppress the legend

By default, the SGPLOT procedure displays a legend when there are multiple plots that are overlaid in the graph. This can be caused by multiple statements or by using the GROUP= option on a statement. If the information in the default legend is redundant, you might want to suppress it. For example, the following legend is unnecessary because the title explains the data and the regression line. You can uncomment the NOAUTOLEGEND option to suppress the legend.

title "Linear Regression for Weight and Height";
title2 "The legend is unnecessary";
proc sgplot data=Sashelp.Class /* NOAUTOLEGEND */;
   scatter x=Height y=Weight;
   reg x=Height y=Weight / nomarkers;
   footnote J=L "Use the NOAUTOLEGEND option to suppress the legend";
run;
footnote;
You can use the NOAUTOLEGEND option on the PROC SGPLOT statement in SAS to suppress a legend

2. Choose which components appear in the legend

In some graphs that overlay multiple components, some components are self-explanatory and do not need to appear in the legend. You can choose which components appear in the legend by using the NAME= option on the statements and using the KEYLEGEND statement to specify the contents of the legend. For example, the following statements create a graph that consists of a scatter plot, a confidence ellipse, and a regression line. If you only want the confidence ellipse and regression line to appear in the legend, use the NAME= option to identify each component and use the KEYLEGEND statement to specify the contents of the legend:

title "Weight versus Height";
title2 "Overlay Least Squares Fit and Confidence Ellipse";
proc sgplot data=Sashelp.Class;
   scatter x=Height y=Weight / name="scatter";
   ellipse x=Height y=Weight / name="ellipse";
   reg x=Height y=Weight     / name="reg" nomarkers lineattrs=GraphData2;
   keylegend "reg" "ellipse"; /* list item in the order you want them */
run;
In the SGPLOT procedure in SAS, you can use the NAME= option to name components and the KEYLEGEND statement to specify which components should appear in a legend

3. Position the legend

The KEYLEGEND statement supports the LOCATION= and POSITION= options, which enable you to place the legend almost anywhere in the graph. The LOCATION= option controls whether the legend appears inside or outside of the graph area. The POSITION= option controls the placement of the legend on the graph (left, right, top, bottom,...). However, I can never remember which option controls which attribute! Therefore, I created a mnemonic, which I hope will help you remember, too:

  • The LOCATION= option contains the substring 'CAT'. A CAT likes to go INSIDE and OUTSIDE the house. Therefore, the valid keywords for the LOCATION= option are INSIDE and OUTSIDE.
  • The POSITION= option contains the substring 'SIT'. You can SIT on the LEFT or RIGHT side of a couch. (Also, "position" can be used as a verb to mean "place on a page.") Therefore, the valid keywords for the POSITION= option are BOTTOM, BOTTOMLEFT, BOTTOMRIGHT, LEFT, RIGHT, TOP, TOPLEFT, and TOPRIGHT. (Some other graphical elements support a CENTER position, but not the legend.)

The following graph is the same as in the previous example, except that the location of the legend is inside the graph area and the position of the legend is in the lower-right corner. When you move the legend to the left or right side of the graph, it is often useful to use the ACROSS=1 option to force the legend to list the items vertically. Similarly, if you position the legend at the top or bottom of a graph, you might want to use the DOWN=1 option to list the items horizontally.

   keylegend "reg" "ellipse" / location=inside position=bottomright across=1;
In the SGPLOT procedure in SAS, you can use the LOCATION= and POSITION= options to position a legend

4. Exclude items from a legend

When you use the GROUP= option to display groups, you might want to exclude some of the group categories from the legend. The KEYLEGEND statement supports the EXCLUDE= option that you can use to exclude certain items. Three situations come to mind:

  • The group levels contain missing values. You might want to exclude the missing values from the legend by using KEYLEGEND / EXCLUDE=(" ");.
  • The purpose of the graph is to focus on one or two subgroups. If so, it might make sense to label only those groups. For example, if the purpose of a graph is to show income disparity between blacks and whites, you might decide not to include Asians or Hispanics in the legend: EXCLUDE=("Asian" "Hispanic");.
  • The group is binary. If a graph shows the results of a clinical trial and the legend includes the marker shape for the patients who died, it should be clear that the other marker shape represents patients who survived: EXCLUDE=("Alive");. An example is shown below.
ods graphics / attrpriority=none;
title "Patient Status";
proc sgplot data=Sashelp.Heart(obs=200 where=(Systolic<=200));
   styleattrs datasymbols=(X CircleFilled);
   scatter x=Systolic y=Diastolic / group=Status;
   keylegend / exclude=("Alive");
run;
In the SGPLOT procedure in SAS, you can use the EXCLUDE= option on the KEYLEGEND statement to suppress a legend item, such as a category in a group

5. Customize items in a legend

The previous section shows how to exclude one or more levels in a categorical variable that is specified on the GROUP= option. You also might want to customize the items that appear in the legend in order to combine, for example, marker and line attributes. A situation where this comes up is when you want to overlay a group of curves on a scatter plot.

The LEGENDITEM statement (supported in SAS 9.4M5) enables you to specify what combination of markers and line patterns you want to appear for every item in a legend. It is a "super customization" statement that gives you complete control over the legend items.

The following statements show how to use the LEGENDITEM statement to create a customized legend. By default, if you use the REG statement with the GROUP= option, the legend will show only the colors and line patterns for the regression lines. In the following example, I have used the ATTRPRIORITY=NONE option to force the marker symbols to differ between groups. I want the legend to show not only the colors and patterns of the regression lines but also the marker symbols for each group:

/* ensure order of BP_Status is High, Normal, Optimal */
proc sort data=Sashelp.Heart(obs=200 where=(Systolic<=200)) out=Heart;
   by BP_Status;
run;
 
ods graphics / attrpriority=none;
title "Patients by Blood Pressure Status";
proc sgplot data=Heart;
   styleattrs datalinepatterns=(solid) ;
   reg x=Systolic y=Diastolic / group=BP_Status;   
   legenditem type=markerline name="H" /
      label="High" lineattrs=GraphData1 markerattrs=GraphData1; 
   legenditem type=markerline name="N" /
      label="Normal" lineattrs=GraphData2 markerattrs=GraphData2; 
   legenditem type=markerline name="O" /
      label="Optimal" lineattrs=GraphData3 markerattrs=GraphData3; 
   keylegend "O" "N" "H" / title="BP Status"; 
run;
In the SGPLOT procedure in SAS, you can use the LEGENDITEM statement to create customized items in a legend

In summary, PROC SGPLOT in SAS supports several ways to create, suppress, position, and customize the items in a legend. Do you have a favorite way to customize a legend in PROC SGPLOT? Leave a comment!

The post 5 tips for customizing legends in PROC SGPLOT in SAS appeared first on The DO Loop.

November 28, 2018
 

I remember the first time I used PROC GLM in SAS to include a classification effect in a regression model. I thought I had done something wrong because the parameter estimates table was followed by a scary-looking note:

Note: The X'X matrix has been found to be singular, and a generalized inverse 
      was used to solve the normal equations. Terms whose estimates are 
      followed by the letter 'B' are not uniquely estimable. 

Singular matrix? Generalized inverse? Estimates not unique? "What's going on?" I thought.

In spite of the ominous note, nothing is wrong. The note merely tells you that the GLM procedure has computed one particular estimate; other valid estimates also exist. This article explains what the note means in terms of the matrix computations that are used to estimate the parameters in a linear regression model.

The GLM parameterization is a singular parameterization

The note is caused by the fact that the GLM model includes a classification variable. Recall that a classification variable in a regression model is a categorical variable whose levels are used as explanatory variables. Examples of classification variables (called CLASS variables in SAS) are gender, race, and treatment. The levels of the categorical variable are encoded in the columns of a design matrix. The columns are often called dummy variables. The design matrix is used to form the "normal equations" for least squares regression. In terms of matrices, the normal equations are written as (X`*X)*b = X`*Y, where X is a design matrix, Y is the vector of observed responses, and b is the vector of parameter estimates, which must be computed.

There are many ways to construct a design matrix from classification variables. If X is a design matrix that has linearly dependent columns, the crossproduct matrix X`X is singular. Some ways of creating the design matrix always result in linearly dependent columns; these constructions are said to use a singular parameterization.

The simplest and most common parameterization encodes each level of a categorical variable by using a binary indicator column. This is known as the GLM parameterization. It is a singular parameterization because if X1, X2, ..., Xk are the binary columns that indicate the k levels, then Σ Xi = 1 for each observation.
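The following SAS/IML sketch illustrates the GLM parameterization. The DESIGN function creates one binary indicator column for each level of a (toy) classification variable; the row sums show that the columns always sum to 1, which is why the columns are linearly dependent.

proc iml;
levels = {1, 2, 3, 1, 2, 3, 3};   /* classification variable with 3 levels */
X = design(levels);               /* one indicator column per level */
rowSums = X[ ,+];                 /* every row sums to 1 */
print X rowSums;
quit;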

Not surprisingly, the GLM procedure in SAS uses the GLM parameterization. Here is an example that generates the "scary" note. The data are a subset of the Sashelp.Heart data set. The levels of the BP_Status variable are "Optimal", "Normal", and "High":

data Patients;
   set Sashelp.Heart;
   keep BP_Status Cholesterol;
   if ^cmiss(BP_Status, Cholesterol); /* discard any missing values */
run;
 
proc glm data=Patients plots=none;
   class BP_Status;
   model Cholesterol =  BP_Status / solution;
quit;

If you change the reference levels, you get a different estimate

If you have linearly dependent columns among the explanatory variables, the parameter estimates are not unique. The easiest way to see this is to change the reference level for a classification variable. In PROC GLM, you can use the REF=FIRST or REF=LAST option on the CLASS statement to change the reference level. However, the following example uses PROC GLMSELECT (without variable selection) because you can simultaneously use the OUTDESIGN= option to write the design matrix to a SAS data set. The first call writes the design matrix that PROC GLM uses (internally) for the default reference levels. The second call writes the design matrix for an alternate reference level:

/* GLMSELECT can fit the data and output a design matrix in one step */
title "Estimates for GLM Paremeterization";
title2 "Default (Last) Reference Levels";
ods select ParameterEstimates(persist);
proc glmselect data=Patients outdesign(fullmodel)=GLMDesign1;
   class BP_Status;
   model Cholesterol =  BP_Status / selection=none;  
quit;
 
/* Change reference levels. Different reference levels result in different estimates. */ 
title2 "Custom Reference Levels";
proc glmselect data=Patients outdesign(fullmodel)=GLMDesign2;
   class BP_Status(ref='Normal');
   model Cholesterol =  BP_Status / selection=none;  
quit;
ods select all;
 
/* compare a few rows of the design matrices */
proc print data=GLMDesign1(obs=10 drop=Cholesterol); run;
proc print data=GLMDesign2(obs=10 drop=Cholesterol); run;

The output shows that changing the reference level results in different parameter estimates. (However, the predicted values are identical for the two estimates.) If you use PROC PRINT to display the first few observations in each design matrix, you can see that the matrices are the same except for the order of two columns. Thus, if you have linearly dependent columns, the GLM estimates might depend on the order of the columns.

The SWEEP operator produces a generalized inverse that is not unique

You might wonder why the parameter estimates change when you change reference levels (or, equivalently, change the order of the columns in the design matrix). The mathematical answer is that there is a whole family of solutions that satisfy the (singular) regression equations, and from among the infinitely many solutions, the GLM procedure chooses the solution for which the estimate of the reference level is exactly zero.

Last week I discussed generalized inverses, including the SWEEP operator and the Moore-Penrose inverse. The SWEEP operator is used by PROC GLM to obtain the parameter estimates. The SWEEP operator produces a generalized inverse that is not unique. In particular, the SWEEP operator computes a generalized inverse that depends on the order of the columns in the design matrix.

The following SAS/IML statements read in the design matrices for each GLM parameterization and use the SWEEP function to reproduce the parameter estimates that are computed by the GLM procedure. For each design matrix, the program computes solutions to the normal equations (X`*X)*b = (X`*Y). The program also computes the Moore-Penrose solution for each design matrix.

proc iml;
/* read design matrix and form X`X and X`*Y */
use GLMDesign1; read all var _NUM_ into Z[c=varNames]; close;
p = ncol(Z);
X = Z[, 1:(p-1)];  Y = Z[, p];  vNames = varNames[,1:(p-1)];
A = X`*X;  c = X`*Y;
 
/* obtain G2 and Moore-Penrose solution for this design matrix */
Sweep1 = sweep(A)*c;
GInv1  = ginv(A)*c;
print Sweep1[r=vNames], GInv1;
 
/* read other design matrix and form X`X and X`*Y */
use GLMDesign2; read all var _NUM_ into Z[c=varNames]; close;
p = ncol(Z);
X = Z[, 1:(p-1)];  Y = Z[, p]; vNames = varNames[,1:(p-1)];
A = X`*X;  c = X`*Y;
 
/* obtain G2 and Moore-Penrose solution for this design matrix */
Sweep2 = sweep(A)*c;
GInv2 = ginv(A)*c;
print Sweep2[r=vNames], GInv2;

The results demonstrate that the SWEEP solution depends on the order of columns in a linearly dependent design matrix. However, the Moore-Penrose solution does not depend on the order. The Moore-Penrose solution is the same no matter which reference levels you choose for the GLM parameterization of classification effects.

In summary, the scary note that PROC GLM produces reminds you of the following mathematical facts:

  • When you include classification effects in a linear regression model and use the GLM parameterization to construct the design matrix, the design matrix has linearly dependent columns.
  • The X`X matrix is singular when X has linearly dependent columns. Consequently, the parameter estimates for least squares regression are not unique.
  • From among the infinitely many solutions to the normal equations, the solution that PROC GLM (and other SAS procedures) computes is based on a generalized inverse that is computed by using the SWEEP operator.
  • The solution obtained by the SWEEP operator depends on the reference levels for the CLASS variables.

If the last fact bothers you (it shouldn't), an alternative estimate is available by using the GINV function to compute the Moore-Penrose inverse. The corresponding estimate is unique and does not depend on the reference level.

The post Singular parameterizations, generalized inverses, and regression estimates appeared first on The DO Loop.

November 26, 2018
 
Funnel plot of the proportion of unimmunized students in NC kindergarten classes

Last week my colleague, Robert Allison, visualized data regarding immunization rates for kindergarten classes in North Carolina. One of his graphs was a scatter plot that displayed the proportion of unimmunized students versus the size of the class for 1,885 kindergarten classes in NC. This scatter plot is the basis for a statistical plot that is known as a funnel plot for proportions. The plot to the right shows a funnel plot that I created based on Robert's analysis.

The basic idea of a funnel plot is that small random samples have more variation than large random samples from the same population. If students are randomly chosen from a population in which some small proportion of children have an attribute, it might not be unusual if 40% of the students in a five-student class have the attribute (that's 2 out of 5) whereas it might be highly unusual to see such a high proportion in a 100-student class. The funnel plot enhances a scatter plot by adding curves that indicate a reasonable range of values for the proportion, given the size of a random sample.
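The curves in a funnel plot are often computed from a normal approximation to the variance of a sample proportion. The following DATA step is a minimal sketch of the upper curve; the overall proportion matches the statewide average that is reported later in this article, and the range of class sizes is illustrative.

data FunnelLimit;
   p0 = 0.0458;                            /* overall proportion */
   z  = quantile("Normal", 0.975);         /* 97.5th percentile of N(0,1) */
   do n = 5 to 200 by 5;                   /* class sizes */
      upper = p0 + z * sqrt(p0*(1-p0)/n);  /* upper limit for a proportion */
      output;
   end;
run;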

For a discussion of funnel plots and how to create a funnel plot in SAS, see the article "Funnel plots for proportions." You can download the immunization data and the SAS program that I used to create the funnel plot for proportions.

A funnel plot for proportions

The preceding funnel plot contains the following statistical features:
  • A scatter plot of the data. Each marker represents a school.
  • A horizontal reference line. The line indicates the average proportion for the data. For these data, the statewide average proportion of unimmunized kindergarteners is 4.58%.
  • A curve that indicates an upper confidence limit for the proportion of unimmunized students, assuming that the classes are a random sample from a population in which 4.58% of the students are unimmunized. When a marker (school) appears above the curve, the proportion of unimmunized students in that school is significantly higher than the statewide proportion.

This funnel plot uses the shape (and color) of a marker to indicate whether the school is public (circle), private (upward-pointing triangle), or a charter school (right-pointing triangle). The plot includes tool tips so that you can hover the mouse over a marker and see the name and county of the school.

The graph indicates that there are dozens of schools for which the proportion of unimmunized students far exceeds the state average. A graph like this enables school administrators and public health officials to identify schools that have a larger-than-expected proportion of unimmunized students. Identifying schools is the first step to developing initiatives that can improve the immunization rate in school-age children.

Funnel plots for each school district

You can use a WHERE statement or BY-group processing in PROC SGPLOT to create a funnel plot for each county. A graph that shows only the schools in a particular district is more relevant for local school boards and administrators. The following graphs show the proportion of unimmunized students in Mecklenburg County (near Charlotte) and Wake County (near Raleigh), which are the two largest school districts in NC.

Proportion of unimmunized students in Mecklenburg County kindergarten classes
Proportion of unimmunized students in Wake County kindergarten classes

The first graph shows that Mecklenburg County has several schools that contain more than 60 kindergarten students and for which 25% or more of the students are unimmunized. In fact, some large schools have more than 40% unimmunized! In Wake County, fewer schools have extreme proportions, but there are still many schools for which the proportion of unimmunized students is quite large relative to the statewide average.

As Robert pointed out in his blog post, these are not official figures, so it is possible that some of the extreme proportions are caused by data entry errors rather than by hordes of unimmunized students.

Funnel plots for public, private, and charter schools

The following graph shows a panel of plots for public, private, and charter schools. There are many public schools whose proportion of unimmunized students is well above the statewide average. For the private and charter schools, about 10 schools stand out in each group.

I think the plot of private schools is particularly interesting. When the media reports on outbreaks of childhood diseases in schools, there is often a mention of a "religious exemption," which is when a parent or guardian states that a child will not be immunized because of their religious beliefs. The report often mentions that private schools are often affiliated with a particular religion or church. I've therefore assumed that private schools have a larger proportion of unimmunized students because of the religious exemption. These data do not indicate which students are unimmunized because of a religious exemption, but the panel of funnel plots indicates that, overall, not many private schools have an abnormally large proportion of unimmunized students. In fact, the private schools show smaller deviations from the expected value than the public and charter schools.

Summary

In summary, I started with one of Robert Allison's graphs and augmented it to create a funnel plot for proportions. A funnel plot shows the average proportion and confidence limits for proportions (given the sample size). If the students in the schools were randomly sampled from a population where 4.58% of students are unimmunized, then few schools would be outside of the confidence curve. Of course, in reality, schools are not random samples. Many features of school performance—including unimmunized students—depend on local socioeconomic conditions. By taking into account the size of the classes, the funnel plot identifies schools that have an exceptionally large proportion of unimmunized students. A funnel plot can help school administrators and public health officials identify schools that might benefit from intervention programs and additional resources, or for which the data were incorrectly entered.

The post A funnel plot for immunization rates appeared first on The DO Loop.

November 21, 2018
 

A data analyst asked how to compute parameter estimates in a linear regression model when the underlying data matrix is rank deficient. This situation can occur if one of the variables in the regression is a linear combination of other variables. It also occurs when you use the GLM parameterization of a classification variable. In the GLM parameterization, the columns of the design matrix are linearly dependent. As a result, the matrix of crossproducts (the X`X matrix) is singular. In either case, you can understand the computation of the parameter estimates by learning about generalized inverses in linear systems. This article presents an overview of generalized inverses. A subsequent article will specifically apply generalized inverses to the problem of estimating parameters for regression problems with collinearities.

What is a generalized inverse?

Recall that the inverse matrix of a square matrix A is a matrix G such that A*G = G*A = I, where I is the identity matrix. When such a matrix exists, it is unique and A is said to be nonsingular (or invertible). If there are linear dependencies in the columns of A, then an inverse does not exist. However, you can define a series of weaker conditions that are known as the Penrose conditions:

  1. A*G*A = A
  2. G*A*G = G
  3. (A*G)` = A*G
  4. (G*A)` = G*A

Any matrix, G, that satisfies the first condition is called a generalized inverse (or sometimes a "G1" inverse) for A. A matrix that satisfies the first and second condition is called a "G2" inverse for A. The G2 inverse is used in statistics to compute parameter estimates for regression problems (see Goodnight (1979), p. 155). A matrix that satisfies all four conditions is called the Moore-Penrose inverse or the pseudoinverse. When A is square but singular, there are infinitely many matrices that satisfy the first two conditions, but the Moore-Penrose inverse is unique.
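If you want to convince yourself that a computed inverse satisfies the conditions, you can check them numerically. The following SAS/IML sketch verifies the first and third Penrose conditions for the Moore-Penrose inverse of a small singular matrix:

proc iml;
A = {1 2,
     2 4};                       /* rank-1, hence singular */
G = ginv(A);                     /* Moore-Penrose pseudoinverse */
cond1 = max(abs(A*G*A - A));     /* condition 1: should be ~0 */
cond3 = max(abs((A*G)` - A*G));  /* condition 3: should be ~0 */
print cond1 cond3;
quit;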

Computations with generalized inverses

In regression problems, the parameter estimates are obtained by solving the "normal equations." The normal equations are the linear system (X`*X)*b = (X`*Y), where X is the design matrix, Y is the vector of observed responses, and b is the vector of parameter estimates to be computed. The matrix A = X`*X is symmetric. If the columns of the design matrix are linearly dependent, then A is singular. The following SAS/IML program defines a symmetric singular matrix A and a right-hand-side vector c, which you can think of as X`*Y in the regression context. The call to the DET function computes the determinant of the matrix. A zero determinant indicates that A is singular and that there are infinitely many vectors b that solve the linear system:

proc iml;
A = {100  50 20 10,
      50 106 46 23,
      20  46 56 28,
      10  23 28 14 };
c = {130, 776, 486, 243};
 
det = det(A);         /* demonstrate that A is singular */
print det;

For nonsingular matrices, you can use either the INV or the SOLVE functions in SAS/IML to solve for the unique solution of the linear system. However, both functions give errors when called with a singular matrix. SAS/IML provides several ways to compute a generalized inverse, including the SWEEP function and the GINV function. The SWEEP function is an efficient way to use Gaussian elimination to solve the symmetric linear systems that arise in regression. The GINV function computes the Moore-Penrose pseudoinverse. The following SAS/IML statements compute two different solutions for the singular system A*b = c:

b1 = ginv(A)*c;       /* solution even if A is not full rank */
b2 = sweep(A)*c;
print b1 b2;

The SAS/IML language also provides a way to obtain any of the other infinitely many solutions to the singular system A*b = c. Because A is a rank-3 matrix, it has a one-dimensional kernel (null space). The HOMOGEN function in SAS/IML computes a basis for the null space. That is, it computes a vector that is mapped to the zero vector by A. The following statements compute the unit basis vector for the kernel. The output shows that the vector is mapped to the zero vector:

xNull = homogen(A);   /* basis for nullspace of A */
print xNull (A*xNull)[L="A*xNull"];

All solutions to A*b = c are of the form b + α*xNull, where b is any particular solution.

Properties of the Moore-Penrose solution

You can verify that the Moore-Penrose matrix GINV(A) satisfies the four Penrose conditions, whereas the G2 inverse (SWEEP(A)) only satisfies the first two conditions. I mentioned that the singular system has infinitely many solutions, but the Moore-Penrose solution (b1) is unique. It turns out that the Moore-Penrose solution is the solution that has the smallest Euclidean norm. Here is a computation of the norm for three solutions to the system A*b = c:

/* GINV gives the estimate that has the smallest L2 norm: */
GINVnorm  = norm(b1);
sweepNorm = norm(b2);
/* you can add alpha*xNull to any solution to get another solution */
b3 = b1 + 2*xNull;  /* here's another solution (alpha=2) */
otherNorm = norm(b3);
print ginvNorm sweepNorm otherNorm;

Because all solutions are of the form b1 + α*xNull, where xNull is the basis for the nullspace of A, you can graph the norm of the solutions as a function of α. The graph is shown below and indicates that the Moore-Penrose solution is the minimal-norm solution:

alpha = do(-2.5, 2.5, 0.05);
norm = j(1, ncol(alpha), .);
do i = 1 to ncol(alpha);
   norm[i] = norm(b1 + alpha[i] * xNull);
end;
title "Euclidean Norm of Solutions b + alpha*xNull";
title2 "b = Solution from Moore-Penrose Inverse";
title3 "xNull = Basis for Nullspace of A";
call series(alpha, norm) 
     other="refline 0 / axis=x label='b=GINV';refline 1.78885 / axis=x label='SWEEP';";
Graph of norm of solutions to the singular system A*b=c. The norm is plotted for vectors b + alpha*x_Null where b is the Moore-Penrose solution and x_Null is a basis for the nullspace of A.

In summary, a singular linear system has infinitely many solutions. You can obtain a particular solution by using the sweep operator or by finding the Moore-Penrose solution. You can use the HOMOGEN function to obtain the full family of solutions. The Moore-Penrose solution is expensive to compute but has an interesting property: it is the solution that has the smallest Euclidean norm. The sweep solution is more efficient to compute and is used in SAS regression procedures such as PROC GLM to estimate parameters in models that include classification variables and use a GLM parameterization. The next blog post explores regression estimates in more detail.

The post Generalized inverses for matrices appeared first on The DO Loop.

November 19, 2018
 

You might know that you can use the ODS SELECT statement to display only some of the tables and graphs that are created by a SAS procedure. But did you know that you can use a WHERE clause on the ODS SELECT statement to display tables that match a pattern? This article shows how to use wildcards, regular expressions, and pattern matching to select ODS tables in SAS.

ODS SELECT: Filter the output from SAS procedures

A SAS procedure might produce a dozen or more tables. You might be interested in displaying a subset of those tables. Recall that you can use the ODS TRACE ON statement to obtain a list of all the tables and graphs that a procedure creates. You can then use the ODS SELECT or the ODS EXCLUDE statement to control which tables and graphs are displayed.

Here's an example from the SAS/STAT documentation. The following PROC LOGISTIC call creates 27 tables and graphs, most of which are related to ROC curves. The ODS TRACE ON statement displays the names of each output object in the SAS log:

data roc;
   input alb tp totscore popind @@;
   totscore = 10 - totscore;
   datalines;
3.0 5.8 10 0   3.2 6.3  5 1   3.9 6.8  3 1   2.8 4.8  6 0
3.2 5.8  3 1   0.9 4.0  5 0   2.5 5.7  8 0   1.6 5.6  5 1
3.8 5.7  5 1   3.7 6.7  6 1   3.2 5.4  4 1   3.8 6.6  6 1
4.1 6.6  5 1   3.6 5.7  5 1   4.3 7.0  4 1   3.6 6.7  4 0
2.3 4.4  6 1   4.2 7.6  4 0   4.0 6.6  6 0   3.5 5.8  6 1
3.8 6.8  7 1   3.0 4.7  8 0   4.5 7.4  5 1   3.7 7.4  5 1
3.1 6.6  6 1   4.1 8.2  6 1   4.3 7.0  5 1   4.3 6.5  4 1
3.2 5.1  5 1   2.6 4.7  6 1   3.3 6.8  6 0   1.7 4.0  7 0
3.7 6.1  5 1   3.3 6.3  7 1   4.2 7.7  6 1   3.5 6.2  5 1
2.9 5.7  9 0   2.1 4.8  7 1   2.8 6.2  8 0   4.0 7.0  7 1
3.3 5.7  6 1   3.7 6.9  5 1   3.6 6.6  5 1
;
 
ods graphics on;
ods trace on;
proc logistic data=roc;
   model popind(event='0') = alb tp totscore / nofit;
   roc 'Albumin' alb;
   roc 'K-G Score' totscore;
   roc 'Total Protein' tp;
   roccontrast reference('K-G Score') / estimate e;
run;

The SAS log displays the names of the tables and graphs. A portion of the log is shown below:

Output Added:
-------------
Name:       OddsRatios
Label:      Odds Ratios
Template:   Stat.Logistic.OddsRatios
Path:       Logistic.ROC3.OddsRatios
-------------

Output Added:
-------------
Name:       ROCCurve
Label:      ROC Curve
Template:   Stat.Logistic.Graphics.ROC
Path:       Logistic.ROC3.ROCCurve
-------------

Output Added:
-------------
Name:       ROCOverlay
Label:      ROC Curves
Template:   Stat.Logistic.Graphics.ROCOverlay
Path:       Logistic.ROCComparisons.ROCOverlay
-------------

Output Added:
-------------
Name:       ROCAssociation
Label:      ROC Association Statistics
Template:   Stat.Logistic.ROCAssociation
Path:       Logistic.ROCComparisons.ROCAssociation
-------------

Output Added:
-------------
Name:       ROCContrastCoeff
Label:      ROC Contrast Coefficients
Template:   Stat.Logistic.ROCContrastCoeff
Path:       Logistic.ROCComparisons.ROCContrastCoeff
-------------

Only a few of the 27 ODS objects are shown here. Notice that each ODS object has four properties: a name, a label, a template, and a path. Most of the time, the name is used on the ODS SELECT statement to filter the output. For example, if you want to display only the ROC curves and the overlay of the ROC curves, you can put the following statement prior to the RUN statement in the procedure:

   ods select ROCCurve ROCOverlay;     /* specify the names literally */

Use a WHERE clause in the ODS SELECT statement

Often the ODS objects that you want to display are related to each other. In the LOGISTIC example, you might want to display all the information about ROC curves. Fortunately, the SAS developers often use a common prefix or suffix, such as 'ROC', in the names of the ODS objects. That means that you can display all ROC-related tables and graphs by selecting the ODS objects whose name (or path) contains 'ROC' as a substring.

You can use the WHERE clause to select ODS objects whose name (or label or path) matches a particular pattern. The object's name is available in a special variable named _NAME_. Similarly, the object's label and path are available in variables named _LABEL_ and _PATH_, respectively. You cannot match patterns in the template string; there is no _TEMPLATE_ variable.

In SAS, the following operators and functions are useful for matching strings:

  • The CONTAINS keyword matches strings that contain a specified substring. The question mark (?) is an equivalent way to specify the CONTAINS operator.
  • The LIKE keyword matches strings to a pattern. The underscore (_) is a wildcard that matches any single character. The percent sign (%) is a wildcard that matches zero or more characters.
  • The "begins with" operators (=: and in:) match strings that begin with a specified pattern or with any pattern in a set, respectively.
  • SAS functions such as FIND, INDEX, and SUBSTR can be used to match patterns.
  • The SAS PRXMATCH function enables you to use Perl regular expressions to match patterns.

For example, the following statements select ODS tables and graphs from the previous PROC LOGISTIC call. You can put one of these statements before the RUN statement in the procedure:

   /* use any one of the following statements inside the PROC LOGISTIC call */
   ods select where=(_name_ =: 'ROC');                 /* name starts with 'ROC' */
   ods select where=(_name_ like 'ROC%');              /* name starts with 'ROC' */
   ods select where=(_path_ ? 'ROC');                  /* path contains 'ROC' */
   ods select where=(_label_ ? 'ROC');                 /* label contains 'ROC' */
   ods select where=(_name_ in: ('Odds', 'ROC'));      /* name starts with 'Odds' or 'ROC' */
   ods select where=(substr(_name_,4,8)='Contrast');   /* name has substring 'Contrast' at position 4 */
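
You can also use the SAS functions from the list above. For example, here is a sketch that uses the FIND function, which acts like the CONTAINS operator:

   ods select where=(find(_name_, 'ROC') > 0);         /* name contains 'ROC' anywhere */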

For additional examples of using pattern matching to select ODS objects, see Warren Kuhfeld's graphics-focused blog post and the section of the SAS/STAT User's Guide that discusses selecting ODS graphics.

Use PRXMATCH to match regular expressions

Although the CONTAINS and LIKE operators are often sufficient for selecting a table, SAS provides the powerful PRXMATCH function for more complex pattern-matching tasks. The PRXMATCH function uses Perl regular expressions to match strings. SAS provides a Perl Regular Expression "cheat sheet" that summarizes the syntax and common search queries for the PRXMATCH function.

You can put any of the following statements inside the PROC LOGISTIC call:

   /* use any one of the following PRXMATCH expressions inside the PROC LOGISTIC call */
   ods select where=(prxmatch('/ROC/',       _name_)); /* name contains 'ROC' anywhere */
   ods select where=(prxmatch('/^ROC/',      _name_)); /* name starts with 'ROC' */
   ods select where=(prxmatch('/Odds|^ROC/', _name_)); /* name contains 'Odds' anywhere or 'ROC' at the beginning */
   ods select where=(prxmatch('/ROC/',      _name_)=0);     /* name does NOT contain 'ROC' anywhere */
   ods select where=(prxmatch('/Logistic\.ROC2/', _path_)); /* escape special wildcard character '.' */
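
The same WHERE= syntax works on the ODS EXCLUDE statement. For example, the following sketch suppresses all ROC-related objects and displays everything else:

   ods exclude where=(_name_ ? 'ROC');                 /* display everything EXCEPT ROC-related objects */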

In summary, the WHERE= option on the ODS SELECT (and ODS EXCLUDE) statement is quite powerful. Many SAS programmers know how to list the names of tables and graphs on the ODS SELECT statement to display only a subset of the output. However, the WHERE= option enables you to use wildcards and regular expressions to select objects whose names or paths match a certain pattern. This can be a quick and efficient way to select tables that are related to each other and share a common prefix or suffix in their name.

The post Select ODS tables by using wildcards and regular expressions in SAS appeared first on The DO Loop.

November 14, 2018
 

An ROC curve graphically summarizes the tradeoff between true positives and true negatives for a rule or model that predicts a binary response variable. An ROC curve is a parametric curve that is constructed by varying the cutpoint value at which estimated probabilities are considered to predict the binary event. Most SAS data analysts know that you can fit a logistic model in PROC LOGISTIC and create an ROC curve for that model, but did you know that PROC LOGISTIC enables you to create and compare ROC curves for ANY vector of predicted probabilities regardless of where the predictions came from? This article shows how!

If you want to review the basic constructions of an ROC curve, you can see a previous article that constructs an empirical ROC curve from first principles. The PROC LOGISTIC documentation provides formulas used for constructing an ROC curve.

Produce an ROC plot by using PROC LOGISTIC

Before discussing how to create an ROC plot from an arbitrary vector of predicted probabilities, let's review how to create an ROC curve from a model that is fit by using PROC LOGISTIC. The following data and model are taken from the PROC LOGISTIC documentation. The data are for 43 cancer patients who also had an intestinal obstruction. The response variable popInd is a postoperative indicator variable: popInd = 1 for patients who died within two months after surgery. The explanatory variables are three pre-operative screening tests. The goal of the study is to determine patients who might benefit from surgery, where "benefit" is measured by postoperative survival of at least two months.

data roc;
   input alb tp totscore popind @@;
   totscore = 10 - totscore;
   datalines;
3.0 5.8 10 0   3.2 6.3  5 1   3.9 6.8  3 1   2.8 4.8  6 0
3.2 5.8  3 1   0.9 4.0  5 0   2.5 5.7  8 0   1.6 5.6  5 1
3.8 5.7  5 1   3.7 6.7  6 1   3.2 5.4  4 1   3.8 6.6  6 1
4.1 6.6  5 1   3.6 5.7  5 1   4.3 7.0  4 1   3.6 6.7  4 0
2.3 4.4  6 1   4.2 7.6  4 0   4.0 6.6  6 0   3.5 5.8  6 1
3.8 6.8  7 1   3.0 4.7  8 0   4.5 7.4  5 1   3.7 7.4  5 1
3.1 6.6  6 1   4.1 8.2  6 1   4.3 7.0  5 1   4.3 6.5  4 1
3.2 5.1  5 1   2.6 4.7  6 1   3.3 6.8  6 0   1.7 4.0  7 0
3.7 6.1  5 1   3.3 6.3  7 1   4.2 7.7  6 1   3.5 6.2  5 1
2.9 5.7  9 0   2.1 4.8  7 1   2.8 6.2  8 0   4.0 7.0  7 1
3.3 5.7  6 1   3.7 6.9  5 1   3.6 6.6  5 1
;
 
ods graphics on;
proc logistic data=roc plots(only)=roc;
   LogisticModel: model popind(event='0') = alb tp totscore;
   output out=LogiOut predicted=LogiPred;       /* output predicted value, to be used later */
run;
ROC curve for linear logistic model fitted in PROC LOGISTIC in SAS

You can see the documentation for details about how to interpret the output from PROC LOGISTIC, but the example shows that you can use the PLOTS=ROC option (or the ROC statement) to create an ROC curve for a model that is fit by PROC LOGISTIC. For this model, the area under the ROC curve is 0.77. Because a random "coin flip" prediction has an expected area of 0.5, this model predicts the survival of surgery patients better than random chance.

Create an ROC curve for any prediction rule

A logistic model is not the only way to predict a binary response. You could also use a decision tree, a generalized mixed model, a nonparametric regression model, or even ask a human expert for her opinion. An ROC curve only requires two quantities: for each observation, you need the observed binary response and a predicted probability. In fact, if you carefully read the PROC LOGISTIC documentation, you will find these sentences:

  • In the "Details" section: "ROC curves can be created ... from the specified model in the MODEL statement, from specified models in ROC statements, or from input variables which act as [predicted probabilities]." (Emphasis added.)
  • In the documentation of the ROC statement: "The PRED= option enables you to input a criterion produced outside PROC LOGISTIC; for example, you can fit a random-intercept model by using PROC GLIMMIX or use survey weights in PROC SURVEYLOGISTIC, then use the predicted values from those models to produce an ROC curve for the comparisons."

In other words, you can use PROC LOGISTIC to create an ROC curve regardless of how the predicted probabilities are obtained! For argument's sake, let's suppose that you ask a human expert to predict the probability of each patient surviving for at least two months after surgery. (Notice that there is no statistical model here, only a probability for each patient.) The following SAS DATA step defines the predicted probabilities, which are then merged with the output from the earlier PROC LOGISTIC call:

data ExpertPred;
   input ExpertPred @@;
   datalines;
0.95 0.2  0.05 0.3  0.1  0.6  0.8  0.5 
0.1  0.25 0.1  0.2  0.05 0.1  0.05 0.1 
0.4  0.1  0.2  0.25 0.4  0.7  0.1  0.1 
0.3  0.2  0.1  0.05 0.1  0.4  0.4  0.7
0.2  0.4  0.1  0.1  0.9  0.7  0.8  0.25
0.3  0.1  0.1 
;
data Survival;
   merge LogiOut ExpertPred;
run;
 
/* create ROC curve from a variable that contains predicted values */
proc logistic data=Survival;
   model popind(event='0') = ExpertPred / nofit;
   roc 'Expert Predictions' pred=ExpertPred;
   ods select ROCcurve;
run;
ROC curve from external predictions, create with PROC LOGISTIC in SAS

Notice that you only need to supply two variables on the MODEL statement: the observed responses and the variable that contains the predicted values. The NOFIT option tells the procedure not to fit a model. On the ROC statement, I've used the PRED= option to specify that the ExpertPred variable contains the predicted probabilities. Although PROC LOGISTIC creates many tables, I've used the ODS SELECT statement to suppress all output except for the ROC curve.

Overlay and compare ROC curves from different models or rules

You might want to overlay and compare ROC curves from multiple predictive models (either from PROC LOGISTIC or from other sources). PROC LOGISTIC can do that as well. You just need to merge the various predicted probabilities into a single SAS data set and then specify multiple ROC statements, as follows:

/* overlay two or more ROC curves by using variables of predicted values */
proc logistic data=Survival;
   model popind(event='0') = LogiPred ExpertPred / nofit;
   roc 'Logistic' pred=LogiPred;
   roc 'Expert'   pred=ExpertPred;
   ods select ROCOverlay;
   /* optional: for a statistical comparison, use ROCCONTRAST stmt and remove the ODS SELECT stmt */
   *roccontrast reference('Expert') / estimate e;
run;
Compare ROC curves by using PROC LOGISTIC in SAS to overlay the ROC curves

This ROC overlay shows that the "expert" prediction is almost always superior or equivalent to the logistic model in terms of true and false classification rates. As noted in the comments of the previous call to PROC LOGISTIC, you can use the ROCCONTRAST statement to obtain a statistical analysis of the difference between the areas under the curves (AUC).
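
Here is a sketch of that comparison, which reuses the previous call but enables the ROCCONTRAST statement. The label in the REFERENCE() option must match one of the labels on the ROC statements:

proc logistic data=Survival;
   model popind(event='0') = LogiPred ExpertPred / nofit;
   roc 'Logistic' pred=LogiPred;
   roc 'Expert'   pred=ExpertPred;
   roccontrast reference('Expert') / estimate e;   /* test whether the AUC difference is zero */
run;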

In summary, you can use the ROC statement in PROC LOGISTIC to generate ROC curves for models that were computed outside of PROC LOGISTIC. All you need are the predicted probabilities and observed response for each observation. You can also overlay and compare two or more ROC curves and use the ROCCONTRAST statement to analyze the difference between areas under the curves.

The post Create and compare ROC curves for any predictive model appeared first on The DO Loop.