2月 272019
 

In this post, we continue our discussion of geography variables, the foundation of Visual Analytics Geo maps. This time we will look at Custom Coordinates.  As with any statistical graph, understanding your data is key.  But when using Custom Coordinates for geographic maps, this understanding becomes even more important.

Use the Custom Coordinate geography variable when your data does not match one of VA’s predefined geography types (see previous post, Fundamentals of SAS Visual Analytics geo maps).  For Custom coordinates, your data set must include latitude and longitude values as separate variables.   These values should be sourced from trustworthy providers and validated for accuracy prior to loading into VA.

When using Custom Coordinates, the Coordinate Space must also be considered.  The coordinate space defines the grid used to plot your data.  The underlying map is also based on a grid.  In order for your data to display correctly on a map, these grids must match.  Visual Analytics uses the World Geodetic System (WGS84) as the default coordinate space (grid).  This will work for most scenarios, including the example below.

Once you have selected a dataset and confirmed it contains the required spatial information, you can now create a Custom Geography variable.  In this example, I am using the variable Business Address from the dataset Wake_Co_Pizza.  Let’s get started.

  1. Begin by opening VA and navigate to the Data panel on the left of the application.
  2. Select the dataset and locate the variable that you wish to map. Click the down arrow to the right of the variable and chose ‘Geography’ from the Classification dropdown menu.
  3. The ‘Edit Geography Item’ window appears. Select Custom coordinates in the ‘Geography data type’ dropdown.   Three new dropdown lists appear that are specific to the Custom coordinates data type: ‘Latitude (y)’, ‘Longitude (x)’ and ‘Coordinate Space’.

When using the Custom coordinates data type, we must tell VA where to find the spatial data in our dataset.  We do this using the Latitude (y) and Longitude (x) dropdown lists.  They contain all measures from your dataset.  In this example, the variable ‘Latitude World Geodetic System’ contains our latitude values and the variable  ‘Longitude World Geodetic System’ contains our longitude values.   The ‘Coordinate Space’ dropdown defaults to World Geodetic System (WGS84) and is the correct choice for this example.

  1. Click the OK button to complete the setup once the latitude and longitude variables have been selected from their respective dropdown lists. You should see a new ‘Geography’ section in the Data panel.  The name of the variable (or its edited value) will be displayed beside a globe icon to indicate it is a geography variable.  In this case we see the variable Business Address.

 

Congratulations!  You have now created a custom geography variable and are ready to display it on a map.  To do this, simply drag it from the Data panel and drop it on the report canvas.  The auto-map feature of VA will recognize it as a geography variable and display the data as a bubble map with an OpenStreetMap background.

In this post, we created a custom geography variable using the default Coordinate Space.  Using a custom geography variable gives you the flexibility of mapping data sets that contain valid latitude and longitude values.  Next time, we will take our exploration of the geography variable one step further and explore using custom polygons in your maps.

Using Custom Coordinates for map creation in SAS Visual Analytics was published on SAS Users.

2月 272019
 

Box plots are a great way to compare the distributions of several subpopulations of your data. For example, box plots are often used in clinical studies to visualize the response of patients in various cohorts. This article describes three techniques to visualize responses when the cohorts have a nested or hierarchical structure, such as experimental treatments nested inside of clinics. The techniques are:

  • Use PROC SGPLOT (or PROC SGPANEL): Use the VBOX statement to visualize the nested structure. You can use the CATEGORY= option to specify the "outer" variable and the GROUP= option to specify the "inner" (nested) variable.
  • Use PROC GLM: For a model that has exactly two categorical variables, one nested in the other, PROC GLM automatically creates a nested box plot.
  • Use PROC BOXPLOT: You can use PROC BOXPLOT to create a nested box plot. The procedure supports several options that can enhance the visualization.

An example of nested data: Leaves on plants

Did you know that turnip greens are an excellent source of calcium? The following example is from the PROC NESTED documentation and is based on data analyzed in Snedecor and Cochran (Statistical Methods, 6th ed., 1967, p. 286). The original data is from an experiment in which four random turnip green plants are selected, then three leaves are randomly selected on each plant. From each leaf, two 100-mg samples were selected and used to determine the amount of calcium. (The units for calcium are percentage of dry weight.) Because the box plot for a two-element sample is trivial, I made up four additional fake measurements of calcium for each leaf, as shown in the following DATA step:

/* PROC NESTED example. First two measurements are real data. The last four are fake. */
data Turnip;
do Plant=1 to 4;
   do Leaf=1 to 3;
      do Sample=1 to 6;
         input Calcium @@;  output;
      end;
   end;
end;
/* 
|--REAL--|  |----- FAKE ------|  */
datalines;
3.28 3.09   3.26 3.19 3.27 3.01 
3.52 3.48   3.53 3.47 3.50 3.49 
2.88 2.80   2.98 2.70 2.28 2.81 
2.46 2.44   2.58 2.30 2.61 2.48 
1.87 1.92   1.97 1.90 1.86 1.90 
2.19 2.19   2.21 2.17 2.23 2.19 
2.77 2.66   2.79 2.63 2.72 2.69 
3.74 3.44   3.75 3.45 3.73 3.34 
2.55 2.55   2.59 2.51 2.49 2.61 
3.78 3.87   3.79 3.81 3.88 3.76 
4.07 4.12   4.23 4.27 4.01 4.08 
3.31 3.31   3.30 3.34 3.33 3.30 
;
 
title 'Calcium Concentration in Turnip Leaves';
title2 'Leaves Nested Within Plants';
footnote J=L 'Based on Snedecor and Cochran (1967, p. 286)';
proc sgplot data=Turnip;
   vbox Calcium / Group=Leaf category=Plant;
   xaxis discreteorder=data;
run;
Nested boxplots created by PROC SGPLOT in SAS

This simple visualization uses the VBOX statement in PROC SGPLOT. As I explained previously, you can use the CATEGORY= and GROUP= options to display the distribution of calcium for the joint levels of the two categorical variables. The CATEGORY= option specifies the horizontal variable; the GROUP= option specifies the levels of a second variable. In this case, the levels of the LEAF variable are nested inside the levels of the PLANT variable. The result is shown. Colors are used to identify the level of the GROUP= variable.

If there were additional levels of nesting (for examples, multiple farms), you could use the SGPANEL procedure and include the additional variables on the PANELBY statement.

Visualize a nested ANOVA model

You can improve the previous graph by visually dividing the leaves in one plant from the leaves in another. You can use PROC GLM to automatically display the divisions when you analyze a model for which the explanatory categorical variables are nested. The following call to PROC GLM specifies a nested mode. The "NestPlot" graph is created automatically when you use ODS GRAPHICS:

ods graphics on;
proc glm data=Turnip;
   class Plant Leaf;
   model Calcium = Leaf(Plant);   /* Leaf nested in Plant */
quit;
Nested boxplots created by PROC GLM in SAS

As you can see, the graph is the same except that it contains vertical lines that divide one plant from another. I like this version of the graph better.

Box plots for independent units

If you think about it, the color in the previous plot is not necessary. You might even consider it misleading because the leaf values are unrelated across plants. "Leaf 1" for "Plant 1" has no relationship to "Leaf 1" for the other plants. To emphasize that fact, you could relabel the leaves by using the values 1 through 12, where leaves 1–3 are from Plant=1, leaves 4–6 are from Plant=2, and so forth.

Labeling the leaves that way is necessary if you want to use to BOXPLOT procedure to visualize the data. The BOXPLOT procedure supports nested categories and the syntax is similar to the GLM syntax:

data Turnip2;
set Turnip;
LeafID = Leaf + (Plant-1)*3;   /* label samples 1, 2, 3, ..., 12 */
run;
 
proc boxplot data=Turnip2;
   plot Calcium*LeafID(Plant) / odstitle=title odstitle2=title2 odsfootnote=footnote;
run;
Nested boxplots created by PROC BOXPLOT in SAS

The output from the BOXPLOT procedure uses column headers to indicate the plants. This is similar to the visualization that I used to label categories for tropical storms and hurricanes.

You can use three features of SAS/STAT 15.1 (SAS 9.4M6) to add vertical reference lines, to color the background, and to center the plant labels in the column headers. The options are the BLOCKREF option, the BLOCKREFFILL option, and the BLOCKVALUEPOS= option, respectively.

/* Options for extending the column headers into the plot region.
   These options require SAS/STAT 15.1 */
proc boxplot data=Turnip2;
   plot Calcium*LeafID(Plant) / odstitle=title odstitle2=title2 odsfootnote=footnote
                      blockref blockreffill blockvaluepos=center;
run;
Nested boxplots created by PROC BOXPLOT in SAS

I like this graph because it uses colors sparingly and does not require a legend. Also, if the number of samples is large (such as 30 or 50), PROC BOXPLOT will automatically produce a series of graphs, each displaying a portion of the data.

In summary, this article shows three ways to create box plots of responses for nested categorical data. The simplest is to use the VBOX statement in PROC SGPLOT, although if there are additional categorical variables in the model, you can create a lattice of box plots by using the PANELBY statement in PROC SGPANEL. The GLM procedure enables you to perform an ANOVA analysis for the data and also create a visualization. The BOXPLOT procedure provides nice headers and (in SAS/STAT 15.1) colored strips for each level of the "outer" categories.

The post 3 ways to create nested box plots in SAS appeared first on The DO Loop.

2月 272019
 

Box plots are a great way to compare the distributions of several subpopulations of your data. For example, box plots are often used in clinical studies to visualize the response of patients in various cohorts. This article describes three techniques to visualize responses when the cohorts have a nested or hierarchical structure, such as experimental treatments nested inside of clinics. The techniques are:

  • Use PROC SGPLOT (or PROC SGPANEL): Use the VBOX statement to visualize the nested structure. You can use the CATEGORY= option to specify the "outer" variable and the GROUP= option to specify the "inner" (nested) variable.
  • Use PROC GLM: For a model that has exactly two categorical variables, one nested in the other, PROC GLM automatically creates a nested box plot.
  • Use PROC BOXPLOT: You can use PROC BOXPLOT to create a nested box plot. The procedure supports several options that can enhance the visualization.

An example of nested data: Leaves on plants

Did you know that turnip greens are an excellent source of calcium? The following example is from the PROC NESTED documentation and is based on data analyzed in Snedecor and Cochran (Statistical Methods, 6th ed., 1967, p. 286). The original data is from an experiment in which four random turnip green plants are selected, then three leaves are randomly selected on each plant. From each leaf, two 100-mg samples were selected and used to determine the amount of calcium. (The units for calcium are percentage of dry weight.) Because the box plot for a two-element sample is trivial, I made up four additional fake measurements of calcium for each leaf, as shown in the following DATA step:

/* PROC NESTED example. First two measurements are real data. The last four are fake. */
data Turnip;
do Plant=1 to 4;
   do Leaf=1 to 3;
      do Sample=1 to 6;
         input Calcium @@;  output;
      end;
   end;
end;
/* 
|--REAL--|  |----- FAKE ------|  */
datalines;
3.28 3.09   3.26 3.19 3.27 3.01 
3.52 3.48   3.53 3.47 3.50 3.49 
2.88 2.80   2.98 2.70 2.28 2.81 
2.46 2.44   2.58 2.30 2.61 2.48 
1.87 1.92   1.97 1.90 1.86 1.90 
2.19 2.19   2.21 2.17 2.23 2.19 
2.77 2.66   2.79 2.63 2.72 2.69 
3.74 3.44   3.75 3.45 3.73 3.34 
2.55 2.55   2.59 2.51 2.49 2.61 
3.78 3.87   3.79 3.81 3.88 3.76 
4.07 4.12   4.23 4.27 4.01 4.08 
3.31 3.31   3.30 3.34 3.33 3.30 
;
 
title 'Calcium Concentration in Turnip Leaves';
title2 'Leaves Nested Within Plants';
footnote J=L 'Based on Snedecor and Cochran (1967, p. 286)';
proc sgplot data=Turnip;
   vbox Calcium / Group=Leaf category=Plant;
   xaxis discreteorder=data;
run;
Nested boxplots created by PROC SGPLOT in SAS

This simple visualization uses the VBOX statement in PROC SGPLOT. As I explained previously, you can use the CATEGORY= and GROUP= options to display the distribution of calcium for the joint levels of the two categorical variables. The CATEGORY= option specifies the horizontal variable; the GROUP= option specifies the levels of a second variable. In this case, the levels of the LEAF variable are nested inside the levels of the PLANT variable. The result is shown. Colors are used to identify the level of the GROUP= variable.

If there were additional levels of nesting (for examples, multiple farms), you could use the SGPANEL procedure and include the additional variables on the PANELBY statement.

Visualize a nested ANOVA model

You can improve the previous graph by visually dividing the leaves in one plant from the leaves in another. You can use PROC GLM to automatically display the divisions when you analyze a model for which the explanatory categorical variables are nested. The following call to PROC GLM specifies a nested mode. The "NestPlot" graph is created automatically when you use ODS GRAPHICS:

ods graphics on;
proc glm data=Turnip;
   class Plant Leaf;
   model Calcium = Leaf(Plant);   /* Leaf nested in Plant */
quit;
Nested boxplots created by PROC GLM in SAS

As you can see, the graph is the same except that it contains vertical lines that divide one plant from another. I like this version of the graph better.

Box plots for independent units

If you think about it, the color in the previous plot is not necessary. You might even consider it misleading because the leaf values are unrelated across plants. "Leaf 1" for "Plant 1" has no relationship to "Leaf 1" for the other plants. To emphasize that fact, you could relabel the leaves by using the values 1 through 12, where leaves 1–3 are from Plant=1, leaves 4–6 are from Plant=2, and so forth.

Labeling the leaves that way is necessary if you want to use to BOXPLOT procedure to visualize the data. The BOXPLOT procedure supports nested categories and the syntax is similar to the GLM syntax:

data Turnip2;
set Turnip;
LeafID = Leaf + (Plant-1)*3;   /* label samples 1, 2, 3, ..., 12 */
run;
 
proc boxplot data=Turnip2;
   plot Calcium*LeafID(Plant) / odstitle=title odstitle2=title2 odsfootnote=footnote;
run;
Nested boxplots created by PROC BOXPLOT in SAS

The output from the BOXPLOT procedure uses column headers to indicate the plants. This is similar to the visualization that I used to label categories for tropical storms and hurricanes.

You can use three features of SAS/STAT 15.1 (SAS 9.4M6) to add vertical reference lines, to color the background, and to center the plant labels in the column headers. The options are the BLOCKREF option, the BLOCKREFFILL option, and the BLOCKVALUEPOS= option, respectively.

/* Options for extending the column headers into the plot region.
   These options require SAS/STAT 15.1 */
proc boxplot data=Turnip2;
   plot Calcium*LeafID(Plant) / odstitle=title odstitle2=title2 odsfootnote=footnote
                      blockref blockreffill blockvaluepos=center;
run;
Nested boxplots created by PROC BOXPLOT in SAS

I like this graph because it uses colors sparingly and does not require a legend. Also, if the number of samples is large (such as 30 or 50), PROC BOXPLOT will automatically produce a series of graphs, each displaying a portion of the data.

In summary, this article shows three ways to create box plots of responses for nested categorical data. The simplest is to use the VBOX statement in PROC SGPLOT, although if there are additional categorical variables in the model, you can create a lattice of box plots by using the PANELBY statement in PROC SGPANEL. The GLM procedure enables you to perform an ANOVA analysis for the data and also create a visualization. The BOXPLOT procedure provides nice headers and (in SAS/STAT 15.1) colored strips for each level of the "outer" categories.

The post 3 ways to create nested box plots in SAS appeared first on The DO Loop.

2月 252019
 

When I run a bootstrap analysis, I create graphs to visualize the distribution of the bootstrap statistics. For example, in my article about how to bootstrap the difference of means in a two-sample t test, I included a histogram of the bootstrap distribution and added reference lines to indicate a confidence interval for the difference of means.

For t tests, the TTEST procedure supports the BOOTSTRAP statement, which automates the bootstrap process for standard one- and two-sample t tests. A new feature in SAS/STAT 15.1 (SAS 9.4M6) is that the TTEST procedure supports the PLOTS=BOOTSTRAP option, which automatically creates histograms, Q-Q plots, and scatter plots of various bootstrap distributions.

To demonstrate the new PLOTS=BOOTSTRAP option, I will use the same example that I used to demonstrate the BOOTSTRAP statement. The data are the sedans and SUVs in the Sashelp.Cars data. The research question is to estimate the difference in fuel efficiency, as measured by miles per gallon during city driving. The bootstrap analysis enables you to visualize the approximate sampling distribution of the difference-of-mean statistic and its standard deviation. The following statements create the data and run PROC TTEST to generate the analysis. The PLOTS=BOOTSTRAP option generates the bootstrap graphs. The BOOTSTRAP statement request 5,000 bootstrap resamples and 95% confidence intervals, based on the percentiles of the bootstrap distribution:

/* create data set that has two categories: 'Sedan' and 'SUV' */
data Sample;
set Sashelp.Cars(keep=Type MPG_City);
if Type in ('Sedan' 'SUV');
run;
 
ods trace on;
title "Bootstrap Estimates with Percentile CI";
proc ttest data=Sample plots=bootstrap;
   class Type;
   var MPG_City;
   bootstrap / seed=123 nsamples=5000 bootci=percentile;
run;

The TTEST procedure creates seven tables and eight graphs. The previous article displayed and discussed several tables, so here I display only two of the graphs.

Graph of the bootstrap distribution for the difference of means. The green region indicates the 95% percentile confidence interval, based on the bootstrap samples. Computed by using the PLOTS=BOOSTRAP option in PROC TTEST in SAS/STAT 15.1.

The first graph is a histogram of the bootstrap distribution of the difference between the sample means for each vehicle type. The distribution appears to be symmetric and approximately normal. The middle of the distribution is close to -5, which is the point estimate for the difference in MPG_City between the SUVs and the sedans in the data. How much should we trust that point estimate? The green region indicates that 95% of the bootstrap samples had differences that were in the green area underneath the histogram, which is approximately [-5.9, -4.1]. This is a 95% confidence interval for the difference.

Similar histograms (not shown) are displayed for two other statistics: the pooled standard deviation (of the difference between means) and the Satterthwaite standard deviation. The procedure also creates Q-Q plots of the bootstrap distributions.

Scatter plot of the joint bootstrap distribution for the difference of means and the standard deviation of the difference of means. Computed by using the PLOTS=BOOSTRAP option in PROC TTEST in SAS/STAT 15.1.

The second graph is a scatter plot that shows the joint bootstrap distribution of the mean difference and the pooled standard deviations of the difference, based on 5,000 bootstrap samples. You can see that the statistics are slightly correlated. A sample that has a large (absolute) mean difference also tends to have a relatively large standard deviation. The 95% prediction region for the joint distribution indicates how these two statistics co-vary among random samples.

In summary, the TTEST procedure in SAS/STAT 15.1 supports a new PLOTS=BOOTSTRAP option, which automatically creates many graphs that help you to visualize the bootstrap distributions of the statistics. If you are conducting a bootstrap analysis for a t test, I highly recommend using the plots to visualize the bootstrap distributions and results

The post Graphs of bootstrap statistics in PROC TTEST appeared first on The DO Loop.

2月 252019
 

There will be more than 55 billion IoT devices by 2025 – more than four devices for every person on earth. That means that big data is only getting bigger. All of that high frequency, high velocity data from connecting the physical world to the digital world is available, and [...]

Unlocking the value of IoT data with the artificial intelligence of things was published on SAS Voices by Tim Clark

2月 252019
 

In my previous blog post, I explored how reinforcement learning is taking the guesswork out of marketing to deliver great experiences. Let’s take a look at two additional areas where AI is transforming the customer experience: recommendations and natural language processing. Personalised recommendations get a boost from AI We’ve all [...]

Two more ways AI is transforming the customer experience was published on Customer Intelligence Blog.

2月 222019
 

Have you heard? SAS recently announced a new practical programming credential, SAS® Certified Specialist: Base Programming Using SAS® 9.4. This new practical programming credential is different than our previous exams. Now it requires you to take a performance-based exam, in which you access a SAS environment and then write and execute SAS code. Practical, right? At the end, your answers are scored for correctness.

This new SAS Certified Specialist credential will run in parallel with the current SAS Certified Base Programmer credential until June 2019. So make sure to check out the complete exam content guide for a list of objectives that are tested on the exam!

A new certification guide


SAS is also releasing an accompanying certification guide to help you prepare for the new programming credential: SAS® Certified Specialist Prep Guide: Base Programming Using SAS® 9.4. This new certification guide has been streamlined to remove redundancy throughout the chapters and has been aligned with the courses, SAS® Programming 1: Essentials and SAS® Programming 2: Data Manipulation Techniques, along with the exam content guide.

The new certification guide also includes a workbook that provides programming scenarios to help you prepare for the performance-based portion of the exam. Make sure to also review the SAS® 9.4 Base Programming Exam Experience Tutorial to help you prepare!

Here are some of the changes made to the certification guide:

    • Removal of INFILE/INPUT statements. These statements are no longer taught in SAS® Programming 1 and SAS® Programming 2 courses or tested in the exam. Thus, these statements have been replaced with the SET statement and the IMPORT procedure.
    • Removal of arrays. Arrays are no longer taught in SAS® Programming 1 and SAS® Programming 2 courses or tested in the exam.
    • Addition of the TRANSPOSE and EXPORT procedures, as well as macro variables. These additions are tested and are taught in the SAS® Programming 1 and SAS® Programming 2 courses.
    • Updating of existing examples and the addition of more annotated examples that are easier to follow and have better quality graphics. These changes help to improve the readability of examples, and there is a closer relationship between the sample questions in the book and the questions that appear on the actual exam.
    • Installation of sample data, and the way you work with SAS libraries has been simplified.
    • A new companion piece, the Quick Syntax Reference Guide, is also available for download.

More opportunities to test your skills

For even more programming practice check out Ron Cody’s Learning SAS by Example. This is full of practical examples, and exercises with solutions which allow you to test your programming skills for the exam.

Not ready for the certification exam just yet? You can also browse our books for getting started with SAS. There you will find, of course, The Little SAS Book: A Primer 5th Edition and Exercises and Projects for The Little SAS® Book, Fifth Edition. Both are great guides for new users wanting to learn SAS and practice before getting to the New Base Programming Specialist Exam!

Thinking about getting SAS® certified? Check out the new SAS certification guide! was published on SAS Users.

2月 202019
 

Last year I published a series of blogs posts about how to create a calibration plot in SAS. A calibration plot is a way to assess the goodness of fit for a logistic model. It is a diagnostic graph that enables you to qualitatively compare a model's predicted probability of an event to the empirical probability. I am happy to report that in SAS/STAT 15.1 (SAS 9.4M6), you can create a calibration plot automatically by using the PLOTS=CALIBRATION option on the PROC LOGISTIC statement.

Calibration plots for a model of a binary response

To demonstrate how to create a calibration plot by using PROC LOGISTIC, consider the simulated data that I analyzed in "Calibration plots in SAS." The data contain a binary response variable, Y, which depends quadratically on a uniformly distributed explanatory variable, X. The following call to PROC LOGISTIC fits a quadratic the model to the data. The new GOF option requests an extensive set of goodness-of-fit statistics and the PLOTS=CALIBRATION option requests a calibration plot:

/* NEW in SAS/STAT 15.1 (SAS 9.4M6): PLOTS=CALIBRATION option in PROC LOGISTIC */
title "Calibration Plot for a Quadratic Model";
title2 "Created by PROC LOGISTIC";
proc logistic data=LogiSim plots=calibration(CLM ShowObs);
   model y(Event='1') = x x*x / GOF;      /* New in 15.1: More goodness-of-fit statistics */
run;
Calibration plot for a quadratic logistic model, created by PROC LOGISTIC in SAS

The calibration plot is shown. (Click to enlarge.) The plot contains a gray diagonal line, which represents perfect calibration. If most of the predicted responses agree with the observed responses, then the blue curve should be close to the diagonal line. That is the case in this example. The light blue band is a 95% confidence region for the loess fit and is created by using the CLM option.

Because I used the SHOWOBS option, the calibration plot displays tiny histograms along the top and bottom of the plot. The histograms indicate the distribution of the Y=0 and Y=1 responses. The article "Use a fringe plot to visualize binary data in logistic models" explains more about how fringe plots can add insight to graphs that involve a binary response variable.

The lower right corner of the calibration plot contains one of the many goodness-of-fit statistics that are computed when you use the GOF option on the MODEL statement. A small p-value would indicate a lack of fit. In this case, there is no reason to suspect a lack of fit. The following table shows other goodness-of-fit tests. None of the p-values are small, so none of the tests indicate lack of fit.

Goodness-of-fit statistics for a quadratic logistic model, created by PROC LOGISTIC in SAS

Calibration plots for a polytomous response

An exciting feature of the calibration plots in PROC LOGISTIC is that you can use them for a polytomous response model. Derr (2013) fits a proportional odds model that predicts the probability of the severity of black-lung disease from the length of exposure to coal dust in 371 coal miners. The response variable, Severity, has the levels 'Severe', 'Moderate', and 'Normal'. The following statement create the data and model and request calibration plots for the model.

/* Data, from McCullagh and Nelder (1989, p. 179), used in Derr (2013, p. 8-10).
   The severity of pneumoconiosis (black lung disease) in coal miners
   and the number of years of exposure.
*/
data Coal; 
input Severity $ @@; 
do i=1 to 8; 
   input Exposure freq @@; 
   log10Exposure=log10(Exposure); 
   output; 
end; 
datalines; 
Normal   5.8 98 15 51 21.5 34 27.5 35 33.5 32 39.5 23 46 12 51.5 4 
Moderate 5.8  0 15  2 21.5  6 27.5  5 33.5 10 39.5  7 46  6 51.5 2 
Severe   5.8  0 15  1 21.5  3 27.5  8 33.5  9 39.5  8 46 10 51.5 5 
;
 
title 'Severity of Black Lung vs Log10(Years Exposure)';
proc logistic data=Coal rorder=data plots=Calibration(CLM);
   freq freq; 
   model Severity(descending) = log10Exposure; 
   effectplot / noobs individual;
run;
Panel of calibration plots for a polytomous proportional-odd model, created by PROC LOGISTIC in SAS

Derr (2013) discusses the results of the analysis, which are not shown here. I've displayed only the calibration plot for the model. Notice that PROC LOGISTIC creates a panel of three calibration plots, one for each response level. The calibration curves all lie close to the diagonal, so the diagnostic plots do not indicate a lack of calibration for any part of the model.

Summary

In summary, the PLOTS=CALIBRATION option in SAS/STAT 15.1 enables you to automatically create a calibration plot. The calibration plot is a diagnostic plot that qualitatively compares a model's predicted and empirical probabilities. You can use the PLOTS=CALIBRATION option on the PROC LOGISTIC statement to create a calibration plot. The CALIBRATION option supports several suboptions, which you can read about in the documentation for the PROC LOGISTIC statement.

You can download the SAS code used in this article, which includes SAS code that demonstrates how to create a calibration plot manually.

The post An easier way to create a calibration plot in SAS appeared first on The DO Loop.