Statistical Graphics

1月 142019
 

When you overlay two series in PROC SGPLOT, you can either plot both series on the same axis or you can assign one series to the main axis (Y) and another to a secondary axis (Y2). If you use the Y and Y2 axes, they are scaled independently by default, which is usually what you want. However, if the measurements for the two series are linearly related to each other, then you might want to specify the tick values for the Y2 axis so that they align with the corresponding tick marks for the Y axis. This article shows how to align the Y and Y2 axes in PROC SGPLOT in SAS for two common situations.

Different scales for one set of measurements

The simplest situation is a single set of data that you want to display in two different units. For example, you might use one axis to display the data in imperial units (pounds, gallons, degrees Fahrenheit, etc.) and the other axis to display the data in metric units (kilograms, liters, degrees Celsius, etc.).

To plot the data, define one variable for each unit. For example, the Sashelp.Class data records the weight for 19 students in pounds. The following DATA view creates a new variable that records the same data in kilograms. The subsequent call to PROC SGPLOT plots the pounds on the Y axis (left axis) and the kilograms on the Y2 axis (right axis). However, as you will see, there is a problem with the default scaling of the two axes:

data PoundsKilos / view=PoundsKilos;
   set Sashelp.Class(rename=(Weight=Pounds));
   Kilograms = 0.453592 * Pounds;            /* convert pounds to kilos */
run;
 
title "Independent Axes";
title2 "Markers Do Not Align Correctly!";   /* the tick marks on each axis are independent */
proc sgplot data=PoundsKilos;
   scatter x=Height y=Pounds;
   scatter x=Height y=Kilograms / Y2Axis;
run;

The markers for the kilogram measurements should exactly overlap the markers for pounds, but they don't. The Y and Y2 axes are independently scaled because PROC SGPLOT does not know that pounds and kilograms are linearly related. The SGPLOT procedure displays each variable by using a range of round numbers (multiples of 10 or 20). The range for the Y2 axis is [20, 70] kilograms, which corresponds to a range of [44.1, 154.3] pounds. However, the range for the Y axis is approximately [50, 150] pounds. Because the axes display different ranges, the markers do not overlap.

To improve this graph, use the VALUES= and VALUESDISPLAY= options on the YAXIS statement (or Y2AXIS statement) to force the ticks marks on one axis to align with the corresponding tick marks on the other axis. In the following DATA step, I use the kilogram scale as the standard and compute the corresponding pounds.

data Ticks;
do Kilograms = 20 to 70 by 10;     /* for each Y2 tick */
   Pounds = Kilograms / 0.453592;  /* convert kilos to pounds */
   Approx = round(Pounds, 0.1);    /* use rounded values to display tick values */
   output;
end;
run;
proc print; run;

You can use the Pounds column in the table to set the VALUES= list on the YAXIS statement. You can use the Approx column to set the VALUESDISPLAY= list, as follows:

/* align tick marks on each axis */
title "Both Axes Use the Same Scale";
proc sgplot data=PoundsKilos noautolegend;
   scatter x=Height y=Pounds;
   /* Make sure the plots overlay exactly! Then you can set SIZE=0 */
   scatter x=Height y=Kilograms / markerattrs=(size=0) Y2Axis;
   yaxis grid values=(44.092 66.139 88.185 110.231 132.277 154.324)
       valuesdisplay=('44.1' '66.1' '88.2' '110.2' '132.3' '154.3');
run;

Success! The markers for the two variables align exactly. After verifying that they align, you can use the MARKERATTRS=(SIZE=0) option to suppress the display of one of the markers.

Notice that the Y axis (pounds) no longer displays "nice numbers" because I put the tick marks at the same vertical heights on both axes. A different way to solve the misalignment problem is to use the MIN=, MAX=, THRESHOLDMIN=, and THRESHOLDMAX= options on both axes. This will enable both axes to use "nice numbers" while still aligning the data. If you want to try this approach, here are the YAXIS and Y2AXIS statements:

   /* set the axes ranges to coresponding values */
   yaxis  grid thresholdmin=0 thresholdmax=0 min=44.1 max=154.3;
   y2axis grid thresholdmin=0 thresholdmax=0 min=20   max=70;

Different scales for different measurements

Another situation that requires two Y axes is the case of two series that use different units. For example, you might want to plot the revenue for a US company (in dollars) and the revenue for a Japanese company (in yen) for a certain time period. You can use the conversion rate between yen and dollars to align the values on the axes. Of course, the conversion from Japanese yen to the US dollars changes each day, but you can use an average conversion rate to set the correspondence between the axes.

This situation also occurs when two devices use different methods to measure the same quantity. The following example shows measurements for a patient who receives a certain treatment. The quantity of a substance in the patient's blood is measured at baseline and for every hour thereafter. The quantity is measured in two ways: by using a traditional blood test and by using a new noninvasive device that measures electrical impedance. The following statements define and plot the data. The two axes are scaled by using the default method:

data BloodTest1;
label t="Hours after Medication"  x="micrograms per deciliter"  y="kiloOhms";
input x y @@;
t = _N_ - 1;
datalines;
169.0 45.5 130.8 33.4 109.0 23.8 94.1 19.8 86.3 20.4 78.4 18.7 
 76.1 16.1  72.2 16.7  70.0 11.9 69.8 14.6 69.5 10.6 68.7 12.7 67.3 16.9 
;
 
title "Overlay Measurements for Two Medical Devices";
title2 "Default Scaling";
proc sgplot data=BloodTest1;
   series x=t y=x / markers legendlabel="Standard Lab Value";
   series x=t y=y / markers Y2Axis legendlabel="New Device";
   xaxis values=(0 to 12 by 2);
   yaxis grid label="micrograms per deciliter";
   y2axis grid label="kiloOhms";
run;

In this graph, the Y axes are scaled independently. However, the company that manufactures the device used Deming regression to establish that the measurements from the two devices are linearly related by the equation Y = –10.56415 + 0.354463*X, where X is the measurement from the blood test. You can use this linear equation to set the scales for the two axes.

The following DATA step uses the Deming regression estimates to convert the tick marks on the Y axis into values for the Y2 axis. (Click here for the PROC PRINT output.) The call to PROC SGPLOT creates a graph in which the Y2 axis is aligned with the Y axis according to the Deming regression estimates.

data Ticks;
do Y1 = 60 to 160 by 20;
   /* use Deming regression to find one set of ticks in terms of the other */
   Y2 =  -10.56415 + 0.354463 * Y1;  /* kiloOhms as a function of micrograms/dL */
   Approx = round(Y2, 0.1);
   output;
end;
run;
 
proc print; run;
 
title "Align Y Axes for Different Series";
title2 "Measurements are Linearly Related";
proc sgplot data=BloodTest1;
   series x=t y=x / markers legendlabel="Standard Lab Value";
   series x=t y=y / markers Y2Axis legendlabel="New Device";
   xaxis values=(0 to 12 by 2);
   yaxis grid label="micrograms per deciliter" offsetmax=0.1
      values=(60 to 160 by 20);
   /* the same offsets must be used in both YAXIS and Y2AXIS stmts */
   y2axis grid label="kiloOhms" offsetmax=0.1
      values=(10.7036 17.7929 24.8822 31.9714 39.0607 46.1499)
      valuesdisplay=('10.7' '17.8' '24.9' '32.0' '39.1' '46.1'); 
run;

In this new graph, the measurements are displayed on compatible scales and the reference lines connect round numbers on one axis to the corresponding values on the other axis.

The post How to align the Y and Y2 axes in PROC SGPLOT appeared first on The DO Loop.

1月 092019
 

Numbers don't lie, but sometimes they don't reveal the full story. Last week I wrote about the most popular articles from The DO Loop in 2018. The popular articles are inevitably about elementary topics in SAS programming or statistics because those topics have broad appeal. However, I also write about advanced topics, which are less popular but fill an important niche in the SAS community. Not everyone needs to know how to fit a Pareto distribution in SAS or how to compute distance-based measures of correlation in SAS. Nevertheless, these topics are interesting to think about.

I believe that learning should not stop when we leave school. If you, too, are a lifelong learner, the following topics deserve a second look. I've included articles from four different categories.

Data Visualization

  • Fringe plot: When fitting a logistic model, you can plot the predicted probabilities versus a continuous covariate or versus the empirical probability. You can use a fringe plot to overlay the data on the plot of predicted probabilities. The SAS developer of PROC LOGISTIC liked this article a lot, so look for fringe plots in a future release of SAS/STAT software!
  • Order variables in a correlation matrix or scatter plot matrix: When displaying a graph that shows many variables (such as a scatter plot matrix), you can make the graph more understandable by ordering the variables so that similar variables are adjacent to each other. The article uses single-link clustering to order the variables, as suggested by Hurley (2004).
  • A stacked band plot, created in SAS by using PROC SGPLOT
  • Stacked band plot: You can use PROC SGPLOT to automatically create a stacked bar plot. However, when the bars represent an ordered categorical variable (such as months or years), you might want to create a stacked band plot instead. This article shows how to create a stacked band plot in SAS.

Statistics and Data Analysis

Random numbers and resampling methods

Process flow diagram shows how to resample data to create a bootstrap distribution.

Optimization

These articles are technical but provide tips and techniques that you might find useful. Choose a few topics that are unfamiliar and teach yourself something new in this New Year!

Do you have a favorite article from 2018 that I did not include on the list? Share it in a comment!

The post 10 posts from 2018 that deserve a second look appeared first on The DO Loop.

1月 022019
 

Last year, I wrote more than 100 posts for The DO Loop blog. Of these, the most popular articles were about data visualization, SAS programming tips, and statistical data analysis. Here are the most popular articles from 2018 in each category.

Data Visualization

General SAS programming techniques

Statistics and Data Analysis

I write this blog because I love to learn new things and share what I know with others. If you want to learn something new, read (or re-read!) these popular articles from 2018. Then share this page with one of your colleagues. Happy New Year! I hope we both have many opportunities to learn and share in 2019!

The post Top posts from <em>The DO Loop</em> in 2018 appeared first on The DO Loop.

12月 192018
 

I regularly see questions on a SAS discussion forum about how to visualize the predicted values for a mixed model that has at least one continuous variable, a categorical variable, and possibly an interaction term. SAS procedures such as GLM, GENMOD, and LOGISTIC can automatically produce plots of the predicted values versus the explanatory variables. These are called "fit plots" or "interaction plots" or "sliced fit plots." Although PROC MIXED does not automatically produce a "fit plot" for a mixed model, you can use the output from the procedure to construct a fit plot. In fact, two graphs are possible: one that incorporates the random effects for each subject in the predicted values and another that does not.

Use PROC PLM to visualize the fixed-effect model

Because the MIXED (and GLIMMIX) procedure supports the STORE statement, you can write the model to an item store and then use the EFFECTPLOT statement in PROC PLM to visualize the predicted values. The resulting graph visualizes the fixed effects. The random effects are essentially "averaged out" or shown at their expected value, which is zero.

As an example, consider the following repeated measures example from the PROC MIXED documentation. The data are measurements for 11 girls and 16 boys recorded when the children were 8, 10, 12, and 14 years old. According to Pothoff and Roy (1964), "Each measurement is the distance, in millimeters, from the centre of the pituitary to the pteryomaxillary fissure. The reason why there is an occasional instance where this distance decreases with age is that the distance represents the relative position of two points." You can use a spaghetti plot to visualize the raw data:

data pr;
   input Person Gender $ y1 y2 y3 y4;
   y=y1; Age=8;  output;
   y=y2; Age=10; output;
   y=y3; Age=12; output;
   y=y4; Age=14; output;
   drop y1-y4;
   label y="Relative Distance (mm)";
   datalines;
 1   F   21.0    20.0    21.5    23.0
 2   F   21.0    21.5    24.0    25.5
 3   F   20.5    24.0    24.5    26.0
 4   F   23.5    24.5    25.0    26.5
 5   F   21.5    23.0    22.5    23.5
 6   F   20.0    21.0    21.0    22.5
 7   F   21.5    22.5    23.0    25.0
 8   F   23.0    23.0    23.5    24.0
 9   F   20.0    21.0    22.0    21.5
10   F   16.5    19.0    19.0    19.5
11   F   24.5    25.0    28.0    28.0
12   M   26.0    25.0    29.0    31.0
13   M   21.5    22.5    23.0    26.5
14   M   23.0    22.5    24.0    27.5
15   M   25.5    27.5    26.5    27.0
16   M   20.0    23.5    22.5    26.0
17   M   24.5    25.5    27.0    28.5
18   M   22.0    22.0    24.5    26.5
19   M   24.0    21.5    24.5    25.5
20   M   23.0    20.5    31.0    26.0
21   M   27.5    28.0    31.0    31.5
22   M   23.0    23.0    23.5    25.0
23   M   21.5    23.5    24.0    28.0
24   M   17.0    24.5    26.0    29.5
25   M   22.5    25.5    25.5    26.0
26   M   23.0    24.5    26.0    30.0
27   M   22.0    21.5    23.5    25.0
;
 
/* for information about the LEGENDITEM statement, see
   https://blogs.sas.com/content/iml/2018/02/12/merged-legends.html */
title "Relative Distance Between Features";
proc sgplot data=pr;
   series x=Age y=Y / group=Person groupLC=Gender curvelabel;
   /* LEGENDITEM is a SAS 9.4M5 feature. Delete the following statements in older versions of SAS */
   legenditem type=line name="F" / label="Girls" lineattrs=GraphData1; 
   legenditem type=line name="M" / label="Boys" lineattrs=GraphData2(pattern=dash); 
   keylegend "M" "F";
run;

Notice that I used the LEGENDITEM statement to customize the legend. The LEGENDITEM statement is a SAS 9.4M5 feature. You can delete the LEGENDITEM and KEYLEGEND statements to obtain the default legend.

One of the stated goals of the study is to model the average growth rate of the measurement for boys and for girls. You can see from the spaghetti plot that growth rate appears to be linear and that boys tend to have larger measurements than girls of the same age. However, it is not clear whether the rate (the slope of the average line) is the same for each gender or is significantly different.

The documentation example describes several ways to model the variance structure for the repeated measures. One choice is the AR(1) structure. The following statement uses the REPEATED statement to model the repeated measures. The STORE statement writes an item store that contains information about the model, which is used by PROC PLM to create effect plots:

proc mixed data=pr method=ml;
   class Person Gender;
   model y = Gender Age Gender*Age / s;
   repeated / type=ar(1) sub=Person r;
   store out=MixedModel;                       /* create item store */
run;
 
proc plm restore=MixedModel;                   /* use item store to create fit plots */
   effectplot fit(x=Age plotby=Gender);        /* panel */
   effectplot slicefit(x=Age sliceby=Gender);  /* overlay */
   *effectplot slicefit(x=Age sliceby=Person); /* ERROR: Person is not a fixed effect */
run;

The call to PROC PLM creates a panel of plots with confidence bands (not shown) and a graph that overlays the predicted values for males and females (shown). You can see a small difference in slopes between the two average growth curves, but the "Type 3 Tests of Fixed Effects" table from PROC MIXED (not shown) indicates that the difference is not statistically significant. The graph does not display the raw observations because PROC PLM does not have access to them. However, you can obtain a graph that overlays the observation by modifying the method in the next section.

Notice that this graph displays the model for boys and girls, but not for the individual subjects. In fact, I added a comment to the program that reminds you that it is an error to try to use the PERSON variable in the EFFECTPLOT statement because PERSON is not a fixed effect in the model. If you want to model to growth curves for individuals, see the next section.

Use the OUTPRED= option visualize the random-coefficient model

The spaghetti plot seems to indicate that the growth curves for the individuals have the same slope but different intercepts. You can model this by using the RANDOM statement to add a random intercept effect to the model. The resulting graph "untangles" the spaghetti plot by plotting a line that best fits each individual's growth.

You can use the OUTPRED= option on the MODEL statement to create an output data set in which the predicted values incorporate the random intercept for each person:

/* random coefficient model */
proc mixed data=pr method=ml;
   class Person Gender;
   model y = Gender Age Gender*Age / s outpred=Pred;
   random int / sub=Person;
run;
 
/* BE SURE TO SORT by the X variable, before creating a SERIES plot. */
/* These data are already sorted, but the next line shows how to sort, if necessary */
proc sort data=Pred;
   by Gender Person Age;
run;
 
title "Predicted Individual Growth Curves";
proc sgplot data=Pred;
   scatter x=Age y=Y / group=Gender;
   series x=Age y=Pred / group=Person GroupLC=Gender curvelabel;
   /* LEGENDITEM is a SAS 9.4M5 feature. Delete the following statements in older versions of SAS */
   legenditem type=markerline name="F" / label="Girls" lineattrs=GraphData1 markerattrs=GraphData1; 
   legenditem type=markerline name="M" / label="Boys" lineattrs=GraphData2(pattern=dash) markerattrs=GraphData2; 
   keylegend "M" "F";
run;

This graph shows a smoothed version of the spaghetti plot. The graph enables you to see the variation in intercepts for the subjects but the slopes are determined by the gender of each individual. In statistical terms, the predicted values in this graph "incorporate the EBLUP values." The documentation contains the formula for the predicted values.

It is worth noting that the MODEL statement in PROC MIXED also supports an OUTPREDM= option. (The 'M' stands for 'marginal' model.) The OUTPREDM= option writes a data set that contains the predictions that do not "incorporate the EBLUP values", where EBLUP is an abbreviation for the "empirical best linear unbiased prediction." These are the same predicted values that you obtain from the STORE statement and PROC PLM. However, the OUTPREDM= data set gives you complete control over how you use the predicted values. For example, you can use the OUTPREDM= data set to add the original observations to the graph of the predicted values for boys and girls.

In summary, when you include a continuous covariate (like age) in a mixed model, there are two ways to visualize the "fit plot." You can use the STORE statement and PROC PLM to obtain a graph of the predictions from the "marginal model" that contains the fixed effects. (You can also use the OUTPREDM= option for this purpose.) Alternatively, if you are fitting a model that estimates random coefficients (intercepts or slopes), you can use the OUTPRED= option to write a data set that contains the predicted that incorporate the estimates.

The post Visualize a mixed model that has repeated measures or random coefficients appeared first on The DO Loop.

12月 032018
 

When a graph includes several markers or line styles, it is often useful to create a legend that explains the relationship between the data and the symbols, color, and line styles in the graph. The SGPLOT procedure does a good job of automatically creating and placing a legend for most graphs. However, sometimes it is useful to override the procedure's default choices. This article describes five tips that you can use to customize the content and placement of legends. The tips are:

  1. Suppress the legend by using the NOAUTOLEGEND option.
  2. Choose which components of the graph appear in the legend by using a KEYLEGEND statement and the NAME= option.
  3. Position the legend by using the LOCATION= and POSITION= option on the KEYLEGEND statement.
  4. Exclude one or more items from a legend by using the EXCLUDE= option on the KEYLEGEND statement (requires SAS 9.4M3).
  5. Consolidate one or more items by using the LEGENDITEM statement (requires SAS 9.4M5).

1. Suppress the legend

By default, the SGPLOT procedure displays a legend when there are multiple plots that are overlaid in the graph. This can be caused by multiple statements or by using the GROUP= option on a statement. If the information in the default legend is redundant, and you might want to suppress it. For example, the following legend is unnecessary because the title explains the data and the regression line. You can uncomment the NOAUTOLEGEND option to suppress the legend.

title "Linear Regression for Weight and Height";
title2 "The legend is unnecessary";
proc sgplot data=Sashelp.Class /* NOAUTOLEGEND */;
   scatter x=Height y=Weight;
   reg x=Height y=Weight / nomarkers;
   footnote J=L "Use the NOAUTOLEGEND option to suppress the legend";
run;
footnote;
You can use the NOAUTOLEGEND option on the PROC SGPLOT statement in SAS to suppress a legend

2. Choose which components appear in the legend

In some graphs that overlay multiple components, some components are self -explanatory and do not need to appear in the legend. You can choose which components appear in the legend by using the NAME= option on the statements and using the KEYLEGEND statement to specify the contents of the legend. For example, the following statements create a graph that consists of a scatter plot, a confidence ellipse, and a regression line. If you only want the confidence ellipse and regression line to appear in the legend, use the NAME= option to identify each component and use the KEYLEGEND statement to specify the contents of the legend:

title "Weight versus Height";
title2 "Overlay Least Squares Fit and Confidence Ellipse";
proc sgplot data=Sashelp.Class;
   scatter x=Height y=Weight / name="scatter";
   ellipse x=Height y=Weight / name="ellipse";
   reg x=Height y=Weight     / name="reg" nomarkers lineattrs=GraphData2;
   keylegend "reg" "ellipse"; /* list item in the order you want them */
run;
In the SGPLOT procedure in SAS, you can use the NAME= option to name components and the KEYLEGEND statement to specify which components should appear in a legend

3. Position the legend

The KEYLEGEND statement supports the LOCATION= and POSTITION= options, which enable you to place the legend almost anywhere in the graph. The LOCATION= option controls whether the legend appears inside or outside of the graph area. The POSITION= option controls the placement of the legend on the graph (left, right, top, bottom,...). However, I can never remember which option controls which attribute! Therefore, I created a mnemonic, which I hope will help you remember, too:

  • The LOCATION= option contains the substring 'CAT'. A CAT likes to go INSIDE and OUTSIDE the house. Therefore, the valid keywords for the LOCATION= option are INSIDE and OUTSIDE.
  • The POSITION= option contains the substring 'SIT'. You can SIT on the LEFT or RIGHT side of a couch. (Also, "position" can be used as a verb to mean "place on a page.") Therefore, the valid keywords for the POSITION= option are BOTTOM, BOTTOMLEFT, BOTTOMRIGHT, LEFT, RIGHT, TOP, TOPLEFT, and TOPRIGHT. (Some other graphical elements support a CENTER position, but not the legend.)

The following graph is the same as in the previous example, except that the location of the legend is inside the graph area and the position of the legend is in the lower-right corner. When you move the legend to the left or right side of the graph, it is often useful to use the ACROSS=1 option to force the legend to list the items vertically. Similarly, if you position the legend at the top or bottom of a graph, you might want to use the DOWN=1 option to list the items horizontally.

   keylegend "reg" "ellipse" / location=inside position=bottomright across=1;
In the SGPLOT procedure in SAS, you can use the LOCATION= and POSITION= options to position a legend

4. Exclude items from a legend

When you use the GROUP= option to display groups, you might want to exclude some of the group categories from the legend. The KEYLEGEND statement supports the EXCLUDE= option that you can use to exclude certain items. Three situations come to mind:

  • The group levels contain missing values. You might want to exclude the missing values from the legend by using KEYLEGEND / EXCLUDE=(" ");.
  • The purpose of the graph is to focus on one or two subgroups. If so, it might make sense to label only those groups. For example, if the purpose of a graph is to show income disparity between blacks and whites, you might decide not to include Asians or Hispanics in the legend: EXCLUDE=("Asian" "Hispanic");.
  • The group is binary. If a graph shows the results of a clinical trial and the legend includes the marker shape for the patients who died, it should be clear that the other marker shape represents patients who survived: EXCLUDE=("Alive");. An example is shown below
ods graphics / attrpriority=none;
title "Patient Status";
proc sgplot data=Sashelp.Heart(obs=200 where=(Systolic<=200));
   styleattrs datasymbols=(X CircleFilled);
   scatter x=Systolic y=Diastolic / group=Status;
   keylegend / exclude=("Alive");
run;
In the SGPLOT procedure in SAS, you can use the EXCLUDE= option on the KEYLEGEND statement to suppress a legend item, such as a category in a group

5. Customize items in a legend

The previous section shows how to exclude one or more levels in a categorical variable that is specified on the GROUP= option. You also might want to customize the items that appear in the legend in order to combine, for example, marker and line attributes. A situation where this comes up is when you want to overlay a group of curves on a scatter plot.

The LEGENDITEM statement (supported in SAS 9.4M5) enables you to specify what combination of markers and line patterns you want to appear for every item in a legend. It is a "super customization" statement that gives you complete control over the legend items.

The following statements show how to use the LEGENDITEM statement to create a customized legend. By default, if you use the REG statement with the GROUP= option, the legend will show only the colors and line patterns for the regression lines. In the following example, I have used the ATTRPRIORITY=NONE option to force the marker symbols to differ between groups. I want the legend to show not only the colors and patterns of the regression lines but also the marker symbols for each group:

/* ensure order of BP_Status is High, Normal, Optimal */
proc sort data=Sashelp.Heart(obs=200 where=(Systolic<=200)) out=Heart;
   by BP_Status;
run;
 
ods graphics / attrpriority=none;
title "Patients by Blood Pressure Status";
proc sgplot data=Heart;
   styleattrs datalinepatterns=(solid) ;
   reg x=Systolic y=Diastolic / group=BP_Status;   
   legenditem type=markerline name="H" /
      label="High" lineattrs=GraphData1 markerattrs=GraphData1; 
   legenditem type=markerline name="N" /
      label="Normal" lineattrs=GraphData2 markerattrs=GraphData2; 
   legenditem type=markerline name="O" /
      label="Optimal" lineattrs=GraphData3 markerattrs=GraphData3; 
   keylegend "O" "N" "H" / title="BP Status"; 
run;
In the SGPLOT procedure in SAS, you can use the LEGENDITEM statement to create customized items in a legend

In summary, PROC SGPLOT in SAS supports several ways to create, suppress, position, and customize the items in a legend. Do you have a favorite way to customize a legend in PROC SGPLOT? Leave a comment!

The post 5 tips for customizing legends in PROC SGPLOT in SAS appeared first on The DO Loop.

11月 262018
 
Funnel plot of the proportion of unimmunized students in NC kindergarten classes

Last week my colleague, Robert Allison, visualized data regarding immunization rates for kindergarten classes in North Carolina. One of his graphs was a scatter plot that displayed the proportion of unimmunized students versus the size of the class for 1,885 kindergarten classes in NC. This scatter plot is the basis for a statistical plot that is known as a funnel plot for proportions. The plot to the right shows a funnel plot that I created based on Robert's analysis.

The basic idea of a funnel plot is that small random samples have more variation than large random samples from the same population. If students are randomly chosen from a population in which some small proportion of children have an attribute, it might not be unusual if 40% of the students in a five-student class have the attribute (that's 2 out of 5) whereas it might be highly unusual to see such a high proportion in a 100-student class. The funnel plot enhances a scatter plot by adding curves that indicate a reasonable range of values for the proportion, given the size of a random sample.

For a discussion of funnel plots and how to create a funnel plot in SAS, see the article "Funnel plots for proportions." You can download the immunization data and the SAS program that I used to create the funnel plot for proportions.

A funnel plot for proportions

The preceding funnel plot contains the following statistical features:
  • A scatter plot of the data. Each marker represents a school.
  • A horizontal reference line. The line indicates the average proportion for the data. For these data, the statewide average proportion of unimmunized kindergarteners is 4.58%.
  • A curve that indicates an upper confidence limit for the proportion of unimmunized students, assuming that the classes are a random sample from a population in which 4.58% of the students are unimmunized. When a marker (school) appears above the curve, the proportion of unimmunized students in that school is significantly higher than the statewide proportion.

This funnel plot uses the shape (and color) of a marker to indicate whether the school is public (circle), private (upward-pointing triangle), or a charter school (right-pointing triangle). The plot includes tool tips so that you can hover the mouse over a marker and see the name and county of the school.

The graph indicates that there are dozens of schools for which the proportion of unimmunized students far exceeds the state average. A graph like this enables school administrators and public health officials to identify schools that have a larger-than-expected proportion of unimmunized students. Identifying schools is the first step to developing initiatives that can improve the immunization rate in school-age children.

Funnel plots for each school district

You can use a WHERE statement or BY-group processing in PROC SGPLOT to create a funnel plot for each county. A graph that shows only the schools in a particular district is more relevant for local school boards and administrators. The following graphs show the proportion of unimmunized students in Mecklenburg County (near Charlotte) and Wake County (near Raleigh), which are the two largest school districts in NC.

Proportion of unimmunized students in Mecklenburg County kindergarten classes
Proportion of unimmunized students in Wake County kindergarten classes

The first graph shows that Mecklenburg County has several schools that contain more than 60 kindergarten students and for which 25% or more of the students are unimmunized. In fact, some large schools have more than 40% unimmunized! In Wake County, fewer schools have extreme proportions, but there are still many schools for which the proportion of unimmunized students is quite large relative to the statewide average.

As Robert pointed out in his blog post, these are not official figures, so it is possible that some of the extreme proportions are caused by data entry errors rather than by hordes of unimmunized students.

Funnel plots for public, private, and charter schools

The following graph shows a panel of plots for public, private, and charter schools. There are many public schools whose proportion of unimmunized students is well above the statewide average. For the private and charter schools, about 10 schools stand out in each group.

I think the plot of private schools is particularly interesting. When the media reports on outbreaks of childhood diseases in schools, there is often a mention of a "religious exemption," which is when a parent or guardian states that a child will not be immunized because of their religious beliefs. The report often mentions that private schools are often affiliated with a particular religion or church. I've therefore assumed that private schools have a larger proportion of unimmunized students because of the religious exemption. These data do not indicate which students are unimmunized because of a religious exemption, but the panel of funnel plots indicates that, overall, not many private schools have an abnormally large proportion of unimmunized students. In fact, the private schools show smaller deviations from the expected value than the public and charter schools.

Summary

In summary, I started with one of Robert Allison's graphs and augmented it to create a funnel plot for proportions. A funnel plot shows the average proportion and confidence limits for proportions (given the sample size). If the students in the schools were randomly sampled from a population where 4.58% of students are unimmunized, then few schools would be outside of the confidence curve. Of course, in reality, schools are not random samples. Many features of school performance—including unimmunized students—depend on local socioeconomic conditions. By taking into account the size of the classes, the funnel plot identifies schools that have an exceptionally large proportion of unimmunized students. A funnel plot can help school administrators and public health officials identify schools that might benefit from intervention programs and additional resources, or for which the data were incorrectly entered.

The post A funnel plot for immunization rates appeared first on The DO Loop.

11月 122018
 

The SGPLOT procedure enables you to use the value of a response variable to color markers or areas in a graph. For example, you can use the COLORRESPONSE= option to define a variable whose values will be used to color markers in a scatter plot or cells in a heat map. You can use the COLORMODEL= statement to use a pre-defined color ramp or to define and use a custom color ramp. These options are often used to visualize a "third variable" (Z) in a two-dimensional plot (Y vs X).

However, the SGPLOT procedure is not the only way to create color ramps. Recently, I needed to create a custom color ramp and specify a color for values that were outside of a certain range. I also needed to change the color used for missing values in the response variable. This article shows how to perform these tasks by using the RANGEATTRMAP statement in the SAS Graph Template Language (GTL).

Create a basic color ramp in SAS

SAS provides several ways to visualize the values of a response variable in scatter plots, contour plots, heat maps, and other graphs. An example is shown to the right. In this scatter plot, the markers represent data for patients in a heart study. The two-dimensional coordinates of the markers represent the systolic blood pressure and the MRW ratio (a body-mass index) of 80 patients. The colors indicate measurements of cholesterol. Three observations are highlighted by using arrows. One point has an extreme value of cholesterol (400). Two others have missing values of cholesterol and are shown in a grey color by default.

The code that generates the graph follows. The COLORRESPONSE= option is used to name the response variable. The values of the response variable determine the color for each marker according to the color ramp that is specified on the COLORMODEL= option.

data Have;           /* example data: a subset of the Heart data set */
set Sashelp.Heart(firstobs=5000 rename=(Systolic=X MRW=Y Cholesterol=Z)
                  where=(X < 200 AND Y < 200));
label X="Systolic" Z="Cholesterol";
keep X Y Z;
run;
 
title "Markers Colored by Using the COLORRESPONSE= Option";
proc sgplot data=Have;
   scatter x=X y=Y / colorresponse=Z                 /* specify response variable */
                     colormodel=ThreeAltColorRamp    /* specify color ramp */
                     filledoutlinedmarkers markerattrs=(symbol=CircleFilled size=12);
   xaxis grid; yaxis grid;
run;

Advanced features for custom color ramps

The SGPLOT option provide basic functionality for creating a custom color ramp. For more control, you can use either of the following advanced techniques:

The RANGEATTRMAP statement

As shown in the previous example, using a color ramp requires that you specify two pieces of information: a response variable and a color ramp. In GTL, use the RANGEATTRVAR statement to specify the response variable. Use the RANGEATTRMAP statement to define a custom color ramp.

Although the documentation discusses it, I was initially confused about the difference between color ramps and "alt" color ramps. I erroneously assumed that they are functionally equivalent. After all, when I use the COLORMODEL= option in PROC SGPLOT, I can choose any predefined color ramp. For example, the graph at the start of this article is created by using COLORMODEL=ThreeAltColorRamp, which is a blue-black-red color ramp. However, I could just as easily specify COLORMODEL=ThreeColorRamp to use a blue-white-red color ramp.

However, in the RANGEATTRMAP statement, there is a substantive difference between the RANGECOLOR= option and the RANGEALTCOLOR= option:

  • Use the RANGECOLOR= and RANGECOLORMODEL= options to define a color ramp for "area plots" that use a fill color, such as bar charts, histograms, mosaic plots, heat maps, and contour plots. Sanjay's example of a heat map and my example of a mosaic plot are both "area plots" that use the RANGECOLORMODEL= option.
  • Use the RANGEALTCOLOR= and RANGEALTCOLORMODEL= options to define a color ramp for graphs that use lines and markers, such as scatter plots and series plots. The documentation for the RANGEATTRMAP statement provides an example of a scatter plot whose markers are colored according to a range attribute map.

The following GTL defines a template that creates a scatter plot that is similar to the one at the beginning of this article. Because I want to color markers, I use the RANGEALTCOLOR= and RANGEALTCOLORMODEL= options. The RANGEATTRMAP statement contains three RANGE statements. Each RANGE statement associates a color or colors with ranges of data values:

  • A custom yellow-orange-red color ramp is used for cholesterol values less than 350.
  • A custom color (black) is used for markers whose Z-value is greater than 350.
  • A custom color (lime green) is used for markers whose Z-value is missing.
proc template;
define statgraph scatter3Dcol;
begingraph;
   rangeattrmap name="ResponseRange";
        range min-350 / rangeAltColorModel=(CXFFFFB2 CXFED976 CXFEB24C CXFD8D3C CXFC4E2A CXE31A1C CXB10026);
        range OTHER   / rangeAltColor=Black;   /* or use the OVER or UNDER keyword */
        range MISSING / rangeAltColor=Lime;    /* color for missing values */
   endrangeattrmap;
   rangeattrvar var=Z                          /* specify response variable in data set */
                attrmap="ResponseRange"        /* specify custom color ramp */
                attrvar=RangeVar;              /* alias for this variable/ramp combination */
 
   entrytitle "Markers Colored by Using the RANGEATTRMAP Statement";
   layout overlay;
      scatterplot x=X y=Y / markercolorgradient=RangeVar  /* color by Z and custom color ramp */
                  filledoutlinedmarkers=true
                  markerattrs=(symbol=circlefilled size=12) 
                  name="Scatter";
      continuouslegend "Scatter" / title='Cholesterol'; 
   endlayout;
endgraph;
end;
run;
 
proc sgrender data=Have template=scatter3Dcol;
run;

You can do more with the RANGEATTRMAP statement, but I'll stop here. In summary, you can use the RANGEATTRMAP (and RANGEATTRVAR) statement in the Graph Template Language to define a custom color ramp. The RANGEATTRMAP statement supports features that are not surfaced in PROC SGPLOT, such as enabling you to specify a color for out-of-range values and missing values. If you are going to use the color ramp in an "area plot" such as a heat map, use the RANGECOLOR= and RANGECOLORMODEL= options to define the color ramp. If you are going to use the color ramp to color lines and markers, use the RANGEALTCOLOR= and RANGEALTCOLORMODEL= options.

The post Define custom color ramps by using the RANGEATTRMAP statement in GTL appeared first on The DO Loop.

11月 072018
 

When solving optimization problems, it is harder to specify a constrained optimization than an unconstrained one. A constrained optimization requires that you specify multiple constraints. One little typo or a missing minus sign can result in an infeasible problem or a solution that is unrelated to the true problem. This article shows two ways to check that you've correctly specified the constraints for a two-parameter optimization. The first is a program that translates the linear constraint matrix in SAS/IML software into a system of linear equations in slope-intercept form. The second is a visualization of the constraint region.

For simplicity, this article only discusses two-dimensional constraint regions. For these regions, the SAS/IML constraint matrix is a k x 4 matrix. We also assume that the constraints are linear inequalities that define a 2-D feasible region.

The SAS/IML matrix for boundary and linear constraints

In SAS/IML software you must translate your boundary and linear constraints into a matrix that encodes the constraints. (The OPTMODEL procedure in SAS/OR software uses a more natural syntax to specify constraints.) The first and second rows of the constraint matrix specify the lower and upper bounds, respectively, for the variables. The remaining rows specify the linear constraints. In general, when you have p variables, the first p columns contain a matrix of coefficients; the (p+1)st column encodes whether the constraint is an equality or inequality; the (p+2)th column specifies the right-hand side of the linear constraints. In this article, p = 2.

The following matrix is similar to the one used in a previous article about how to find an initial guess in a feasible region. The matrix encodes three inequality constraints for a two-parameter optimization. The first two rows of the first column specify bounds for the first variable: 0 ≤ x ≤ 10. The first two rows of the second column specify bounds for the second variable: 0 ≤ y ≤ 8. The third through fifth rows specify linear inequality constraints, as indicated in the comments. The third column encodes the direction of the inequality constraints. A value of -1 means "less than"; a value of +1 means "greater than." You can also use the value 0 to indicate an equality constraint.

proc iml;
con = {  0   0   .   .,        /* lower bounds */
        10   8   .   .,        /* upper bounds */
         3  -2  -1  10,        /* 3*x1 + -2*x2 LE 10 */
         5  10  -1  56,        /* 5*x1 + 10*x2 LE 56 */
         4   2   1   7 };      /* 4*x1 +  2*x2 GE  7 */

Verify the constraint matrix

A pitfall of encoding the constraints is that you might wonder whether you made a mistake when you typed the constraint matrix. Also, each linear constraint (row) in the matrix represents a linear equation in "standard form" such as 3*x - 2*y ≤ 10, whereas you might be more comfortable visualizing constraints in the equivalent "slope-intercept form" such as "y ≥ (3/2)x - 5.

It is easy to transform the equations from standard form to slope-intercept form. If c2 > 0, then the standard equation c1*x + c2*y ≥ c0 is transformed to y ≥ (-c1/c2)*x + (c0/c2). If c2 < 0, you need to remember to reverse the sign of the inequality: y ≤ (-c1/c2)*x + (c0/c2). Because the (p+1)th column has values ±1, you can use the SIGN function to perform all transformations without writing a loop. The following function "decodes" the constraint matrix and prints it is "human readable" form:

start BLCToHuman(con);
   nVar = ncol(con) - 2;
   Range = {"x", "y"} + " in [" + char(con[1,1:nVar]`) + "," + char(con[2,1:nVar]`) + "]";
   cIdx = 3:nrow(con);              /* rows for the constraints */
   c1 = -con[cIdx,1] / con[cIdx,2];
   c0 =  con[cIdx,4] / con[cIdx,2];
   sign = sign(con[cIdx,3] # con[cIdx,2]);
   s = {"<=" "=" ">="};
   idx = sign + 2;                 /* convert from {-1 0 1} to {1 2 3} */
   EQN = "y" + s[idx] + char(c1) + "x + " + char(c0);
   print Range, EQN;
finish;
 
run BLCToHuman(con);

Visualize the feasible region

In a previous article, I showed a technique for visualizing a feasible region. The technique is crude but effective. You simply evaluate the constraints at each point of a dense, regular, grid. If a point satisfies the constraints, you plot it in one color; otherwise, you plot it in a different color.

For linear constraints, you can perform this operation by using matrix multiplication. If A is the matrix of linear coefficients and all inequalities are "greater than" constraints, then you simply check that A*X ≥ b, where b is the right-hand side (the (p+2)th column). If any of the inequalities are "less than" constraints, you can multiply that row of the constraint matrix by -1 to convert it into an equivalent "greater than" constraint. This algorithm is implemented in the following SAS/IML function:

start PlotFeasible(con);
   L = con[1, 1:2];                   /* lower bound constraints */
   U = con[2, 1:2];                   /* upper bound constraints */
   cIdx = 3:nrow(con);                /* rows for linear inequality constraints */
   C = con[cIdx,] # con[cIdx,3];      /* convert all inequalities to GT */ 
   M = C[, 1:2];                      /* coefficients for linear constraints */
   RHS = C[, 4];                      /* right hand side */
   x = expandgrid( do(L[1], U[1], (U[1]-L[1])/50),    /* define (x,y) grid from bounds */
                   do(L[2], U[2], (U[2]-L[2])/50) );
   q = (M*x`>= RHS);                  /* 0/1 indicator matrix */
   Feasible = (q[+,] = ncol(cIdx));   /* are all constraints satisfied? */
   call scatter(x[,1], x[,2],) group=Feasible grid={x y} option="markerattrs=(symbol=SquareFilled)";
finish;
 
ods graphics / width=430px height=450px;
title "Feasible Region";
title2 "Formed by Bounds and Linear Constraints";
run PlotFeasible(con);

The graph shows that the feasible region as a red pentagonal region. The left and lower edges are determined by the bounds on the parameters. There are three diagonal lines that are determined by the linear inequality constraints. By inspection, you can see that the point (2, 2) is in the feasible region. Similarly, the point (8,6) is in the blue region and is not feasible.

The technique uses the bounds on the parameters (the first two rows) to specify the range of the axes in the plot. If your problem does not have explicit upper/lower bounds on the parameters, you can "invent" bounds such as [0, 100] or [-10, 10] just for plotting the feasible region.

In summary, this article shows two techniques that can help you verify that a constraint matrix is correctly specified. The first translates that constraint matrix into "human readable" form. The second draws a crude approximation of the feasible region by evaluating the constraints at each location on a dense grid. Both techniques are shown for two-parameter problems. However, with a little effort, you could incorporate the second technique when searching for a good initial guess in higher-dimensional problems.

The post Visualize the feasible region for a constrained optimization appeared first on The DO Loop.

10月 312018
 

A useful feature in PROC SGPLOT is the ability to easily visualize subgroups of data. Most statements in the SGPLOT procedure support a GROUP= option that enables you to overlay plots of subgroups. When you use the GROUP= option, observations are assigned attributes (colors, line patterns, symbols, ...) that indicate the value of the grouping variable. This article reviews the GROUP= option and shows how to trick PROC SGPLOT into performing a group analysis for statements that do not support the GROUP= option.

Three ways to plot data by groups

It is common to use colors or symbols to indicate which observations belong to each category of a grouping variable. Typical grouping variables include gender (male and female), political affiliation (democrats, republicans, and independents), race, education level, and so forth. When you use the SAS SG procedures to plot subsets of the data, there are three ways to arrange the plots. You can plot each group individually, you can create a panel of graphs, or you can overlay the groups on a single graph:

  • If you use the BY statement in PROC SGPLOT, each subgroup is plotted independently in its own graph. The axes are scaled based only on the data in that subgroup.
  • If you use the PANELBY statement in PROC SGPANEL, each subgroup is plotted in a cell of a lattice in which the axes are scaled to a common range.
  • If you use the GROUP= option, the plots for each subgroup are overlaid in a single graph.

The following SAS statements demonstrate each approach. Only the GROUP= overlay is displayed because that is the topic of this article:

proc sgplot data=Sashelp.Iris;        /* BY-group visualization. Three independent graphs.  */
   by Species;
   histogram SepalLength;
   density SepalLength / type=kernel;
run;
 
proc sgpanel data=Sashelp.Iris;       /* Panel visualization. Shared common axis. */
   panelby Species / columns=1 onepanel;
   histogram SepalLength;
   density SepalLength / type=kernel;
run;
 
proc sgplot data=Sashelp.Iris;       /* Overlay three plots in one graph  */
   histogram SepalLength / GROUP=Species binstart=42 binwidth=3 transparency=0.5;
   density SepalLength / type=kernel GROUP=Species;
run;

How to emulate the GROUP= option

Many SGPLOT statements (such as the SERIES and SCATTER statements) have supported the GROUP= option since the early days of ODS graphics. For other statements, support for the GROUP= option was added more recently. For example, the GROUP= option was added to the HISTOGRAM and DENSITY statements in SAS 9.4M2.

Here is a trick (shown to me by my colleague, Paul) that you can use to emulate the GROUP= option. If a statement in the SGPLOT procedure does not support the GROUP= option, but the statement DOES support the FREQ= option, you can often use the FREQ= option to construct a graph that overlays the subgroups. You need to do two things. First, you need to create binary indicators variables (sometimes called dummy variables) for each level of the categorical variable. You then use multiple statements, each with a different frequency variable, to overlay the subgroups. These two steps are shown by the following DATA step and call to PROC SGPLOT, which uses the FREQ= trick to overlay three histograms:

/* emulate a GROUP= option for SGPLOT statements that do not support GROUP= */
data IrisFreq;
   set sashelp.Iris;
   Freq1 = (Species='Setosa');     /* Binary. Equals 1 if observation is in 'Setosa' group     */
   Freq2 = (Species='Versicolor'); /* Binary. Equals 1 if observation is in 'Versicolor' group */
   Freq3 = (Species='Virginica');  /* Binary. Equals 1 if observation is in 'Verginica' group  */
run;
 
title "Overlay Histograms by Using the FREQ= Option";
%let binOpts = binstart=42 binWidth=3 transparency=0.5; /* ensure common bins */
proc sgplot data=IrisFreq;
   histogram SepalLength / freq=Freq1 &binOpts;    /* only the 'Setosa' group     */
   histogram SepalLength / freq=Freq2 &binOpts;    /* only the 'Versicolor' group */
   histogram SepalLength / freq=Freq3 &binOpts;    /* only the 'Virginica' group  */
run;

The graph overlays three histograms, one for each value of the Species variable. The result is similar to the earlier graph that used the GROUP= option. You can use the same trick on the DENSITY statement, although you will need to manually set the line attributes so that they match the attributes for the corresponding histograms.

You can use this technique in old versions of SAS to emulate the GROUP= option on the HISTOGRAM statement. You can also use it for statements that do not support the GROUP= option.

Although this example uses the DATA step to manually create the dummy variables that are used as frequencies, you can also create the dummy variables automatically by generating the "design matrix" for the Species variable. The GLMMOD procedure is the simplest way to create dummy variables in SAS, but other procedures provide additional features.

Generate prediction ellipses for groups

Several years ago I showed how you can overlay prediction ellipses for each group on a scatter plot. (Note that the ELLIPSE statement does not support a GROUP= option.) The technique requires that you transpose the data from long to wide form by creating new variables, one for each group of the categorical variable. Paul recognized that creating a dummy variable and using the FREQ= option is a simpler way to overlay prediction ellipses on a scatter plot:

title "Prediction Ellipses for Iris Data";
proc sgplot data=IrisFreq;
   scatter x=PetalLength y=PetalWidth / group=Species;
   ellipse x=PetalLength y=PetalWidth / freq=Freq1 legendlabel="Setosa";
   ellipse x=PetalLength y=PetalWidth / freq=Freq2 legendlabel="Versicolor";
   ellipse x=PetalLength y=PetalWidth / freq=Freq3 legendlabel="Virginica";
run;

Advantages and disadvantages of the FREQ= trick

The main advantage of using the FREQ= option for group processing is that it enables you to overlay subgroups even when a statement does not support the GROUP= option. A secondary advantage is that this technique gives you complete control of the attributes of each subgroup. Although you can use the STYLEATTRS statement to control many group attributes, the STYLEATTTRS statement does not enable you to control marker sizes or line widths, to name two examples.

The FREQ= trick does have some disadvantages:

  • You can't use the FREQ= trick for statements that produce graphs of categorical variables. The SGPLOT documentation states, “If your plot is overlaid with other categorization plots, then the first FREQ variable that you specified is used for all of the plots.” [My emphasis.]
  • As mentioned earlier, if you are trying to produce multiple grouped plots, you might need to manually assign attributes to obtain consistency among the levels of the grouping variables. By default, most ODS styles use different attributes for each statement. If you want the attributes for the fourth statement to match the attributes for the first statement, you need to use an option such as LINEATTRS=GraphData1 on the fourth statement.

In conclusion, if a statement supports the GROUP= option, you should probably use that option to overlay plots of the groups. But if a statement does NOT support the GROUP= option (such as the ELLIPSE and HEATMAP statements), you can use the FREQ= trick to emulate the GROUP= behavior.

I thank my colleague, Paul, for showing me the ellipse example. I hope you agree that this trick is a real treat, not just on Halloween, but every day!

The post A trick to plot groups in PROC SGPLOT appeared first on The DO Loop.

10月 012018
 

Programmers on a SAS discussion forum recently asked about the chi-square test for proportions as implemented in PROC FREQ in SAS. One person asked the basic question, "how do I test the null hypothesis that the observed proportions are equal to a set of known proportions?" Another person said that the null hypothesis was rejected for his data, and he wanted to know which categories were "responsible for the rejection." This article answers both questions and points out a potential pitfall when you specify the proportions for a chi-square goodness-of-fit test in PROC FREQ.

The basic idea: The proportion of party affiliations for a group of voters

To make these questions concrete, let's look at some example data. According to a 2016 Pew research study, the party affiliation of registered voters in the US in 2016 was as follows: 33% of voters registered as Democrats, 29% registered as Republicans, 34% were Independents, and 4% registered as some other party. If you have a sample of registered voters, you might want to ask whether the observed proportion of affiliations matches the national averages. The following SAS data step defines the observed frequencies for a hypothetical sample of 300 voters:

data Politics;
length Party $5;
input Party $ Count;
datalines;
Dem   125
Repub  79
Indep  86
Other  10
;

You can use the TESTP= option on the TABLES statement in PROC FREQ to compare the observed proportions with the national averages for US voters. You might assume that the following statements perform the test, but there is a potential pitfall. The following statements contain a subtle error:

proc freq data=Politics;
/* National Pct:    D=33%, R=29%, I=34%, Other=04% */
tables Party / TestP=(0.33 0.29 0.34 0.04) nocum; /* WARNING: Contains an error! */
weight Count;
run;

If you look carefully at the OneWayFreqs table that is produced, you will see that the test proportions that appear in the fourth column are not the proportions that we intended to specify! The problem is that order of the categories in the table is alphabetical whereas the proportions in the LISTP= option correspond to the order that the categories appear in the data. In an effort to prevent this mistake, the documentation for the TESTP= option warns you to "order the values to match the order in which the corresponding variable levels appear in the one-way frequency table." The order of categories is important in many SAS procedures, so always think about the order! (The ESTIMATE and CONTRAST statements in linear regression procedures are other statements where order is important.)

Specify the test proportions correctly

To specify the correct order, you have two options: (1) list the proportions for the TESTP= option according to the alphabetical order of the categories, or (2) use the ORDER=DATA option on the PROC FREQ statement to tell the procedure to use the order of the categories as they appear in the data. The following statement uses the ORDER=DATA option to specify the proportions:

proc freq data=Politics ORDER=DATA;   /* list proportions in DATA order */
/*                 D=33%, R=29%, I=34%, Other=04% */
tables Party / TestP=(0.33 0.29 0.34 0.04);  /* cellchi2 not available for one-way tables */
weight Count;
ods output OneWayFreqs=FreqOut;
output out=FreqStats N ChiSq;
run;

The analysis is now correctly specified. The chi-square table indicates that the observed proportions are significantly different from the national averages at the α = 0.05 significance level.

Which categories are "responsible" for rejecting the null hypothesis?

A SAS programmer posted a similar analysis on a discussion and asked whether it was possible to determine which categories were the most different from the specified proportions. The analysis shows that the chi-square test rejects the null hypothesis, but does not indicate whether only one category is different than expected or whether many categories are different.

Interestingly, PROC FREQ supports such an option for two-way tables when the null hypothesis is the independence of the two variables. Recall that the chi-square statistic is a sum of squares, where each cell in the table contributes one squared value to the sum. The CELLCHI2 option on the TABLES statement "displays each table cell’s contribution to the Pearson chi-square statistic.... The cell chi-square is computed as
(frequencyexpected)2 / expected
where frequency is the table cell frequency (count) and expected is the expected cell frequency" under the null hypothesis.

Although the option is not supported for one-way tables, it is straightforward to use the DATA step to compute each cell's contribution. The previous call to PROC FREQ used the ODS OUTPUT statement to write the OneWayFreqs table to a SAS data set. It also wrote a data set that contains the sample size and the chi-square statistic. You can use these statistics as follows:

/* create macro variables for sample size and chi-square statistic */
data _NULL_;
   set FreqStats;
   call symputx("NumObs", N);         
   call symputx("TotalChiSq", _PCHI_);
run;
 
/* compute the proportion of chi-square statistic that is contributed
   by each cell in the one-way table */
data Chi2;
   set FreqOut;
   ExpectedFreq = &NumObs * TestPercent / 100;
   Deviation = Frequency - ExpectedFreq;
   ChiSqContrib = Deviation**2 / ExpectedFreq;  /* (O - E)^2 / E */
   ChiSqPropor = ChiSqContrib / &TotalChiSq;    /* proportion of chi-square contributed by this cell */
   format ChiSqPropor 5.3;
run;
 
proc print data=Chi2; 
   var Party Frequency TestPercent ExpectedFreq Deviation ChiSqContrib ChiSqPropor; 
run;

The table shows the numbers used to compute the chi-square statistic. For each category of the PARTY variable, the table shows the expected frequencies, the deviations from the expected frequencies, and the chi-square term for each category. The last column is the proportion of the total chi-square statistic for each category. You can see that the 'Dem' category contributes the greatest proportion. The interpretation is that the observed count of the 'Dem' group is much greater than expected and this is the primary reason why the null hypothesis is rejected.

You can also create a bar chart that shows the contributions to the chi-square statistic. You can create the "chi-square contribution plot" by using the following statements:

title "Proportion of Chi-Square Statistic for Each Category";
proc sgplot data=Chi2;
   vbar Party / response=ChiSqPropor datalabel=ChiSqPropor;
   xaxis discreteorder=data;
   yaxis label="Proportion of Chi-Square Statistic" grid;
run;
Contribution of each cell to the chi-square statistic

The bar chart makes it clear that the frequency of the 'Dem' group is the primary factor in the size of the chi-square statistic. The "chi-square contribution plot" is a visual companion to the Deviation Plot, which is produced automatically by PROC FREQ when you specify the PLOTS=DEVIATIONPLOT option. The Deviation Plot shows whether the counts for each category are more than expected or less than expected. When you combine the two plots, you can make pronouncements like Goldilocks:

  • The 'Dem' group contributes the most to the chi-square statistic because the observed counts are "too big."
  • The 'Indep' group contributes a moderate amount because the counts are "too small."
  • The remaining groups do not contribute much because their counts are "just right."

Summary

In summary, this article addresses three topics related to testing the proportions of counts in a one-way frequency table. You can use the TESTP= option to specify the proportions for the null hypothesis. Be sure that you specify the proportions in the same order that they appear in the OneWayFreqs table. (The ORDER=DATA option is sometimes useful for this.) If the data proportions do not fit the null hypothesis, you might want to know why. One way to answer this question is to compute the contributions of each category to the total chi-square computation. This article shows how to display that information in a table or in a bar chart.

The post Chi-square tests for proportions in one-way tables appeared first on The DO Loop.