A statistical programmer asked how to simulate event-trials data for groups. The subjects in each group have a different probability of experiencing the event. This article describes one way to simulate this scenario. The simulation is similar to simulating from a mixture distribution. This article also shows three different ways to visualize the results of the simulation.

### Modeling the data-generation process

A simulation is supposed to reproduce a data-generation process. Typically, a simulation is based on some real data. You (the programmer) need to ensure that the simulation reflects how the real data are generated. For example, there are often differences between simulating a designed experiment and simulating an observational study:

• In a designed experiment, the number of subjects in each group is determined by the researcher. For example, a researcher might choose 100 subjects, 50 men and 50 women. If so, each simulated data sample should also contain 50 men and 50 women.
• In an observational study, the number of subjects in each group is a random variable that is based on the proportion of the population in each group. For example, a researcher might choose 100 subjects at random. In the data, there might be 53 men and 47 women, but that split is the result of random chance. Each simulated data sample should generate the number of males and females according to a random binomial variable. Some samples might have a 52:48 split, others a 49:51 split, and so on. In general, the group sizes are simulated from a multinomial distribution.

For other ways to choose the number of subjects in categories, see the article, "Simulate contingency tables with fixed row and column sums in SAS."
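The distinction between the two designs can be sketched in a few lines of Python. This is an illustrative analogue, not part of the article's SAS code; the numpy calls, seed, and 50:50 proportions are my own:

```python
import numpy as np

rng = np.random.default_rng(2023)   # hypothetical seed

# Designed experiment: group sizes are fixed by the researcher
designed_sizes = np.array([50, 50])             # e.g., 50 men and 50 women

# Observational study: group sizes are random, drawn from a multinomial
# distribution by using the population proportions (0.5, 0.5 here)
observed_sizes = rng.multinomial(100, [0.5, 0.5])

# Both designs use 100 subjects, but only the observational split is random
assert designed_sizes.sum() == observed_sizes.sum() == 100
```

Rerunning the multinomial draw with different seeds produces splits such as 52:48 or 49:51, which is the behavior described in the second bullet.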

### Data that specify events and trials

Suppose you have the following data from a pilot observational study. Subjects are classified into six groups based on two factors. One factor has three levels (for example, political party) and another has two levels (for example, sex). For each group, you know the number of subjects (Ni) and the number of events (Ei). You can therefore estimate the proportion of events in the i-th group as Ei / Ni. Because you know the total number of subjects (Sum(N)), you can also estimate the probability that an individual belongs to each group (πi = Ni / Sum(N)).

The following SAS DATA step defines the proportions for each group:

```sas
%let NTotal = 107;
data Events;
length Group $3;
input Cat1 $ Cat2 $ Events N;
Group = catx(" ", Cat1, Cat2);  /* combine two factors into group ID */
p  = Events / N;                /* estimate P(event | Group[i]) */
pi = N / &NTotal;               /* estimate P(subject in Group[i]) */
drop Cat1 Cat2;
format p pi Best5.;
datalines;
A F  1 10
A M  8 24
B F  4 13
B M 12 36
C F 12 16
C M  6  8
;

proc print data=Events; run;
```

This pilot study has 107 subjects. The first group ("A F") contained 10 subjects, which is π1 = 9.3% of the total. In the first group, one person experienced the event, which is p1 = 10% of the group. The other rows of the table are similar. Some people call this "event-trials" data because the quantity of interest is the proportion of events that occurred among the subjects. The next section shows how to use these estimates to simulate new data samples.

### Simulate group sizes and proportions in groups

One of the advantages of simulation studies is that you can take preliminary data from a pilot study and "scale it up" to create larger samples, assuming that the statistics from the pilot study are representative of the parameters in the population. For example, the pilot study involved 107 subjects, but you can easily simulate samples of size 250 or 1000 or more.

The following SAS/IML program simulates samples of size 250. The statistics from the pilot study are used as parameters in the simulation. For example, the simulation assumes that 0.093 of the population belongs to the first group, and that 10% of the subjects in the first group experience the event of interest. Notice that some groups will have a small number of subjects (such as Group 1 and Group 6, which have small values for π). Among the groups, some will have a small proportion of events (Group 1) whereas others will have a large proportion (Group 5 and Group 6).
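The same two-stage draw (multinomial group sizes, then binomial event counts per group) can be sketched in Python. This is an analogue of the SAS/IML program, not the article's code; the numpy functions and seed are my own, while the pilot-study counts are taken from the table above:

```python
import numpy as np

rng = np.random.default_rng(12345)

# Pilot-study estimates from the article: group sizes and event counts
N      = np.array([10, 24, 13, 36, 16, 8])
events = np.array([ 1,  8,  4, 12, 12, 6])
p  = events / N          # estimate of P(event | group)
pi = N / N.sum()         # estimate of P(subject in group)

# Simulate one study with 250 subjects:
# random group sizes, then a binomial number of events in each group
n_trials = 250
size       = rng.multinomial(n_trials, pi)   # analogue of RANDMULTINOMIAL
n_event    = rng.binomial(size, p)           # one binomial draw per group
proportion = n_event / size                  # proportion of events per group
```

As in the SAS program, `size` and `n_event` are both random, so rerunning with a different seed changes both the group sizes and the observed proportions.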
The simulation uses the following features:

• The RANDMULTINOMIAL function in SAS/IML generates a random vector of sample sizes based on the π vector, which estimates the proportion of the population that belongs to each group.
• For each group, you can use the binomial distribution to randomly generate the number of events in the group.
• The proportion of events is the ratio (Number of Events) / (Group Size).

```sas
/* simulate proportions of events in groups */
proc iml;
use Events;
   read all var _ALL_;    /* creates vectors Group, Events, N, p, pi, etc. */
close;
numGroups = nrow(Group);

call randseed(12345);
nTrials = 250;      /* simulate data for a study that contains 250 subjects */
Size = T( RandMultinomial(1, nTrials, pi) );  /* new random group sizes */
nEvents = j(numGroups, 1, .);                 /* vector for number of events */
do i = 1 to nrow(nEvents);
   nEvents[i] = randfun(1, "Binomial", p[i], Size[i]);
end;
Proportion = nEvents / Size;   /* proportion of events in each group */

results = Size || nEvents || Proportion;
print results[r=Group c={'Size' 'nEvents' 'Proportion'} F=Best5.];
```

The table shows one random sample of size 250 based on the statistics in the smaller pilot study. The group size and the number of events are both random variables. If you rerun this simulation, the number of subjects in the groups will change, as will the proportion of subjects in each group that experience the event. It would be trivial to adapt the program to handle a designed experiment in which the Size vector is constant.

### Simulating multiple samples

An important property of a Monte Carlo simulation study is that it reveals the distribution of statistics that can possibly result in a future data sample. The table in the previous section shows one possible set of statistics, but we can run the simulation additional times to generate hundreds or thousands of additional variations. The following program generates 1,000 possible random samples and outputs the results to a SAS data set.
You can then graph the distribution of the statistics. This section uses box plots to visualize the distribution of the proportion of events in each group:

```sas
/* simulate 1000 random draws under this model */
SampleID = j(numGroups, 1, .);
nEvents  = j(numGroups, 1, .);   /* vector for number of events */
create Sim var {'SampleID' 'Group' 'Size' 'nEvents' 'Proportion'};
do ID = 1 to 1000;
   /* assign all elements the same value:
      https://blogs.sas.com/content/iml/2013/02/18/empty-subscript.html */
   SampleID[,] = ID;
   Size = T( RandMultinomial(1, nTrials, pi) );  /* new random group sizes */
   do i = 1 to nrow(nEvents);
      nEvents[i] = randfun(1, "Binomial", p[i], Size[i]);
   end;
   Proportion = nEvents / Size;
   append;
end;
close Sim;
QUIT;

data PlotAll;
set Sim Events(keep=Group p);
run;

title "Proportion of Events in Each Group";
title2 "Simulated N=250; NumSim=1000";
/* box plot */
proc sgplot data=PlotAll noautolegend;
   hbox Proportion / category=Group;
   scatter y=Group x=p / markerattrs=GraphData2(symbol=CircleFilled size=11);
   yaxis type=discrete display=(nolabel);   /* create categorical axis */
   xaxis grid;
run;
```

The box plot shows the distribution of the proportion of events in each group. For example, the proportion in the first group ("A F") ranges from a low of 0% to a high of 33%. The proportion in the sixth group ("C M") ranges from a low of 22% to a high of 100%. The red markers are the parameter values used in the simulation. These are the estimates from the pilot study.

### Alternative ways to visualize the distribution within groups

The box plot shows a schematic representation of the distribution of proportions within each group. However, there are alternative ways to visualize the distributions.
One way is to use a strip plot, as follows:

```sas
/* strip plot */
proc sgplot data=PlotAll noautolegend;
   scatter y=Group x=Proportion / jitter transparency=0.85  /* handle overplotting */
           markerattrs=(symbol=CircleFilled);
   scatter y=Group x=p / markerattrs=(symbol=CircleFilled size=11);
   yaxis type=discrete reverse display=(nolabel);   /* create categorical axis */
   xaxis grid;
run;
```

The strip plot uses a semi-transparent marker to display each statistic. (Again, the red markers indicate the parameters for the simulation.) The density of the distribution is visualized by the width of the strips and by the darkness of the plot, which is due to overplotting the semi-transparent markers. This plot makes it apparent that the variation between groups depends on the relative size of the groups in the pilot study. Large groups such as Group 4 ("B M") have less variation than small groups such as Group 6 ("C M"). That's because the standard error of the proportion is inversely proportional to the square root of the sample size. Thus, the smaller groups have a wider distribution than the larger groups.

You can see the same features in a panel of histograms. In the following visualization, the distribution of the proportions in each group is shown by using a histogram. Red vertical lines indicate the parameters for the simulation. This graph might be easier to explain to non-statisticians.

```sas
/* panel of histograms */
proc sgpanel data=PlotAll;
   panelby Group / onepanel layout=rowlattice novarname;
   histogram Proportion;
   refline p / axis=x lineattrs=GraphData2(thickness=2);
   colaxis grid;
run;
```

### Summary

This article shows how to simulate event-trials data in SAS. In this example, the data belong to six different groups, and the probability of experiencing the event varies between groups. The groups are also different sizes. In the simulation, the group size and the number of events in each group are random variables.
This article also shows three different ways to visualize the results of the simulation: a box plot, a strip plot, and a panel of histograms.

The post Simulate proportions for groups appeared first on The DO Loop.

A colleague spent a lot of time creating a panel of graphs to summarize some data. She did not use SAS software to create the graph, but I used SAS to create a simplified version of her graph, which is shown to the right. (The colors are from her graph.) The graph displays a panel of bar charts with uncertainty intervals. The height of each bar is the mean value of the response. I don't know which statistic she used for the "error bars." It could be one of three common statistics that visualize uncertainty: the standard deviation of the data, the standard error of the mean, or a 95% confidence interval for the mean.

Each cell in the panel shows a set of "dynamite plots," which is a common name for bar charts with error bars. Data visualization experts generally agree that dynamite plots are not the best way to summarize a data distribution. Alternatives that show the distribution of the data include box plots, strip plots, and violin plots. Even if you choose to present summarized data instead of the raw data, the dynamite plot is not the best choice. If you adhere to Tufte's advice to maximize the data-to-ink ratio, a simple dot plot with error bars conveys the same information with less ink.

This article discusses a makeover of the panel. In particular, here are some best practices for remaking this graph:

• Replace the dynamite plots with a simple dot plot.
• Exchange the axes so that the group labels can be written horizontally and the response variable is plotted along the X axis.
• Decide whether the colors improve or detract from the visualization. Try plotting the data with and without colored bars.

### Moving from bar charts to dot plots

Bar charts are best for representing frequencies and percentages.
A dot plot is a better way to display the means and error bars for data. In the SGPLOT procedure in SAS, the DOT statement will summarize the data and visualize the summary statistics. However, I do not have access to the raw data, so I use a SCATTER statement to create a panel of dot plots.

The following DATA step creates data that are similar to the results that my colleague presented. I use PROC FORMAT to create a SAS format for the categories and groups. This enables readers to reuse my code for their own visualizations, and it keeps my colleague's data private:

```sas
proc format;
value CatFmt    1 = "Category 1"   2 = "Category 2"
                3 = "Category 3"   4 = "Category 4";
value GroupFmt  1 = "Group 1"   2 = "Group 2"   3 = "Control";
run;

data Have;
format Category CatFmt. Group GroupFmt.;
input Category Group Mean W;
Lower = Mean - W;
Upper = Mean + W;
datalines;
1 1 35    2
1 2 55    5
1 3 70    9
2 1 55   10
2 2 30    7
2 3 10    4
3 1  5.5  3
3 2  8.2  4
3 3  7.2  1
4 1 24.5  7
4 2 27   15
4 3 12   11
;
```

The makeover uses the following features of SAS graphics:

• You can use PROC SGPANEL to create the panel of plots. The PANELBY statement specifies the categorical variable to use for the cells. You can use options on this statement to control the layout for the cells and the appearance of the cell headers.
• The SCATTER statement plots the mean within each category. Use the GROUP= option to offset the means for each group. Use the XERRORLOWER= and XERRORUPPER= options to plot error bars. This plot uses the default colors for distinguishing groups. The colors are determined by the current ODS style. For the HTMLBlue style, the colors are dark shades of blue, red, and green. According to my colleague, she chose the purple-pink-yellow palette of colors "for aesthetic reasons." In a subsequent section, I show how to override the default colors for groups.
• The COLAXIS and ROWAXIS statements are equivalent to the XAXIS and YAXIS statements in PROC SGPLOT.
They enable you to add grid lines, control labels and tick marks, and modify other axis-related properties.

The following call to PROC SGPANEL shows one way to remake my colleague's graph as a panel of dot plots:

```sas
ods graphics / width=480px height=400px;
title "Paneled Makeover Plot";
title2 "Horizontal Dot Plot with Error Bars";
footnote J=L "Without Colored Bands";
proc sgpanel data=Have noautolegend;
   panelby Category / novarname layout=rowlattice rows=4;
   scatter x=Mean y=Group / xerrorlower=Lower xerrorupper=Upper
           errorbarattrs=(thickness=2) errorcapscale=0.5
           markerattrs=(symbol=CircleFilled) group=Group;
   colaxis grid display=(nolabel);
   rowaxis grid display=(nolabel) type=discrete reverse
           offsetmin=0.2 offsetmax=0.2;
run;
```

If you turn your head sideways, you can see that this graph displays exactly the same information as the original graph. However, it is easier to compare the means and intervals within and across categories. The group labels are displayed horizontally, which means that this layout will support longer group labels without needing to rotate the text. Furthermore, you can use the row-based layout even if there are additional categories or groups. The graph will be longer, but it will be the same width. That means that the graph will fit on one sheet of paper for many groups or categories. In contrast, the original layout gets wider as more categories are added, which might make it hard to fit the panel on a printed page in portrait mode.

### Adding colored bands to a dot plot

In the previous graph, colors are used to distinguish the markers and lines for each group. This enables you to easily track the same group across different categories. The author of the original graph chose light colors for the bars, but light colors are not good choices for lines and markers. However, if those colors are important to the story that you want to tell, you can add colored bands to the background of the graph.
You can use the STYLEATTRS statement to tell the procedure what colors to use for groups. You can use the HIGHLOW statement to create colored bands, which is a tip I learned from my colleague Sanjay Matange. To use the HIGHLOW statement, you need to add new variables to the data to indicate the minimum and maximum values for the bands. The following statements create an alternative visualization:

```sas
/* set minimum and maximum values for colored bands */
data Want / view=Want;
set Have;
xmin=0;  xmax=80;
run;

footnote J=L "With Colored Bands";
proc sgpanel data=Want noautolegend;
   styleattrs datacolors=(Lavender Salmon LightYellow);
   panelby Category / novarname layout=rowlattice rows=4;
   /* plot colored bands in the background */
   highlow y=Group low=xmin high=xmax / group=Group type=bar
           groupdisplay=cluster barwidth=1.0 clusterwidth=0.9
           transparency=0.5 fill nooutline;
   scatter x=Mean y=Group / xerrorlower=Lower xerrorupper=Upper
           markerattrs=(symbol=CircleFilled color=Black)
           errorbarattrs=(thickness=2 color=Black) errorcapscale=0.5;
   colaxis grid display=(nolabel);
   rowaxis grid display=(nolabel) type=discrete reverse
           offsetmin=0.2 offsetmax=0.2;
run;
```

Since the colored bands identify the groups, I used black to display the means and error bars. The information in this graph is the same as for the previous graph, but this graph emphasizes the colors more. In general, I prefer the graph without the color bands, but there might be situations in which colors enhance the presentation of the data.

### Summary

In summary, this article shows how to remake a panel of dynamite plots, which are bar charts with error bars. The article shows that you can display the same information by using a dot plot with error bars. Furthermore, it is often helpful to rotate the plot so that the response variable is plotted horizontally instead of vertically. This redesign can support many groups and categories because adding more categories makes the graph grow taller, not wider.
Lastly, the article shows how to add colored bands in the background of a dot plot.

### Appendix: Create the original graph

For completeness, the following SAS code creates the panel that is shown at the top of this article:

```sas
title "Paneled Dynamite Plot";
title2 "Vertical Bar Charts with Error Bars";
footnote;
proc sgpanel data=Have noautolegend;
   styleattrs datacolors=(CX6f6db2 CXe18dac CXe9ef7a);  /* colors of original graph */
   panelby Category / novarname layout=columnlattice columns=4;
   vbarbasic Group / response=Mean transparency=0.5 group=Group;
   scatter x=Group y=Mean / yerrorlower=Lower yerrorupper=Upper
           markerattrs=(size=0) errorbarattrs=(color=Black);
   colaxis grid display=(nolabel) fitpolicy=rotate;
   rowaxis grid display=(nolabel);
run;
```

The post Remaking a panel of dynamite plots appeared first on The DO Loop.

This article shows how to create a "sliced survival plot" for proportional-hazards models that are created by using PROC PHREG in SAS. Graphing the result of a statistical regression model is a valuable way to communicate the predictions of the model. Many SAS procedures use ODS graphics to produce graphs automatically or upon request. The default graphics from SAS procedures are sufficient for most tasks. However, occasionally I customize or modify the graphs that are produced by SAS procedures. The graph to the right is an example of a customized sliced fit plot from PROC PHREG. This article shows how to create that plot and others like it.

It is worth mentioning that SAS supports multiple procedures for analyzing survival data. Of these, PROC LIFETEST has the most extensive methods for customizing graphics. An entire chapter of the SAS/STAT User's Guide is devoted to customizing the Kaplan-Meier plot, which is a popular graph in the life sciences.

### Sliced fit plots in SAS regression procedures

Before discussing PROC PHREG, let's review the concepts for a sliced fit plot for a regression model.
In a "sliced fit plot," you display the predicted values for the model (typically versus a continuous covariate) for specific values of the other continuous covariates. Often the mean value is used for slicing, but other popular choices include the quartiles: the 25th percentile, the median, and the 75th percentile.

My favorite way to create fit plots is to use the EFFECTPLOT SLICEFIT statement in PROC PLM. By default, the EFFECTPLOT statement evaluates the continuous covariates at their mean values, but you can use the SLICEBY= option or the AT= option to specify other values. PROC PLM and the EFFECTPLOT statement were designed primarily for generalized linear models. If you try to use the EFFECTPLOT statement for other kinds of regression models (or for nonparametric regressions), you might get a warning:

WARNING: The EFFECTPLOT statement cannot be used for the specified model.

In that case, you can manually create a sliced fit plot by scoring the model on a carefully prepared data set.

### Example data and the default behavior of PROC PHREG

PROC PHREG can create graphs automatically, so let's start by looking at the default survival plot. The following data are from an example in the PROC PHREG documentation. The goal is to predict the survival time (TIME and VSTATUS) of patients with multiple myeloma based on the measured values (log-transformed) of blood urea nitrogen (BUN).
The example in the documentation analyzes nine variables, but for clarity, I include only two explanatory variables in the following data and examples:

```sas
data Myeloma;
input Time VStatus LogBUN Frac @@;
label Time='Survival Time' VStatus='0=Alive 1=Dead';
datalines;
 1.25 1 2.2175 1    1.25 1 1.9395 1    2.00 1 1.5185 1
 2.00 1 1.7482 1    2.00 1 1.3010 1    3.00 1 1.5441 0
 5.00 1 2.2355 1    5.00 1 1.6812 0    6.00 1 1.3617 0
 6.00 1 2.1139 1    6.00 1 1.1139 1    6.00 1 1.4150 1
 7.00 1 1.9777 1    7.00 1 1.0414 1    7.00 1 1.1761 1
 9.00 1 1.7243 1   11.00 1 1.1139 1   11.00 1 1.2304 1
11.00 1 1.3010 1   11.00 1 1.5682 0   11.00 1 1.0792 1
13.00 1 0.7782 1   14.00 1 1.3979 1   15.00 1 1.6021 1
16.00 1 1.3424 1   16.00 1 1.3222 1   17.00 1 1.2304 1
17.00 1 1.5911 0   18.00 1 1.4472 0   19.00 1 1.0792 1
19.00 1 1.2553 1   24.00 1 1.3010 1   25.00 1 1.0000 1
26.00 1 1.2304 1   32.00 1 1.3222 1   35.00 1 1.1139 1
37.00 1 1.6021 0   41.00 1 1.0000 1   41.00 1 1.1461 1
51.00 1 1.5682 1   52.00 1 1.0000 1   54.00 1 1.2553 1
58.00 1 1.2041 1   66.00 1 1.4472 1   67.00 1 1.3222 1
88.00 1 1.1761 0   89.00 1 1.3222 1   92.00 1 1.4314 1
 4.00 0 1.9542 0    4.00 0 1.9243 0    7.00 0 1.1139 1
 7.00 0 1.5315 0    8.00 0 1.0792 1   12.00 0 1.1461 0
11.00 0 1.6128 1   12.00 0 1.3979 1   13.00 0 1.6628 0
16.00 0 1.1461 0   19.00 0 1.3222 1   19.00 0 1.3222 1
28.00 0 1.2304 1   41.00 0 1.7559 1   53.00 0 1.1139 1
57.00 0 1.2553 0   77.00 0 1.0792 0
;
```

Suppose you want to predict survival time by using LogBUN as the only explanatory variable. The following call to PROC PHREG builds the model and creates a survival plot:

```sas
/* A COVARIATES= data set is not specified, so the average value
   of LogBUN is used in the survival plot */
proc phreg data=Myeloma plots(overlay)=survival;  /* use plots(overlay CL) to get confidence limits */
   model Time*VStatus(0)=LogBUN;
run;
```

Notice that the title of the plot specifies that the survivor function is evaluated at LogBUN=1.3929. Why this strange value? Because this program did not use the BASELINE statement to specify a COVARIATES= data set.
Consequently, the average value of LogBUN is used in the survival plot. You can use PROC MEANS to confirm the mean value of the LogBUN variable and to output other statistics such as the sample quartiles:

```sas
proc means data=Myeloma Q1 Median Mean Q3;
   var LogBUN;
run;
```

The next section shows how to specify the quartiles as slicing values for LogBUN when you create a survival plot.

### Specify slicing values for a survival plot

PROC PHREG has a special syntax for producing a sliced fit plot. On the BASELINE statement, use the COVARIATES= option to name a SAS data set in which you have specified values at which to slice the covariates. If you do not specify a data set, or if your data set does not list some of the variables in the model, then the model is evaluated at the mean values of the continuous variables and at the reference levels of the CLASS variables.

Suppose you want to use the PHREG model to compare the probability of survival for three hypothetical patients whose LogBUN values are 1.15, 1.32, and 1.57, respectively. These values are the sample quartiles of the LogBUN distribution for patients in the study. The following DATA step creates a SAS data set that contains the quartile values. If you add the BASELINE statement and the COVARIATES= option, then PROC PHREG automatically incorporates those values into the survival graph that it creates:

```sas
data InRisks;
length Id $9;
input Id LogBUN;
datalines;
Q1     1.15
Median 1.32
Q3     1.57
;

proc phreg data=Myeloma plots(overlay)=survival;  /* use plots(overlay CL) to get confidence limits */
   model Time*VStatus(0)=LogBUN;
   baseline covariates=InRisks / rowid=Id;
run;
```

The survival graph now displays three curves, one for each specified value of the LogBUN variable. The curves are labeled by using the ROWID= option on the BASELINE statement. If you omit the option, the values of the LogBUN variable are displayed.
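For intuition about why one baseline computation yields all three curves: under a proportional-hazards model, each sliced curve is a power of the baseline survival curve, S(t | x) = S0(t)^exp(β(x − x̄)). The following Python sketch illustrates that relationship; the β coefficient and baseline curve values are made up for illustration and are not the fitted values from PROC PHREG:

```python
import numpy as np

# Hypothetical baseline survival curve S0(t), evaluated at the mean covariate value
S0 = np.array([1.0, 0.9, 0.75, 0.6, 0.4, 0.2])
beta, x_mean = 1.7, 1.3929   # made-up coefficient; 1.3929 is the mean LogBUN from the article

def cox_survival(S0, beta, x, x_mean):
    """Shift the baseline curve for covariate value x: S(t|x) = S0(t)**exp(beta*(x - x_mean))."""
    return S0 ** np.exp(beta * (x - x_mean))

# Curves for the three quartile slices of LogBUN used in the article
curves = {q: cox_survival(S0, beta, q, x_mean) for q in (1.15, 1.32, 1.57)}

# With beta > 0, a larger LogBUN gives a uniformly lower survival curve
assert np.all(curves[1.57] <= curves[1.15])
```

This is why specifying three rows in the COVARIATES= data set produces three ordered curves from a single fitted model.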

### Create your own custom survival plot

This article was motivated by a SAS programmer who wanted to customize the survival plot. For simplicity, let's suppose that you want to modify the title, add a grid, increase the width of the lines, and add an inset. In that case, you can use the OUT= option to write the relevant predicted values to a SAS data set. The SURVIVAL= option specifies the name for the variable that contains the survival probabilities. However, if you use the keyword _ALL_, then the procedure outputs estimates, standard errors, and confidence limits, and provides default names. For the _ALL_ keyword, "Survival" is used for the name of the variable that contains the probability estimates. You can then use PROC SGPLOT to create a custom survival plot, as follows:

```sas
/* write output to a SAS data set */
proc phreg data=Myeloma noprint;
   model Time*VStatus(0)=LogBUN;
   baseline covariates=InRisks out=PHOut survival=_all_ / rowid=Id;
run;

title "Customized Survival Curves";
proc sgplot data=PHOut;
   *band x=Time lower=LowerSurvival upper=UpperSurvival / group=Id transparency=0.75;
   step x=Time y=Survival / group=Id lineattrs=(thickness=2);
   inset ("Q1:"="LogBUN=1.15" "Median:"="LogBUN=1.32" "Q3:"="LogBUN=1.57") / border opaque;
   xaxis grid;
   yaxis grid;
run;
```

### Including a classification variable

The Myeloma data contains a categorical variable (Frac) that indicates whether the patient had bone fractures at the time of diagnosis. If you want to include that variable in the analysis, you can do so. You can add that variable to the COVARIATES= data set to control which values are overlaid on the survival plot, as follows:

```sas
data InRisks2;
length Id $9;
input Id LogBUN Frac;
datalines;
Q1:0 1.15 0
Q1:1 1.15 1
Q3:0 1.57 0
Q3:1 1.57 1
;

proc phreg data=Myeloma plots(overlay)=survival;
   class Frac(ref='0');
   model Time*VStatus(0)=LogBUN Frac;
   baseline covariates=InRisks2 out=PHOut2 survival=_all_ / rowid=Id;
run;
```

This process can be extended to other covariates in the model. When you add new variables to the model, you can also add them to the COVARIATES= data set, if you want.

### Summary

PROC PHREG supports a convenient way to create a sliced survival plot. You can create a SAS data set that contains the names and values of covariates in the model. Variables that are not in the data set are assigned their mean values (if continuous) or their reference values (if categorical). The procedure evaluates the model at those values to create a survival plot for each hypothetical (or real!) patient who has those characteristics. If you use the PLOTS= option on the PROC PHREG statement, the procedure will create graphs for you. Otherwise, you can use the OUT= option on the BASELINE statement to write the relevant variables to a SAS data set so that you can create your own graphs.

The post Sliced survival graphs in SAS appeared first on The DO Loop.

In a previous article, I discussed a beautiful painting called "Phantom's Shadow, 2018" by the Nigerian-born artist, Odili Donald Odita. I noted that if you overlay a 4 x 4 grid on the painting, then each cell contains a four-bladed pinwheel shape. The cells display rotations and reflections of the pinwheel. The figure at the right overlays the grid on Odita's painting. The cell in the upper right (marked R0) displays one pinwheel. When I first saw Odita's artwork, I thought that all 16 cells were a rotation or reflection of this one pinwheel. However, it turns out that Odita's work contains a mathematical twist. To generate all the pinwheels in the image, you need to use two different colorings of the pinwheel.
The first coloring is (in counterclockwise order) teal, orange, blue, and salmon. The second coloring is teal, salmon, blue, and orange. You cannot get the second colored pinwheel by rotating or reflecting the first.

### Rotations and reflections of a pinwheel

The "fundamental pinwheel" (R0) is displayed to the left. As I showed in the previous article, you can generate this pinwheel by starting with a four-sided polygon and rotating it around the origin. For simplicity, you can start with the coordinates of the pinwheel. The following SAS data set contains the coordinates of the pinwheel vertices. As shown in the previous article, you can use the POLYGON statement in PROC SGPLOT to visualize the pinwheel.

```sas
data Shape;
input ID x y @@;
datalines;
0  0  0   0  0.667  0.333   0  1  1   0  0  1   0  0  0
1  0  0   1 -0.333  0.667   1 -1  1   1 -1  0   1  0  0
2  0  0   2 -0.667 -0.333   2 -1 -1   2  0 -1   2  0  0
3  0  0   3  0.333 -0.667   3  1 -1   3  1  0   3  0  0
;
```

If you ignore the colors, the pinwheel shape has rotational and reflectional symmetries. Let's examine how this pinwheel looks when it is rotated or reflected under the symmetries of the square. The symmetries form a finite group of order eight, which is known as the dihedral group of the square, or D4. The group has a representation in terms of eight 2 x 2 orthogonal matrices. The following SAS/IML program defines a function that transforms a planar set of points according to one of the matrices in the dihedral group. The program reads the pinwheel coordinates into a data matrix and transforms the pinwheel according to each element of D4.
The results are plotted by using PROC SGPANEL:

```sas
proc iml;
/* actions of the D4 dihedral group */
start D4Action(v, act);
   /* the subgroup of rotations by multiples of 90 degrees */
   if      act=0 then M = { 1  0,  0  1};
   else if act=1 then M = { 0 -1,  1  0};
   else if act=2 then M = {-1  0,  0 -1};
   else if act=3 then M = { 0  1, -1  0};
   /* the subgroup of reflections across horz, vert, or diagonal lines */
   else if act=4 then M = {-1  0,  0  1};
   else if act=5 then M = { 0  1,  1  0};
   else if act=6 then M = { 0 -1, -1  0};
   else if act=7 then M = { 1  0,  0 -1};
   return( v*M );   /* apply the matrix to each row vector of v */
finish;

/* read in the pinwheel shape */
use Shape;
   read all var {x y} into P;
   read all var "ID";
close;

/* write out the transformation of the pinwheel under the D4 actions */
OpNames = {"Identity" "R1" "R2" "R3" "S0" "S1" "S2" "S3"};
Name = OpNames[1];
Q = {. . .};
create Panel from Name Q[c={'Name' 'ID' 'x' 'y'}];
do i = 0 to 7;
   R = D4Action(P, i);
   Q = ID || R;
   Name = j(nrow(Q), 1, OpNames[i+1]);
   append from Name Q;
end;
close;
QUIT;

ods graphics / width=720px height=480px;
/* for convenience, define macros for the COLAXIS and ROWAXIS options */
%macro colOpts;
   colaxis offsetmin=0 offsetmax=0 display=(nolabel noticks novalues);
%mend;
%macro rowOpts;
   rowaxis offsetmin=0 offsetmax=0 display=(nolabel noticks novalues);
%mend;

/* the color macro variables (&gray, &teal, ...) are defined in the previous article */
title "Dihedral Group (D4) Actions on Pinwheel Figure";
title2 "Colors=(Teal, Orange, Blue, Salmon)";
proc sgpanel data=Panel noautolegend;
   styleattrs wallcolor=&gray datacolors=(&teal &orange &blue &salmon);
   panelby Name / columns=4 onepanel;
   polygon x=x y=y ID=ID / group=ID fill;
   %colOpts;
   %rowOpts;
run;
```

The labels for each cell indicate the element of D4 that transforms the original pinwheel. The first row represents rotations by 0, 90, 180, and 270 degrees. The second row represents a vertical reflection (S0), a reflection across a diagonal line (S1), a reflection across a different diagonal (S2), and a horizontal reflection (S3).
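The group structure behind the eight matrices is easy to verify numerically. The following Python sketch (my own check, independent of the SAS/IML code) generates D4 from one rotation and one reflection and confirms that the eight elements are distinct and closed under multiplication:

```python
import numpy as np
from itertools import product

R = np.array([[0, -1], [1, 0]])   # rotation by 90 degrees
S = np.array([[-1, 0], [0, 1]])   # reflection across the vertical axis
I = np.eye(2, dtype=int)

# Generate D4: the rotations R^k and the reflections S @ R^k, k = 0..3
group  = [np.linalg.matrix_power(R, k) for k in range(4)]
group += [S @ np.linalg.matrix_power(R, k) for k in range(4)]

# All eight elements are distinct orthogonal matrices ...
keys = {tuple(M.flatten()) for M in group}
assert len(keys) == 8
for M in group:
    assert np.array_equal(M @ M.T, I)   # orthogonality

# ... and the set is closed under matrix multiplication
for A, B in product(group, group):
    assert tuple((A @ B).flatten()) in keys
```

Because the matrices are orthogonal, applying any of them to the pinwheel coordinates preserves lengths and angles, which is why every cell in the panel shows a congruent pinwheel.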
As mentioned earlier, I assumed that these eight images would be sufficient to reproduce the cells in Odita's artwork, but I was wrong. Let's look at the 16 cells in Odita's image and label each cell according to the elements of the D4 dihedral group. I was surprised to discover that only half of the cells are transformations of the upper-left pinwheel. Furthermore, the four cells in the upper-left corner are arranged in the same order as the four cells in the lower-right corner. Out of the eight possible transformations of the pinwheel under the D4 group, Odita uses only four. The remaining cells either have the colors in a different order or have the "vanes" of the pinwheel pointing in a different direction.

### A second generator

Mathematically, Odita's pattern has a second pinwheel "generator." You could choose any of the unlabeled cells to be the generator of the remaining cells, but I will choose the cell in the second row and third column, which is displayed to the left. For this pinwheel, the vanes point in the same direction as in the first pinwheel, but the colors (in counter-clockwise order) are teal, salmon, blue, and orange. Notice that the salmon and orange colors have switched their order. In terms of the SAS data set, you can create this new colored pinwheel by switching the ID variable for the observations that have ID=1 and ID=3. You can then create a panel of images for this new pinwheel under the dihedral transformations, as follows:

/* switch the colors of the 2nd and 4th vanes */
data Panel2;
set Panel;
if ID=1 then ID=3;
else if ID=3 then ID=1;
run;

title "Dihedral Group (D4) Actions on Pinwheel Figure";
title2 "Colors=(Teal, Salmon, Blue, Orange)";
proc sgpanel data=Panel2 noautolegend;
   styleattrs wallcolor=&gray datacolors=(&teal &orange &blue &salmon);
   panelby Name / columns=4 onepanel;
   polygon x=x y=y ID=ID / group=ID fill;
   %colOpts;  %rowOpts;
run;

Let's use a prime to distinguish the elements of this second panel.
That is, the first row represents the elements R0′, R1′, R2′, and R3′. The second row represents the elements S0′, S1′, S2′, and S3′. We can use these labels to identify the remaining cells in Odita's work, as follows: The remaining eight cells display six of the possible patterns for the dihedral actions on the second generator. The S0′ and S2′ actions are repeated. The S1′ and S3′ actions do not appear.

### Using SAS to generate mathematical art

Now that we have identified all the cells in Odita's painting, we could use PROC SGPANEL in SAS to create a mathematical reproduction of his work. Or we can use SAS to create new Odita-inspired images. I will do the latter. The following statements stack the panels of the dihedral group for the two fundamental pinwheels:

data Pinwheels;
length Name $20;
set Panel Panel2;
Group = floor((_N_-1)/20);
run;

ods graphics / width=640px height=640px;
title "Dihedral's Shadow";
proc sgpanel data=Pinwheels noautolegend;
   styleattrs wallcolor=&gray datacolors=(&teal &orange &blue &salmon);
   panelby Group / columns=4 onepanel noheader;
   polygon x=x y=y ID=ID / group=ID fill;
   %colOpts;  %rowOpts;
run;

In homage to Odita, I name this mathematical image "Dihedral's Shadow." The first two rows show the actions (rotations and reflections) of the D4 dihedral group on a pinwheel shape whose colors are in the sequential order "teal, orange, blue, and salmon." The last two rows show the dihedral actions on a pinwheel shape whose colors are in the order "teal, salmon, blue, and orange."

Notice that this mathematical image does not share some of the aesthetic properties of Odita's work. In Odita's image, the colors of the polygons are chosen so that the four neighbors (up-down and left-right) of a polygon are different colors from the polygon itself. My creation does not have that property. For example, if you think of the image as an 8 x 8 grid and traverse the first row, you see two teal polygons followed by two salmon polygons followed by two blue polygons. In Odita's work, no row has two adjacent polygons that have the same color.

I don't like the three orange polygons in the middle of my image. Odita arranged his pinwheel images to avoid neighboring polygons that have the same color. I suppose I could try arranging the pinwheels in a different order.

### Summary

There is much more that could be said about the mathematical structures in Odita's work, "Phantom’s Shadow, 2018." In this article, I show how you can analyze components of the painting as reflections and rotations of two four-color pinwheel-shaped figures. Mathematically, these images can be generated by using orthogonal matrices. The images can be assembled into a lattice by using PROC SGPANEL in SAS.

This article shows how to create a purely mathematical image: the dihedral actions on two pinwheel shapes. It arranges the images without consideration of how a pinwheel looks adjacent to its neighbors. In a triumph of art over mathematics, I think Odita's art is more pleasing than my imitation.

Would you like to use SAS to create mathematical art? You can download the SAS program that I used to create the images in this article. I challenge my readers to use the Pinwheels data set to create their own mathematical works. You can use one of the following ideas, or explore your own ideas:

• The design of experiments uses arrays to describe factors that are varied in the experiment. Use these pinwheel images to create a visual representation of an experimental design. The factors are rotation (4 levels), reflection (4 levels), and color-sequence (2 levels).
• Rearrange the cells in "Dihedral's Shadow" to reduce the number of adjacent polygons that have the same color. You can do this manually, but extra points if you use math or SAS/OR software to minimize the number of adjacent polygons that have the same color!
• Use random number generation to create a random arrangement of the pinwheels.
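For the second challenge, you need a way to score an arrangement. The following Python sketch (the function name and toy grid are mine, not from the original program) counts how many pairs of horizontally or vertically adjacent cells share a color label; an Odita-style arrangement would score zero at the polygon level:

```python
def same_color_adjacencies(grid):
    """Count pairs of horizontally or vertically adjacent cells
    that share the same color label."""
    rows, cols = len(grid), len(grid[0])
    count = 0
    for i in range(rows):
        for j in range(cols):
            if j + 1 < cols and grid[i][j] == grid[i][j+1]:
                count += 1   # horizontal neighbor matches
            if i + 1 < rows and grid[i][j] == grid[i+1][j]:
                count += 1   # vertical neighbor matches
    return count

# A toy 2x2 arrangement: the two 'T' (teal) cells in the first row touch.
print(same_color_adjacencies([["T", "T"], ["O", "B"]]))  # 1
```

A rearrangement scheme (manual, or an optimization in SAS/OR) could use this count as the objective function to minimize.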

I have created a thread on the SAS Support Communities that you can use to post your own creations. Have fun!

The post Use SAS to create mathematical art appeared first on The DO Loop.

This article shows how to estimate and visualize a two-dimensional cumulative distribution function (CDF) in SAS. SAS has built-in support for this computation. Although the bivariate CDF is not used as much as the univariate CDF, the bivariate version is still a useful tool in understanding the probable values of a random variate for a two-dimensional distribution.

Recall that the univariate CDF at the value x is the probability that a random observation from a distribution is less than x. In symbols, if X is a random variable, then the distribution function is $F_X(x) = P(X\leq x)$. The two-dimensional CDF is similar, but it gives the probability that two random variables are both less than specified values. If X and Y are random variables, the bivariate CDF for the joint distribution is given by $F_{XY}(x,y) = P(X \leq x ~\mbox{and}~ Y \leq y)$.
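To make the definition concrete, here is a minimal Python sketch (with made-up data values) that estimates $F_{XY}(x,y)$ from a sample: the estimate is simply the proportion of observations that fall in the lower-left quadrant determined by the point (x, y):

```python
def bivariate_ecdf(xs, ys, x, y):
    """Empirical estimate of F(x, y) = P(X <= x and Y <= y):
    the proportion of sample points with X <= x AND Y <= y."""
    n = len(xs)
    return sum(1 for xi, yi in zip(xs, ys) if xi <= x and yi <= y) / n

xs = [1, 2, 3, 4]            # toy data for illustration
ys = [10, 20, 30, 40]
print(bivariate_ecdf(xs, ys, 2, 25))   # 0.5: points (1,10) and (2,20) qualify
print(bivariate_ecdf(xs, ys, 4, 40))   # 1.0: all points qualify
```

The kernel-based estimate discussed later is a smoothed version of this step-function estimate.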

### Univariate estimates of the CDF

Let's start with a one-dimensional example before exploring the topic in higher dimensions. For univariate probability distributions, statisticians use both the density function and the cumulative distribution function. They each contain the same information, and if you know one you can find the other. The cumulative distribution (CDF) of a continuous distribution function is the integral of the probability density function (PDF); the density function is the derivative of the cumulative distribution.

In terms of data analysis, a histogram is an estimate of a density function. So is a kernel density estimate. Consequently, there are three popular methods for estimating a CDF from data: compute the empirical CDF directly, integrate a histogram estimate of the density, or integrate a kernel density estimate.

Let's compare the first and last methods to estimate the CDF. For data, use the MPG_City variable from the Sashelp.Cars data.

/* Use PROC UNIVARIATE to compute the empirical CDF from data */
proc univariate data=sashelp.Cars;
   var MPG_City;
   cdfplot MPG_City;
   ods select CDFPlot;
   ods output CDFPlot=UNIOut;
run;

/* Use PROC KDE to estimate the CDF from a kernel density estimate */
proc kde data=sashelp.Cars;
   univar MPG_City / CDF out=KDEOut;
run;

/* combine the estimates and plot on the same graph */
data All;
   set UNIOut(rename=(ECDFY=ECDF ECDFX=X))
       KDEOut(rename=(distribution=kerCDF Value=X));
   ECDF = ECDF / 100;   /* convert from percent to proportion */
   keep X ECDF kerCDF;
run;

title "Estimates of CDF";
proc sgplot data=All;
   label X = "MPG_City";
   series x=X y=ECDF / legendlabel="Empirical CDF";
   series x=X y=kerCDF / legendlabel="Kernel CDF";
   yaxis label="Cumulative Distribution" grid;
   xaxis values=(5 to 60 by 5) grid valueshint;
run;

The two estimates are very close. The empirical CDF is more "jagged," whereas the estimate from the kernel density is smoother. You could use either curve to estimate quantiles or probabilities.
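For readers who want to see what is happening under the hood, the following Python sketch (an illustration on made-up data, not the PROC KDE algorithm itself) contrasts the two estimators. The empirical CDF is a step function, whereas a kernel CDF estimate with a normal kernel is the average of normal CDFs centered at the data points; the bandwidth value here is arbitrary, whereas PROC KDE chooses one automatically:

```python
import math

def ecdf(data, x):
    """Empirical (step-function) CDF at x."""
    return sum(1 for xi in data if xi <= x) / len(data)

def kernel_cdf(data, x, h):
    """Kernel estimate of the CDF at x: the average of normal CDFs
    centered at each data point with bandwidth h. This is the
    integral of a Gaussian kernel density estimate."""
    Phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return sum(Phi((x - xi) / h) for xi in data) / len(data)

data = [12, 15, 17, 18, 20, 22, 25, 29, 35, 60]   # toy "MPG-like" values
# In the middle of the data, the smooth estimate is close to the jagged one:
print(round(ecdf(data, 21), 3), round(kernel_cdf(data, 21, 2.0), 3))
```

As in the SAS graph, the two curves agree closely except that the kernel version smooths out the jumps of the step function.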

### Bivariate estimates of the CDF

Now onto the bivariate case. For bivariate data, you cannot use PROC UNIVARIATE (obviously, from the name!), but you can still use PROC KDE. Instead of using the UNIVAR statement, you can use the BIVAR statement to compute a bivariate KDE. Let's look at the joint distribution of the MPG_City and Weight variables in the Sashelp.Cars data:

title;
proc sgplot data=sashelp.Cars;
   scatter x=MPG_City y=Weight;
run;

The scatter plot shows that the density of these points is concentrated near the lower-left corner of the plot. You can use PROC KDE to visualize the density. The PLOTS= option can create a contour plot and a two-dimensional histogram, as follows:

title "2-D Histogram";
proc kde data=sashelp.Cars;
   bivar MPG_City Weight / plots=(Contour Histogram) CDF
         out=KDEOut(rename=(Distribution=kerCDF Value1=X Value2=Y));
run;

The OUT= option writes a SAS data set that contains the estimates for the density; the CDF option adds the cumulative distribution estimates. The column names are Value1 (for the first variable on the BIVAR statement), Value2 (for the second variable), and Distribution (for the CDF). I have renamed the variables to X, Y, and kerCDF, but that is not required. You can use a heat map (or a contour plot) to visualize the bivariate CDF, as follows:

title "Estimate of 2-D CDF";
proc sgplot data=KDEOut;
   label X="MPG_City" Y="Weight" kerCDF="Kernel CDF";
   heatmapparm x=X y=Y colorresponse=kerCDF;
run;

### The CDF of bivariate normal data

The CDF is not used very often for bivariate data because the density function shows more details. In some sense, all CDFs look similar. In the previous example, the two variables are negatively correlated and are not linearly related. For comparison, the following program uses the RANDNORMAL function in SAS/IML to generate bivariate normal data that has a positive correlation of 0.6. (I have previously shown the empirical CDF for this data, and I have also drawn the graph of the bivariate CDF for the population.) The calls to PROC KDE and PROC SGPLOT are the same as for the previous example:

proc iml;
call randseed(123456);
N = 1e5;                    /* sample size */
rho = 0.6;
Sigma = ( 1  || rho) //     /* correlation matrix */
        (rho ||  1 );
mean = {0 0};
Z = randnormal(N, mean, Sigma);   /* sample from MVN(0, Sigma) */
create BivNorm from Z[c={X Y}];
append from Z;
close;
quit;

proc kde data=BivNorm;
   bivar X Y / plots=NONE CDF
         out=KDEOut(rename=(distribution=kerCDF Value1=X Value2=Y));
run;

title "Estimate of 2-D Normal CDF";
proc sgplot data=KDEOut aspect=1;
   label kerCDF="Kernel CDF";
   heatmapparm x=X y=Y colorresponse=kerCDF;
run;

If you compare the bivariate CDF for the Cars data to the CDF for the bivariate normal data, you can see differences in the range of the red region, but the overall impression is the same. The CDF is near 0 in the lower-left corner, near 1 in the upper-right corner, and is approximately 0.5 along an L-shaped curve near the middle of the data. This is the basic shape of a generic 2-D CDF, just as an S-shape or sigmoid-shaped curve is the basic shape of a generic 1-D CDF.

### Summary

You can use the CDF option on the BIVAR statement in PROC KDE to obtain an estimate for a bivariate CDF. The estimate is based on a kernel density estimate of the data. You can use the HEATMAPPARM statement in PROC SGPLOT to visualize the CDF surface. If you prefer a contour plot, you can use the graph template language (GTL) to create a contour plot.

The post Estimate a bivariate CDF in SAS appeared first on The DO Loop.

It can be frustrating to receive an error message from statistical software. In the early days of the SAS statistical graphics (SG) procedures, an error message that I dreaded was
ERROR: Attempting to overlay incompatible plot or chart types.
This error message appears when you attempt to use PROC SGPLOT to overlay two plots that have different properties. For example, you might be trying to overlay a bar chart, which requires a categorical variable, with a scatter plot or series plot, which often displays values of a continuous variable.

In SAS 9.4M3 and later, there is a simple way to avoid this error message. You can combine a bar chart with other plot types by using the VBARBASIC or HBARBASIC statements, which create a bar chart that is compatible with other "basic" plots.

### Compatibility of plot types

The SAS documentation includes an explanation of chart types, and a table that shows which plots you can overlay when you use PROC SGPLOT or PROC SGPANEL. (The doc erroneously puts HBARBASIC and VBARBASIC in the "categorization" group, but they should be in the "basic group." I have contacted the doc writer to correct the mistake.) If you try to overlay plots from different chart types, you will get the dreaded ERROR: Attempting to overlay incompatible plot or chart types.

First, let me emphasize that this error message only appears when you use PROC SGPLOT or SGPANEL to overlay the plots. If you use the Graph Template Language (GTL) and PROC SGRENDER, you do not have this restriction.

I have previously written about two important cases for which it is necessary to overlay an empirical distribution (bar chart or histogram) and a theoretical distribution, which is visualized by using a scatter plot or series plot:

My previous article shows how to overlay a bar chart and a series plot, but the example is a little complicated. The examples in the next sections are much simpler. There are two ways to combine a bar chart and a line plot: You can use the HBAR and HLINE statements, or you can use the HBARBASIC and SERIES statements. By using HBARBASIC, you can overlay a bar chart with many other plots.

### Overlay a bar chart and a line plot

Suppose you want to use a bar chart to display the average height (by age) of a sample of school children. You also want to add a line that shows the average heights in the population. Because you must use "compatible" plot types, the traditional approach is to combine the HBAR and HLINE statements, as follows. (For simplicity, I omit the legend on this graph.) The DATA step creates fake data, which are supposed to represent the national average and a range of values for the average heights.

data NationalAverage;   /* fake data for demonstration purposes */
label Average = "National Average";
input Age Average Low High;
datalines;
11 60 53 63
12 62 54 65
13 65 55 67
14 66 56 68
15 67 57 70
16 68 58 72
;

data All;
   set Sashelp.Class NationalAverage;
run;

title "Average Heights of Students in Class, by Age";
proc sgplot data=All noautolegend;
   hbar Age / response=height stat=mean;
   hline Age / response=Average markers datalabel;
run;

This is the "classic" bar chart and line plot. This syntax has been available in SAS since at least SAS 9.2. It enables you to combine multiple statements for discrete variables, such as HBAR/VBAR, HLINE/VLINE, and DOT. However, in some situations, you might need to overlay a bar chart and more complicated plots. In those situations, use the HBARBASIC or VBARBASIC graphs, as shown in the next section.

### Overlay a bar chart and plots of continuous data

The VBARBASIC and HBARBASIC statements (introduced in SAS 9.4M3) enable you to combine bar charts with one or more other "basic" plots such as scatter plots, series plots, and box plots. Like the VBAR and HBAR statements, these statements can summarize raw data. They have almost the same syntax as the VBAR and HBAR statements.

Suppose you want to combine a bar chart, a series plot, and a high-low plot. You can't use the VBAR or HBAR statements because that leads to "incompatible plot or chart types." However, you can use the VBARBASIC and HBARBASIC statements, as follows:

proc sgplot data=All noautolegend;
   hbarbasic Age / response=height stat=mean name="S" legendlabel="Class";
   series y=Age x=Average / markers datalabel=Average name="Avg"
          legendlabel="National Average";
   highlow y=Age low=Low high=High;
   keylegend "S" "Avg";
run;

Notice that the SERIES and HIGHLOW statements create "basic" graphs. To overlay these on a bar chart, use the HBARBASIC statement. In a similar way, you can overlay many other graph types on a bar chart.

### Summary

Sometimes you need to overlay a bar chart and another type of graph. If you aren't careful, you might get the error message: ERROR: Attempting to overlay incompatible plot or chart types. In SAS 9.4M3 and later, there is a simple way to avoid this error message. You can use the VBARBASIC or HBARBASIC statements to create a "basic" bar chart that is compatible with other "basic" plots.

The post Overlay other graphs on a bar chart with PROC SGPLOT appeared first on The DO Loop.

Most introductory statistics courses introduce the bar chart as a way to visualize the frequency (counts) for a categorical variable. A vertical bar chart places the categories along the horizontal (X) axis and shows the counts (or percentages) on the vertical (Y) axis. The vertical bar chart is a precursor to the histogram, which visualizes the distribution of counts for a continuous variable that has been binned.

Although bar charts are often displayed by using vertical bars, it is often advantageous to use a horizontal bar chart instead. This article discusses three situations in which a horizontal bar chart is preferable to a vertical bar chart.

### Create a horizontal bar chart in SAS

In SAS, it is easy to create a vertical or a horizontal bar chart:

• In PROC SGPLOT, you can use the HBAR statement to create a horizontal bar chart. Often you can switch from a vertical bar chart (the VBAR statement) by changing one letter: VBAR → HBAR. For pre-summarized data, you can use the HBARPARM statement.
• In PROC FREQ, you can create a bar chart by using the PLOTS=FREQPLOT option on the TABLES statement. By default, you get a vertical bar chart. Use the ORIENT=HORIZONTAL suboption to create a horizontal bar chart.

For example, the following calls to SAS procedures create vertical and horizontal bar charts. The charts show the number of patients in a study who smoke, where the smoking category has five different levels, from non-smoker to heavy smoker. This categorical variable is ordinal, so I chose to sort the data and use the DISCRETEORDER=DATA option (for PROC SGPLOT) or the ORDER=DATA option (for PROC FREQ) so that the bars appear in a logical order.

/* Sort the data by smoking status:
   See https://blogs.sas.com/content/iml/2016/06/20/select-when-sas-data-step.html */
data Heart;
set sashelp.heart;
select (Smoking_Status);
   when ('Non-smoker')        Smoking_Cat=1;
   when ('Light (1-5)')       Smoking_Cat=2;
   when ('Moderate (6-15)')   Smoking_Cat=3;
   when ('Heavy (16-25)')     Smoking_Cat=4;
   when ('Very Heavy (> 25)') Smoking_Cat=5;
   otherwise                  Smoking_Cat=.;
end;
run;

proc sort data=Heart;
   by Smoking_Cat;
run;

ods graphics / width=400px height=300px;   /* make the graphs small */

/* Standard vertical bar charts */
title "Vertical Bar Chart";
proc sgplot data=Heart;
   vbar Smoking_Status;
   xaxis discreteorder=data;   /* use data order instead of alphabetical */
   yaxis grid;
run;

proc freq data=Heart order=data;
   tables Smoking_Status / plot=FreqPlot;
run;

title "Horizontal Bar Chart";
proc sgplot data=Heart;
   hbar Smoking_Status;
   xaxis grid;
   yaxis discreteorder=data;   /* use data order instead of alphabetical */
run;

proc freq data=Heart order=data;   /* Y axis is reversed from PROC SGPLOT */
   tables Smoking_Status / plot=FreqPlot(orient=horizontal);
run;

The plots from PROC SGPLOT are displayed; the ones from PROC FREQ are similar. I intentionally made the graphs somewhat small so that the category labels for the vertical bar chart cannot be displayed without rotating or splitting the text labels.

### 3 advantages to horizontal bar charts

There are at least three advantages to using horizontal bar charts:

• Long labels for the categories are easier to display and read. Whereas the vertical bar chart must use various tricks to display the labels (such as rotating or splitting the text), the horizontal bar chart can display the category labels in a natural, easy-to-read manner. This is seen in the example above.
• Many categories are easier to display. The previous example has five categories but imagine having 20 or 50 categories. A horizontal bar chart can display the category names in a straightforward manner just by making the chart taller. This is an advantage for graphs on the printed page (in portrait mode) and for an HTML page because it is easy to scroll a web page vertically. The graph to the right shows a portion of a horizontal bar chart that has 45 categories. Each category is the name of a pair of variables and the bar chart shows the Pearson correlation between the two variables.
• Labels for many bars are easier to display without collision. It is common to label a bar with the count or percentage for that bar. For example, think about displaying the Pearson correlation coefficient next to each bar in the previous example. A horizontal chart enables you to display those values in a natural way, whereas a vertical chart of the same data does not have enough horizontal space between bars. There are ways to rotate the text, but, in most cases, the horizontal layout is both simpler to construct and easier to read.

A horizontal layout can also be helpful for labeling each segment in a stacked bar chart. An example is shown below. In practice, it is not always possible to get the labels to fit fully inside the bars, especially for categories that have few counts. However, if you have 10 or more categories, a horizontal bar chart offers a better chance of displaying segment labels inside the bars. You should experiment with both the vertical and horizontal charts to determine which is the better choice.

### Summary

SAS offers both vertical and horizontal bar charts. Vertical charts are used more often, but there are advantages to using a horizontal bar chart, especially if you are displaying many categories or categories that have long labels. This article shows how to create a horizontal bar chart and gives some situations in which the horizontal chart is preferable.

The post 3 reasons to prefer a horizontal bar chart appeared first on The DO Loop.

A previous article discusses how to interpret regression diagnostic plots that are produced by SAS regression procedures such as PROC REG. In that article, two of the plots indicate influential observations and outliers. Intuitively, an observation is influential if its presence changes the parameter estimates for the regression by "more than it should." Various researchers have developed statistics that give a precise meaning to "more than it should," and the formulas for several of the popular influence statistics are included in the PROC REG documentation.

This article discusses the Cook's D and leverage statistics. Specifically, how can you use the information in the diagnostic plots to identify which observations are influential? I show two techniques for identifying the observations. The first uses a DATA step and a formula to identify influential observations. The second technique uses the ODS OUTPUT statement to extract the same information directly from a regression diagnostic plot.

These are not the only regression diagnostic plots that can identify influential observations:

• The DFBETAS statistic estimates the effect that deleting each observation has on the estimates for the regression coefficients.
• The DFFITS statistic is a measure of how the predicted value changes when an observation is deleted. It is closely related to the Cook's D statistic.

### The data and model

As in the previous article, let's use a model that does NOT fit the data very well, which makes the diagnostic plots more interesting. The following DATA step adds a quadratic effect to the Sashelp.Thick data and also adds a variable that is used in a subsequent section to merge the data with the Cook's D and leverage statistics.

data Thick2;
set Sashelp.Thick;
North2 = North**2;     /* add quadratic effect */
Observation = _N_;     /* for merging with other data sets */
run;

### Graphs and labels

Rather than create the entire panel of diagnostic plots, you can use the PLOTS(ONLY)= option to create only the graphs for Cook's D statistic and for the studentized residuals versus the leverage. In the following call to PROC REG, the LABEL suboption requests that the influential observations be labeled. By default, the labels are the observation numbers. You can use the ID statement in PROC REG to specify a variable to use for the labels.

proc reg data=Thick2 plots(only label)=(CooksD RStudentByLeverage);
   model Thick = North North2 East;   /* can also use INFLUENCE option */
run;

The graph of the Cook's D statistic is shown above. The PROC REG documentation states that the horizontal line is at the value 4/n, where n is the number of observations in the analysis. For these data, n = 75. According to this statistic, observations 1, 4, 8, 63, and 65 are influential.

The second graph is a plot of the studentized residual versus the leverage statistic. The PROC REG documentation states that the horizontal lines are the values ±2 and the vertical line is at the value 2p/n, where p is the number of parameters, including the intercept. For this model, p = 4. The observations 1, 4, 8, 10, 16, 63, and 65 are shown in this graph as potential outliers or potential high-leverage points.
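For a simple linear regression, you can compute these statistics by hand, which clarifies what the diagnostic plots display. The following Python sketch (made-up data; the standard textbook formulas, not PROC REG's internal code) computes the leverage values and the Cook's D statistic and flags observations that exceed the 2p/n and 4/n cutoffs:

```python
# Simple linear regression y = b0 + b1*x, with the cutoffs used by PROC REG:
# leverage > 2p/n and Cook's D > 4/n (here p = 2 parameters).
def influence_stats(x, y):
    n = len(x)
    xbar, ybar = sum(x)/n, sum(y)/n
    Sxx = sum((xi - xbar)**2 for xi in x)
    b1 = sum((xi - xbar)*(yi - ybar) for xi, yi in zip(x, y)) / Sxx
    b0 = ybar - b1*xbar
    resid = [yi - (b0 + b1*xi) for xi, yi in zip(x, y)]
    p = 2
    mse = sum(e*e for e in resid) / (n - p)
    lev = [1/n + (xi - xbar)**2/Sxx for xi in x]       # hat-matrix diagonal
    cooksd = [(e*e / (p*mse)) * (h / (1 - h)**2)       # Cook's D statistic
              for e, h in zip(resid, lev)]
    return lev, cooksd

x = [1, 2, 3, 4, 5, 6, 7, 20]    # the last point is far from the others
y = [1.0, 2.1, 2.9, 4.2, 5.1, 5.9, 7.2, 9.0]
lev, cooksd = influence_stats(x, y)
n, p = len(x), 2
print([i for i, h in enumerate(lev) if h > 2*p/n])    # [7]: high leverage
print([i for i, d in enumerate(cooksd) if d > 4/n])   # [7]: influential
```

The extreme x-value gives the last observation both high leverage and a large Cook's D value, which is exactly the kind of point that the diagnostic plots label.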

From the graphs, we know that certain observations are influential. But how can highlight those influential observations in plots, print them, or otherwise analyze them? And how can we automate the process so that it works for any model on any data set?

### Manual computation of influential observations

There are two ways to determine which observations have large residuals or are high-leverage or have a large value for the Cook's D statistic. The traditional way is to use the OUTPUT statement in PROC REG to output the statistics, then identify the observations by using the same cutoff values that are shown in the diagnostic plots. For example, the following DATA step lists the observations whose Cook's D statistic exceeds the cutoff value 4/n ≈ 0.053.

/* manual identification of influential observations */
proc reg data=Thick2 plots=none;
   model Thick = North North2 East;   /* can also use INFLUENCE option */
   output out=RegOut predicted=Pred student=RStudent cookd=CookD H=Leverage;
quit;

%let p = 4;    /* number of parameters in model, including intercept */
%let n = 75;   /* number of observations used */
title "Influential (Cook's D)";
proc print data=RegOut;
   where CookD > 4/&n;
   var Observation East North Thick CookD;
run;

This technique works well. However, it assumes that you can easily write a formula to identify the influential observations. This is true for regression diagnostics. However, you can imagine a more complicated graph for which "special" observations are not so easy to identify. As shown in the next section, you can leverage (pun intended) the fact that SAS already identified the special observations for you.

### Leveraging the power of the ODS OUTPUT statement

Did you know that you can create a data set from any SAS graphic? Many SAS programmers use ODS OUTPUT to save a table to a SAS data set, but the same technique enables you to save the data underlying any ODS graph. There is a big advantage of using ODS OUTPUT to get to the data in a graph: SAS has already done the work to identify and label the important points in the graph. You don't need to know any formulas!

Let's see how this works by extracting the observations whose Cook's D statistic exceeds the cutoff value. First, generate the graphs and use ODS OUTPUT to save the data that underlie them, as follows:

/* Let PROC REG do the work. Use ODS OUTPUT to capture information */
ods exclude all;
proc reg data=Thick2 plots(only label)=(CooksD RStudentByLeverage);
   model Thick = North North2 East;
   ods output CooksDPlot=CookOut         /* data for the Cook's D graph */
              RStudentByLeverage=RSOut;  /* data for the outlier-leverage plot */
quit;
ods exclude none;

/* you need to look at the data! */
proc print data=CookOut(obs=12);
   where Observation > 0;
run;

When you output the data from a graph, you have to look at how the data are structured. Sometimes the data set has strange variable names or extra observations. In this case, the data begins with three extra observations for which the Cook's D statistic is missing. For the curious, fake observations are often used to set axes ranges or to specify the order of groups in a plot.

Notice that the CookOut data set includes a variable named Observation, which you can use to merge the CookOut data and the original data.

From the structure of the CookOut data set, you can infer that the influential observations are those for which the CooksDLabel variable is nonmissing (excepting the fake observations at the top of the data). Therefore, the following DATA step merges the output data sets and the original data. In the same DATA step, you can create other useful variables, such as a binary variable that indicates which observations have a large Cook's D statistic:

data All;
merge Thick2 CookOut RSOut;
by Observation;
/* create a variable that indicates whether the obs has a large Cook's D stat */
CooksDInf = (^missing(CooksD) & CooksDLabel^=.);
label CooksDInf = "Influential (Cook's D)";
run;

proc print data=All noobs;
   where CooksDInf > 0;
   var Observation East North Thick;
run;

proc sgplot data=All;
   scatter x=East y=North / group=CooksDInf datalabel=CooksDLabel
           markerattrs=(symbol=CircleFilled size=10);
run;

The output from PROC PRINT (not shown) confirms that observations 1, 4, 8, 63, and 65 have a large Cook's D statistic. The scatter plot shows that the influential observations are located at extreme values of the explanatory variables.

### Outliers and high-leverage points

The process to extract or visualize the outliers and high-leverage points is similar. The RSOut data set contains the relevant information. You can do the following:

1. Look at the names of the variables and the structure of the data set.
2. Merge with the original data by using the Observation variable.
3. Use one or more variables to identify the special observations.

The following call to PROC PRINT gives you an overview of the data and its structure:

/* you need to look at the data! */
proc print data=RSOut(obs=12) noobs;
   where Observation > 0;
   var Observation RStudent HatDiagonal RsByLevIndex OutLevLabel RsByLevGroup;
run;

For the RSOut data set, the indicator variable is named RsByLevIndex, which has the value 1 for ordinary observations and the value 2, 3, or 4 for influential observations. The meaning of each index value is shown in the RsByLevGroup variable, which has the corresponding values "Outlier," "Leverage," and "Outlier and Leverage" (or a blank string for ordinary observations). You can use these values to identify the outliers and influential observations. For example, you can print all influential observations or you can graph them, as shown in the following statements:

proc print data=All noobs;
   where Observation > 0 & RsByLevIndex > 1;
   var Observation East North Thick RsByLevGroup;
run;

proc sgplot data=All;
   scatter x=East y=North / markerattrs=(size=9);             /* ordinary points are not filled */
   scatter x=East y=North / group=RsByLevGroup nomissinggroup /* special points are filled */
           datalabel=OutLevLabel markerattrs=(symbol=CircleFilled size=10);
run;

The PROC PRINT output confirms that we can select the noteworthy observations. In the scatter plot, the color of each marker indicates whether the observation is an outlier, a high-leverage point, both, or neither.

### Summary

It is useful to identify and visualize outliers and influential observations in a regression model. One way to do this is to manually compute a cutoff value and create an indicator variable that indicates the status of each observation. This article demonstrates that technique for the Cook's D statistic. However, an alternative technique is to take advantage of the fact that SAS can create graphs that label the outliers and influential observations. You can use the ODS OUTPUT statement to capture the data underlying any ODS graph. To visualize the noteworthy observations, you can merge the original data and the statistics, indicator variables, and label variables.

Although it's not always easy to decipher the variable names and the structure of the data that comes from ODS graphics, this technique is very powerful. Its use goes far beyond the regression example in this article. The technique enables you to incorporate any SAS graph into a second analysis or visualization.

The post Identify influential observations in regression models appeared first on The DO Loop.

When you fit a regression model, it is useful to check diagnostic plots to assess the quality of the fit. SAS, like most statistical software, makes it easy to generate regression diagnostic plots. Most SAS regression procedures support the PLOTS= option, which you can use to generate a panel of diagnostic plots. Some procedures (most notably PROC REG and PROC LOGISTIC) support dozens of graphs that help you to evaluate the fit of the model, to determine whether the data satisfy various assumptions, and to identify outliers and high-leverage points. Diagnostic plots are most useful when the size of the data is not too large, such as fewer than 5,000 observations.

This article shows how to interpret diagnostic plots for a model that does not fit the data. For an ill-fitting model, the diagnostic plots should indicate the lack of fit.

This article discusses the eight plots in the DiagnosticsPanel plot, which is produced by several SAS regression procedures. To make the discussion as simple as possible, this article uses PROC REG to fit an ordinary least squares model to the data.

The eight plots can be classified into five groups:

1. A plot of the observed versus predicted responses. (Center of panel.)
2. Two graphs of residual values versus the predicted responses. (Upper left of panel.)
3. Two graphs that indicate outliers, high-leverage observations, and influential observations. (Upper right of panel.)
4. Two graphs that assess the normality of the residuals. (Lower left of panel.)
5. A plot that compares the cumulative distributions of the centered predicted values and the residuals. (Bottom of panel.)

This article also includes graphs of the residuals plotted against the explanatory variables.

### Create a model that does not fit the data

This section creates a regression model that (intentionally) does NOT fit the data. The data are 75 locations and measurements of the thickness of a coal seam. A naive model might attempt to fit the thickness of the coal seam as a linear function of the East-West position (the East variable) and the South-North position (the North variable). However, even a cursory glance at the data (see Figure 2 in the documentation for the VARIOGRAM procedure) indicates that the thickness is not linear in the North-South direction, so the following DATA step creates a quadratic effect. The subsequent call to PROC REG fits the model to the data and uses the PLOTS= option to create a panel of diagnostic plots. By default, PROC REG creates a diagnostic panel and a panel of residual plots. If you want to add a loess smoother to the residual plots, you can use the SMOOTH suboption of the RESIDUALPLOT option, as follows:

data Thick2;
   set Sashelp.Thick;
   North2 = North**2;   /* add quadratic effect */
run;

proc reg data=Thick2 plots=(DiagnosticsPanel ResidualPlot(smooth));
   model Thick = North North2 East;
quit;

The panel of diagnostic plots is shown. The panel of residual plots is shown later in this article. To guide the discussion, I have overlaid colored boxes around certain graphs. You can look at the graphs in any order, but I tend to look at them in the order indicated by the numbers in the panel.

### 1. The predicted versus observed response

The graph in the center (orange box) shows the quality of the predictive model. The graph plots the observed response versus the predicted response for each observation. For a model that fits the data well, the markers will be close to the diagonal line. Markers that are far from the line indicate observations for which the predicted response is far from the observed response. That is, the residual is large.

For this model, the cloud of markers is not close to the diagonal, which indicates that the model does not fit the data. You can see several markers that are far below the diagonal. These observations will have large negative residuals, as shown in the next section.

### 2. The residual and studentized residual plots

Two residual plots in the first row (purple box) show the raw residuals and the (externally) studentized residuals for the observations.

The first graph is a plot of the raw residuals versus the predicted values. Ideally, the graph should not show any pattern. Rather, it should look like a set of independent and identically distributed random variables. You can use this graph for several purposes:

1. If the residuals seem to follow a curve (as in this example), it indicates that the model is not capturing all the "signal" in the response variable. You can try to add additional effects to the model to see if you can eliminate the pattern.
2. If the residuals are "fan shaped," it indicates that the residuals do not have constant variance. Fan-shaped residuals are an example of heteroscedasticity. The canonical example of a fan-shaped graph is when small (in magnitude) residuals are associated with observations that have small predicted values. Similarly, large residuals are associated with observations that have large predicted values. Heteroscedasticity does not invalidate the model's ability to make predictions, but it violates the assumptions behind inferential statistics such as p-values, confidence intervals, and significance tests.
3. Detect autocorrelation. If the residuals are not randomly scattered, it might indicate that they are not independent. A time series can exhibit autocorrelation; spatial data can exhibit spatial correlations.
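The fan shape in item 2 can be illustrated with a small synthetic sketch (not the coal-seam data): simulate residuals whose spread grows with the predicted value, then check that the magnitude of the residuals is positively correlated with the predictions.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
x = rng.uniform(1, 10, size=n)
y = 2 * x + x * rng.normal(size=n)      # noise SD grows with x -> fan shape
X = np.column_stack([np.ones(n), x])

# ordinary least squares fit
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ beta
resid = y - pred

# a crude check for heteroscedasticity: |residual| grows with the prediction
corr = np.corrcoef(pred, np.abs(resid))[0, 1]
print(corr > 0)
```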

The second residual graph often looks similar to the plot of the raw residuals. The vertical coordinates in the second graph (labeled "RStudent") are externally studentized residuals. The PROC REG documentation includes the definition of a studentized residual. For a studentized residual, the variance for the i_th observation is estimated without including the i_th observation. If the magnitude of the studentized residual exceeds 2, you should examine the observation as a possible outlier. Thus, the graph includes horizontal lines at ±2. For these data, five observations have large negative residuals.
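The leave-one-out variance estimate does not require refitting the model n times; it has a closed-form update in terms of the ordinary residuals and the leverage values. A numpy sketch on synthetic data (with a planted outlier; this is not the coal-seam analysis):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 3                        # p = number of parameters (incl. intercept)
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)
y[0] += 8                           # plant an outlier

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
r = y - X @ beta                    # raw residuals
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # leverage values
sse = r @ r

# external variance: estimate sigma^2 without the i_th obs (closed form)
s2_i = (sse - r**2 / (1 - h)) / (n - p - 1)
rstudent = r / np.sqrt(s2_i * (1 - h))
print(np.where(np.abs(rstudent) > 2)[0])   # candidates for outliers
```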

### 3. Influential observations and high-leverage points

The two graphs in the upper right box (green) enable you to investigate outliers, influential observations, and high-leverage points. One graph plots the studentized residuals versus the leverage value for each observation. As mentioned previously, the observations whose studentized residuals exceed ±2 can be considered outliers. The leverage statistic attempts to identify influential observations. The leverage statistic indicates how far an observation is from the centroid of the data in the space of the explanatory variables. Observations far from the centroid are potentially influential in fitting the regression model. "Influential" means that deleting that observation can result in a relatively large change in the parameter estimates.

Markers to the right of the vertical line are high-leverage points. Thus, the three lines in the graph separate the observations into four categories: (1) neither an outlier nor a high-leverage point; (2) an outlier, but not a high-leverage point; (3) a high-leverage point, but not an outlier; and (4) an outlier and a high-leverage point.
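The four-way classification can be sketched as follows on synthetic data. The ±2 cutoff for RStudent comes from the text above; the 2p/n leverage cutoff is a common rule of thumb for the vertical line, stated here as an assumption.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 2
x = rng.normal(size=n)
x[-1] = 6.0                          # far from the centroid -> high leverage
X = np.column_stack([np.ones(n), x])
y = 0.5 * x + rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
r = y - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
s2_i = (r @ r - r**2 / (1 - h)) / (n - p - 1)
rstudent = r / np.sqrt(s2_i * (1 - h))   # externally studentized residuals

is_outlier = np.abs(rstudent) > 2    # horizontal lines at +/- 2
is_hilev = h > 2 * p / n             # assumed cutoff for the vertical line
for i in np.where(is_outlier | is_hilev)[0]:
    print(i, "outlier" if is_outlier[i] else "", "leverage" if is_hilev[i] else "")
```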

If there are high-leverage points in your data, you should be careful interpreting the model. Least squares models are not robust, so the parameter estimates are affected by high-leverage points.

The second graph in the green box (second row, last column) is a needle plot of the Cook's D statistic, which is another way to identify influential observations. Needles that rise above the horizontal reference line indicate influential observations.
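Cook's D combines the residual and the leverage of each observation. The following numpy sketch uses synthetic data with one planted influential point; the 4/n value for the reference line is a common convention, assumed here.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 2
x = rng.normal(size=n)
x[0] = 5.0                           # extreme in x ...
X = np.column_stack([np.ones(n), x])
y = 1 + 0.5 * x + rng.normal(size=n)
y[0] += 4                            # ... AND a large residual -> influential

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
r = y - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
mse = (r @ r) / (n - p)

# Cook's D: scaled change in the fitted values when the i_th obs is deleted
cooksd = (r**2 / (p * mse)) * (h / (1 - h)**2)
print(np.where(cooksd > 4 / n)[0])   # a common cutoff for the reference line
```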

A subsequent article includes details about how to use the Cook's D plot to identify influential observations in your data.

### 4. Normality of residuals

The graphs in the lower left (red box) indicate whether the residuals for the model are normally distributed. Normally distributed residuals are one of the assumptions of regression that are used to derive inferential statistics. The first plot is a normal quantile-quantile plot (Q-Q plot) of the residuals. If the residuals are approximately normal, the markers should be close to the diagonal line. The following table gives some guidance about how to interpret deviations from a straight line:

Shape of Q-Q Plots

| Point Pattern | Interpretation |
| --- | --- |
| all but a few points fall on a line | outliers in the data |
| left end of pattern is below the line | long tail on the left |
| left end of pattern is above the line | short tail on the left |
| right end of pattern is below the line | short tail on the right |
| right end of pattern is above the line | long tail on the right |

For the Q-Q plot of these residuals, the markers are below the diagonal line at both ends of the pattern. That means that the distribution of residuals is negatively skewed: it has a long tail on the left and a short tail on the right. This is confirmed by looking at the histogram of residuals. For this model, the residuals are not normal; the overlay of the normal density curve does not fit the histogram well.
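To see why "below the line at both ends" indicates negative skew, consider a synthetic negatively skewed sample (a reflected exponential, not the model's residuals): its extreme low quantiles fall far below the matching normal quantiles (long left tail), while its high quantiles fall short of them (short right tail).

```python
import numpy as np

rng = np.random.default_rng(4)
r = 1.0 - rng.exponential(size=5000)     # negatively skewed "residuals"
z = (r - r.mean()) / r.std()             # standardize for comparison

# compare standardized sample quantiles to standard normal quantiles
q_lo, q_hi = np.quantile(z, [0.01, 0.99])
print(q_lo)    # well below the normal 1% quantile (about -2.33): long left tail
print(q_hi)    # below the normal 99% quantile (about +2.33): short right tail

skew = np.mean(z**3)                     # sample skewness; negative here
print(skew < 0)
```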

The last graph (blue box) is the residual-fit spread plot, which I have covered in a separate article. For these data, you can conclude that the three explanatory variables account for a significant portion of the variation in the model.

### Residual plots with smoothers

Lastly, the PLOTS=RESIDUALPLOT(SMOOTH) option on the PROC REG statement creates a panel that includes the residuals plotted against the value of each explanatory variable. The SMOOTH suboption overlays a loess smoother, which is often helpful in determining whether the residuals contain a signal or are randomly distributed. The panel for these data and model is shown below:

The residuals plotted against the East variable show a systematic pattern, which indicates that the model does not adequately capture the dependence of the response on the East variable. An analyst who looks at this plot might decide to add a quadratic effect in the East direction, or perhaps to model a fully quadratic response surface by including the North*East interaction term.

### Summary

This article created a regression model that does not fit the data. The regression diagnostic panel detects the shortcomings in the regression model. The diagnostic panel also shows you important information about the data, such as outliers and high-leverage points. The diagnostic plot can help you evaluate whether the data and model satisfy the assumptions of linear regression, including normality and independence of errors.

A subsequent article will describe how to use the diagnostic plots to identify influential observations.

The post An overview of regression diagnostic plots in SAS appeared first on The DO Loop.
