Statistical Graphics

4月 142021
 

It can be frustrating to receive an error message from statistical software. In the early days of the SAS statistical graphics (SG) procedures, an error message that I dreaded was
ERROR: Attempting to overlay incompatible plot or chart types.
This error message appears when you attempt to use PROC SGPLOT to overlay two plots that have different properties. For example, you might be trying to overlay a bar chart, which requires a categorical variable, with a scatter plot or series plot, which often displays values of a continuous variable.

In SAS 9.4M3 and later, there is a simple way to avoid this error message. You can combine a bar chart with other plot types by using the VBARBASIC or HBARBASIC statements, which create a bar chart that is compatible with other "basic" plots.

Compatibility of plot types

The SAS documentation includes an explanation of chart types, and a table that shows which plots you can overlay when you use PROC SGPLOT or PROC SGPANEL. (The doc erroneously puts HBARBASIC and VBARBASIC in the "categorization" group, but they should be in the "basic group." I have contacted the doc writer to correct the mistake.) If you try to overlay plots from different chart types, you will get the dreaded ERROR: Attempting to overlay incompatible plot or chart types.

First, let me emphasize that this error message only appears when you use PROC SGPLOT or SGPANEL to overlay the plots. If you use the Graph Template Language (GTL) and PROC SGRENDER, you do not have this restriction.

I have previously written about two important cases for which it is necessary to overlay an empirical distribution (bar chart or histogram) and a theoretical distribution, which is visualized by using a scatter plot or series plot:

My previous article shows how to overlay a bar chart and a series plot, but the example is a little complicated. The examples in the next sections are much simpler. There are two ways to combine a bar chart and a line plot: You can use the HBAR and HLINE statements, or you can use the HBARBASIC and SERIES statements. By using HBARBASIC, you can overlay a bar chart with many other plots.

Overlay a bar chart and a line plot

Suppose you want to use a bar chart to display the average height (by age) of a sample of school children. You also want to add a line that shows the average heights in the population. Because you must use "compatible" plot types, the traditional approach is to combine the HBAR and HLINE statements, as follows. (For simplicity, I omit the legend on this graph.) The DATA step creates fake data, which are supposed to represent the national average and a range of values for the average heights.

data NationalAverage;         /* fake data for demonstration purposes */
label Average = "National Average";
input Age Average Low High;
datalines;
11 60 53 63
12 62 54 65
13 65 55 67
14 66 56 68
15 67 57 70
16 68 58 72
;
 
data All;
set Sashelp.Class NationalAverage;
run;
 
title "Average Heights of Students in Class, by Age";
proc sgplot data=All noautolegend;
   hbar Age / response=height stat=mean;
   hline Age / response=Average markers datalabel;
run;

This is the "classic" bar chart and line plot. This syntax has been available in SAS since at least SAS 9.2. It enables you to combine multiple statements for discrete variables, such as HBAR/VBAR, HLINE/VLINE, and DOT. However, in some situations, you might need to overlay a bar chart and more complicated plots. In those situations, use the HBARBASIC or VBARBASIC graphs, as shown in the next section.

Overlay a bar chart and plots of continuous data

The VBARBASIC and HBARBASIC statements (introduced in SAS 9.4M3) enable you to combine bar charts with one or more other "basic" plots such as scatter plots, series plots, and box plots. Like the VBAR and HBAR statements, these statements can summarize raw data. They have almost the same syntax as the VBAR and HBAR statements.

Suppose you want to combine a bar chart, a series plot, and a high-low plot. You can't use the VBAR or HBAR statements because that leads to "incompatible plot or chart types." However, you can use the VBARBASIC and HBARBASIC statements, as follows:

proc sgplot data=All noautolegend;
   hbarbasic Age / response=height stat=mean name="S" legendlabel="Class";
   series y=Age x=Average / markers datalabel=Average name="Avg" legendlabel="National Average";
   highlow y=Age low=Low high=High;
   keylegend "S" "Avg";
run;

Notice that the SERIES and HIGHLOW statements create "basic" graphs. To overlay these on a bar chart, use the HBARBASIC statement. In a similar way, you can overlay many other graph types on a bar chart.

Summary

Sometimes you need to overlay a bar chart and another type of graph. If you aren't careful, you might get the error message: ERROR: Attempting to overlay incompatible plot or chart types. In SAS 9.4M3 and later, there is a simple way to avoid this error message. You can use the VBARBASIC or HBARBASIC statements to create a "basic" bar chart that is compatible with other "basic" plots.

The post Overlay other graphs on a bar chart with PROC SGPLOT appeared first on The DO Loop.

4月 122021
 

Most introductory statistics courses introduce the bar chart as a way to visualize the frequency (counts) for a categorical variable. A vertical bar chart places the categories along the horizontal (X) axis and shows the counts (or percentages) on the vertical (Y) axis. The vertical bar chart is a precursor to the histogram, which visualizes the distribution of counts for a continuous variable that has been binned.

Although bar charts are often displayed by using vertical bars, it is often advantageous to use a horizontal bar chart instead. This article discusses three situations in which a horizontal bar chart is preferable to a vertical bar chart.

Create a horizontal bar chart in SAS

In SAS, it is easy to create a vertical or a horizontal bar chart:

  • In PROC SGPLOT, you can use the HBAR statement to create a horizontal bar chart. Often you can switch from a vertical bar chart (the VBAR statement) by changing one letter: VBAR → HBAR. For pre-summarized data, you can use the HBARPARM statement.
  • In PROC FREQ, you can create a bar chart by using the PLOTS=FREQPLOT option on the TABLES statement. By default, you get a vertical bar chart. Use the ORIENT=HORIZONTAL suboption to create a horizontal bar chart.

For example, the following calls to SAS procedures create vertical and horizontal bar charts. The charts show the number of patients in a study who smoke, where the smoking category has five different levels, from non-smoker to heavy smoker. This categorical variable is ordinal, so I chose to sort the data and use the DISCRETEORDER=DATA option (for PROC SGPLOT) or the ORDER=DATA option (for PROC FREQ) so that the bars appear in a logical order.

/* Sort the data by smoking status: See
   https://blogs.sas.com/content/iml/2016/06/20/select-when-sas-data-step.html
*/
data Heart;
set sashelp.heart;
select (Smoking_Status);
   when ('Non-smoker')        Smoking_Cat=1;
   when ('Light (1-5)')       Smoking_Cat=2;
   when ('Moderate (6-15)')   Smoking_Cat=3;
   when ('Heavy (16-25)')     Smoking_Cat=4;
   when ('Very Heavy (> 25)') Smoking_Cat=5;
   otherwise                  Smoking_Cat=.;
end;
run;
proc sort data=Heart; by Smoking_Cat; run;
 
ods graphics/ width=400px height=300px;    /* make the graphs small */
 
/* Standard vertical bar charts */
title "Vertical Bar Chart";
proc sgplot data=Heart;
   vbar Smoking_Status;
   xaxis discreteorder=data;   /* use data order instead of alphabetical */
   yaxis grid;
run;
proc freq data=Heart order=data;
   tables Smoking_Status / plot=FreqPlot;
run;
 
title "Horizontal Bar Chart";
proc sgplot data=Heart;
   hbar Smoking_Status;
   xaxis grid;
   yaxis discreteorder=data;   /* use data order instead of alphabetical */
run;
proc freq data=Heart order=data; /* Y axis is reversed from PROC SGPLOT */
   tables Smoking_Status / plot=FreqPlot(orient=horizontal);
run;

The plots from PROC SGPLOT are displayed; the ones from PROC FREQ are similar. I intentionally made the graphs somewhat small so that the category labels for the vertical bar chart cannot be displayed without rotating or splitting the text labels.

3 advantages to horizontal bar charts

There are a least three advantages to using horizontal bar charts:

  • Long labels for the categories are easier to display and read. Whereas the vertical bar chart must use various tricks to display the labels (such as rotating or splitting the text), the horizontal bar chart can display the category labels in a natural, easy-to-read manner. This is seen in the example above.
  • Many categories are easier to display. The previous example has five categories but imagine having 20 or 50 categories. A horizontal bar chart can display the category names in a straightforward manner just by making the chart taller. This is an advantage for graphs on the printed page (in portrait mode) and for an HTML page because it is easy to scroll a web page vertically. The graph to the right shows a portion of a horizontal bar chart that has 45 categories. Each category is the name of a pair of variables and the bar chart shows the Pearson correlation between the two variables.
  • Labels for many bars are easier to display without collision. It is common to label a bar with the count or percentage for that bar. For example, think about displaying the Pearson correlation coefficient next to each bar in the previous example. A horizontal chart enables you to display those values in a natural way, whereas a vertical chart of the same data does not have enough horizontal space between bars. There are ways to rotate the text, but, in most cases, the horizontal layout is both simpler to construct and easier to read.

A horizontal layout can also be helpful for labeling each segment in a stacked bar chart. An example is shown below. In practice, it is not always possible to get the labels to fit fully inside the bars, especially for categories that have few counts. However, if you have 10 or more categories, a horizontal bar chart offers a better chance of displaying segment labels inside the bars. You should experiment with both the vertical and horizontal charts to determine which is the better choice.

Summary

SAS offers both vertical and horizontal bar charts. Vertical charts are used more often, but there are advantages to using a horizontal bar chart, especially if you are displaying many categories or categories that have long labels. This article shows how to create a horizontal bar chart and gives some situations in which the horizontal chart is preferable.

The post 3 reasons to prefer a horizontal bar chart appeared first on The DO Loop.

3月 292021
 

A previous article discusses how to interpret regression diagnostic plots that are produced by SAS regression procedures such as PROC REG. In that article, two of the plots indicate influential observations and outliers. Intuitively, an observation is influential if its presence changes the parameter estimates for the regression by "more than it should." Various researchers have developed statistics that give a precise meaning to "more than it should," and the formulas for several of the popular influence statistics are included in the PROC REG documentation.

This article discusses the Cook's D and leverage statistics. Specifically, how can you use the information in the diagnostic plots to identify which observations are influential? I show two techniques for identifying the observations. The first uses a DATA step and a formula to identify influential observations. The second technique uses the ODS OUTPUT statement to extract the same information directly from a regression diagnostic plot.

These are not the only regression diagnostic plots that can identify influential observations:

  • The DFBETAS statistic estimates the effect that deleting each observation has on the estimates for the regression coefficients.
  • The DFFITS statistic is a measure of how the predicted value changes when an observation is deleted. It is closely related to the Cook's D statistic.

The data and model

As in the previous article, let's use a model that does NOT fit the data very well, which makes the diagnostic plots more interesting. The following DATA step adds a quadratic effect to the Sashelp.Thick data and also adds a variable that is used in a subsequent section to merge the data with the Cook's D and leverage statistics.

data Thick2;
set Sashelp.Thick;
North2 = North**2;   /* add quadratic effect */
Observation = _N_;   /* for merging with other data sets */
run;

Graphs and labels

Rather than create the entire panel of diagnostic plots, you can use the PLOTS(ONLY)= option to create only the graphs for Cook's D statistic and for the studentized residuals versus the leverage. In the following call to PROC REG, the LABEL suboption requests that the influential observations be labeled. By default, the labels are the observation numbers. You can use the ID statement in PROC REG to specify a variable to use for the labels.

proc reg data=Thick2  plots(only label) =(CooksD RStudentByLeverage);
   model Thick = North North2 East; /* can also use INFLUENCE option */
run;

The graph of the Cook'd D statistic is shown above. The PROC REG documentation states that the horizontal line is the value 4/n, where n is the number of observations in the analysis. For these data, n = 75. According to this statistic, observations 1, 4, 8, 63, and 65 are influential.

The second graph is a plot of the studentized residual versus the leverage statistic. The PROC REG documentation states that the horizontal lines are the values ±2 and the vertical line is at the value 2p/n, where p is the number of parameters, including the intercept. For this model, p = 4. The observations 1, 4, 8, 10, 16, 63, and 65 are shown in this graph as potential outliers or potential high-leverage points.

From the graphs, we know that certain observations are influential. But how can highlight those influential observations in plots, print them, or otherwise analyze them? And how can we automate the process so that it works for any model on any data set?

Manual computation of influential observations

There are two ways to determine which observations have large residuals or are high-leverage or have a large value for the Cook's D statistic. The traditional way is to use the OUTPUT statement in PROC REG to output the statistics, then identify the observations by using the same cutoff values that are shown in the diagnostic plots. For example, the following DATA step lists the observations whose Cook's D statistic exceeds the cutoff value 4/n ≈ 0.053.

/* manual identification of influential observations */
proc reg data=Thick2  plots=none;
   model Thick = North North2 East; /* can also use INFLUENCE option */
   output out=RegOut predicted=Pred student=RStudent cookd=CookD H=Leverage;
quit;
 
%let p = 4;  /* number of parameter in model, including intercept */
%let n = 75; /* Number of Observations Used */
title "Influential (Cook's D)";
proc print data=RegOut;
   where CookD > 4/&n;
   var Observation East North Thick CookD;
run;

This technique works well. However, it assumes that you can easily write a formula to identify the influential observations. This is true for regression diagnostics. However, you can imagine a more complicated graph for which "special" observations are not so easy to identify. As shown in the next section, you can leverage (pun intended) the fact that SAS already identified the special observations for you.

Leveraging the power of the ODS OUTPUT statement

Did you know that you can create a data set from any SAS graphic? Many SAS programmers use ODS OUTPUT to save a table to a SAS data set, but the same technique enables you to save the data underlying any ODS graph. There is a big advantage of using ODS OUTPUT to get to the data in a graph: SAS has already done the work to identify and label the important points in the graph. You don't need to know any formulas!

Let's see how this works by extracting the observations whose Cook's D statistic exceeds the cutoff value. First, generate the graphs and use ODS OUTPUT to same the underlying data models, as follows:

/* Let PROC REG do the work. Use ODS OUTPUT to capture information */
ods exclude all;
proc reg data=Thick2  plots(only label) =(CooksD RStudentByLeverage);
   model Thick = North North2 East; 
   ods output CooksDPlot=CookOut         /* output the data for the Cook's D graph */
              RStudentByLeverage=RSOut;  /* output the data for the outlier-leverage plot */
quit;
ods exclude none;
 
/* you need to look at the data! */
proc print data=CookOut(obs=12); 
   where Observation > 0;
run;

When you output the data from a graph, you have to look at how the data are structured. Sometimes the data set has strange variable names or extra observations. In this case, the data begins with three extra observations for which the Cook's D statistic is missing. For the curious, fake observations are often used to set axes ranges or to specify the order of groups in a plot.

Notice that the CookOut data set includes a variable named Observation, which you can use to merge the CookOut data and the original data.

From the structure of the CookOut data set, you can infer that the influential observations are those for which the CooksDLabel variable is nonmissing (excepting the fake observations at the top of the data). Therefore, the following DATA step merges the output data sets and the original data. In the same DATA step, you can create other useful variables, such as a binary variable that indicates which observations have a large Cook's D statistic:

data All;
merge Thick2 CookOut RSOut;
by Observation;
/* create a variable that indicates whether the obs has a large Cook's D stat */
CooksDInf = (^missing(CooksD) & CooksDLabel^=.);
label CooksDInf = "Influential (Cook's D)";
run;
 
proc print data=All noobs;
   where CooksDInf > 0; 
   var Observation East North Thick;
run;
 
proc sgplot data=All;
   scatter x=East y=North / group=CooksDInf datalabel=CooksDLabel 
                  markerattrs=(symbol=CircleFilled size=10);
run;

The output from PROC PRINT (not shown) confirms that observations 1, 4, 8, 63, and 65 have a large Cook's D statistic. The scatter plot shows that the influential observations are located at extreme values of the explanatory variables.

Outliers and high-leverage points

The process to extract or visualize the outliers and high-leverage points is similar. The RSOut data set contains the relevant information. You can do the following:

  1. Look at the names of the variables and the structure of the data set.
  2. Merge with the original data by using the Observation variable.
  3. Use one of more variables to identify the special observations.

The following call to PROC PRINT gives you an overview of the data and its structure:

/* you need to look at the data! */
proc print data=RSOut(obs=12) noobs; 
   where Observation > 0;
   var Observation RStudent HatDiagonal RsByLevIndex outLevLabel RsByLevGroup;
run;

For the RSOut data set, the indicator variable is named RsByLevIndex, which has the value 1 for ordinary observations and the value 2, 3, or 4 for influential observations. The meaning of each index value is shown in the RsByLevGroup variable, which has the corresponding values "Outlier," "Leverage," and "Outlier and Leverage" (or a blank string for ordinary observations). You can use these values to identify the outliers and influential observations. For example, you can print all influential observations or you can graph them, as shown in the following statements:

proc print data=All noobs;
   where Observation > 0 & RsByLevIndex > 1;
   var Observation East North Thick RsByLevGroup;
run;
 
proc sgplot data=All;
   scatter x=East y=North / markerattrs=(size=9);        /* ordinary points are not filled */
   scatter x=East y=North / group=RsByLevGroup nomissinggroup /* special points are filled */
           datalabel=OutLevLabel markerattrs=(symbol=CircleFilled size=10);
run;

The PROC PRINT output confirms that we can select the noteworthy observations. In the scatter plot, the color of each marker indicates whether the observation is an outlier, a high-leverage point, both, or neither.

Summary

It is useful to identify and visualize outliers and influential observations in a regression model. One way to do this is to manually compute a cutoff value and create an indicator variable that indicates the status of each observation. This article demonstrates that technique for the Cook's D statistic. However, an alternative technique is to take advantage of the fact that SAS can create graphs that label the outliers and influential observations. You can use the ODS OUTPUT statement to capture the data underlying any ODS graph. To visualize the noteworthy observations, you can merge the original data and the statistics, indicator variables, and label variables.

Although it's not always easy to decipher the variable names and the structure of the data that comes from ODS graphics, this technique is very powerful. Its use goes far beyond the regression example in this article. The technique enables you to incorporate any SAS graph into a second analysis or visualization.

The post Identify influential observations in regression models appeared first on The DO Loop.

3月 242021
 

When you fit a regression model, it is useful to check diagnostic plots to assess the quality of the fit. SAS, like most statistical software, makes it easy to generate regression diagnostics plots. Most SAS regression procedures support the PLOTS= option, which you can use to generate a panel of diagnostic plots. Some procedures (most notably PROC REG and PROC LOGISTIC) support dozens of graphs that help you to evaluate the fit of the model, to determine whether the data satisfy various assumptions, and to identify outliers and high-leverage points. Diagnostic plots are most useful when the size of the data is not too large, such as less than 5,000 observations.

This article shows how to interpret diagnostic plots for a model that does not fit the data. For an ill-fitting model, the diagnostic plots should indicate the lack of fit.

This article discusses the eight plots in the DiagnosticsPanel plot, which is produced by several SAS regression procedures. To make the discussion as simple as possible, this article uses PROC REG to fit an ordinary least squares model to the data.

The eight plots can be classified into five groups:

  1. A plot of the observed versus predicted responses. (Center of panel.)
  2. Two graphs of residual values versus the predicted responses. (Upper left of panel.)
  3. Two graphs that indicate outliers, high-leverage observations, and influential observations. (Upper right of panel.)
  4. Two graphs that assess the normality of the residuals. (Lower left of panel.)
  5. A plot that compares the cumulative distributions of the centered predicted values and the residuals. (Bottom of panel.)

This article also includes graphs of the residuals plotted against the explanatory variables.

Create a model that does not fit the data

This section creates a regression model that (intentionally) does NOT fit the data. The data are 75 locations and measurements of the thickness of a coal seam. A naive model might attempt to fit the thickness of the coal seam as a linear function of the East-West position (the East variable) and the South-North position (the North variable). However, even a cursory glance at the data (see Figure 2 in the documentation for the VARIOGRAM procedure) indicates that the thickness is not linear in the North-South direction, so the following DATA step creates a quadratic effect. The subsequent call to PROC REG fits the model to the data and uses the PLOT= option to create a panel of diagnostic plots. By default, PROC REG creates a diagnostic panel and a panel of residual plots. If you want to add a loess smoother to the residual plots, you can use the SMOOTH suboption to the RESIDUALPLOT option, as follows:

data Thick2;
set Sashelp.Thick;
North2 = North**2;   /* add quadratic effect */
run;
 
proc reg data=Thick2
         plots =(DiagnosticsPanel ResidualPlot(smooth));
   model Thick = North North2 East;
quit;

The panel of diagnostic plots is shown. The panel of residual plots is shown later in this article. To guide the discussion, I have overlaid colored boxes around certain graphs. You can look at the graphs in any order, but I tend to look at them in the order indicated by the numbers in the panel.

1. The predicted versus observed response

The graph in the center (orange box) shows the quality of the predictive model. The graph plots the observed response versus the predicted response for each observation. For a model that fits the data well, the markers will be close to the diagonal line. Markers that are far from the line indicate observations for which the predicted response is far from the observed response. That is, the residual is large.

For this model, the cloud of markers is not close to the diagonal, which indicates that the model does not fit the data. You can see several markers that are far below the diagonal. These observations will have large negative residuals, as shown in the next section.

2. The residual and studentized residual plots

Two residual plots in the first row (purple box) show the raw residuals and the (externally) studentized residuals for the observations.

The first graph is a plot of the raw residuals versus the predicted values. Ideally, the graph should not show any pattern. Rather, it should look like a set of independent and identically distributed random variables. You can use this graph for several purposes:

  1. If the residuals seem to follow a curve (as in this example), it indicates that the model is not capturing all the "signal" in the response variable. You can try to add additional effects to the model to see if you can eliminate the pattern.
  2. If the residuals are "fan shaped," it indicates that the residuals do not have constant variance. Fan-shaped residuals are an example of heteroscedasticity. The canonical example of a fan-shaped graph is when small (in magnitude) residuals are associated with observations that have small predicted values. Similarly, large residuals are associated with observations that have large predicted values. Heteroscedasticity does not invalidate the model's ability to make predictions, but it violates the assumptions behind inferential statistics such as p-values, confidence intervals, and significance tests.
  3. Detect autocorrelation. If the residuals are not randomly scattered, it might indicate that they are not independent. A time series can exhibit autocorrelation; spatial data can exhibit spatial correlations.

The second residual graph often looks similar to the plot of the raw residuals. The vertical coordinates in the second graph (labeled "RStudent") are externally studentized residuals. The PROC REG documentation includes the definition of a studentized residual. For a studentized residual, the variance for the i_th observation is estimated without including the i_th observation. If the magnitude of the studentized residual exceeds 2, you should examine the observation as a possible outlier. Thus, the graph includes horizontal lines at ±2. For these data, five observations have large negative residuals.

3. Influential observations and high-leverage points

The two graphs in the upper right box (green) enable you to investigate outliers, influential observations, and high-leverage points. One graph plots the studentized residuals versus the leverage value for each observation. As mentioned previously, the observations whose studentized residuals exceed ±2 can be considered outliers. The leverage statistic attempts to identify influential observations. The leverage statistic indicates how far an observation is from the centroid of the data in the space of the explanatory variables. Observations far from the centroid are potentially influential in fitting the regression model. "Influential" means that deleting that observation can result in a relatively large change in the parameter estimates.

Markers to the right of the vertical line are high-leverage points. Thus, the three lines in the graph separate the observations into four categories: (1) neither an outlier nor a high-leverage point; (2) an outlier, but not a high-leverage point; (3) a high-leverage point, but not an outlier; and (4) an outlier and a high-leverage point.

If there are high-leverage points in your data, you should be careful interpreting the model. Least squares models are not robust, so the parameter estimates are affected by high-leverage points.

The needle plot of Cook's D second graph (second row, last column) is another plot that indicates influential observations. Needles that are higher than the horizontal line are considered influential.

A subsequent article includes details about how to use the Cook's D plot to identify influential observations in your data.

4. Normality of residuals

The graphs in the lower left (red box) indicate whether the residuals for the model are normally distributed. Normally distributed residuals are one of the assumptions of regression that are used to derive inferential statistics. The first plot is a normal quantile-quantile plot (Q-Q plot) of the residuals. If the residuals are approximately normal, the markers should be close to the diagonal line. The following table gives some guidance about how to interpret deviations from a straight line:

Shape of Q-Q Plots
Point Pattern Interpretation
all but a few points fall on a line outliers in the data
left end of pattern is below the line long tail on the left
left end of pattern is above the line short tail on the left
right end of pattern is below the line short tail on the right
right end of pattern is above the line long tail on the right

For the Q-Q plot of these residuals, the markers are below the diagonal line at both ends of the pattern. That means that the distribution of residuals is negatively skewed: it has a long tail on the left and a short tail on the right. This is confirmed by looking at the histogram of residuals. For this model, the residuals are not normal; the overlay of the normal density curve does not fit the histogram well.

The spread plot

The last graph (blue box) is the residual-fit spread plot, which I have covered in a separate article. For these data, you can conclude that the three explanatory variables account for a significant portion of the variation in the model.

Residual plots with smoothers

Lastly, the PLOTS=RESIDUALPLOT(SMOOTH) option on the PROC REG statement creates a panel that includes the residuals plotted against the value of each explanatory variable. The SMOOTH suboption overlays a loess smoother, which is often helpful in determining whether the residuals contain a signal or are randomly distributed. The panel for these data and model is shown below:

The residuals plotted again the East variable shows a systematic pattern. It indicates that the model does not adequately capture the dependence of the response on the East variable. An analyst who looks at this plot might decide to add a quadratic effect in the East direction, or perhaps to model a fully quadratic response surface by including the North*East interaction term.

Summary

This article created a regression model that does not fit the data. The regression diagnostic panel detects the shortcomings in the regression model. The diagnostic panel also shows you important information about the data, such as outliers and high-leverage points. The diagnostic plot can help you evaluate whether the data and model satisfy the assumptions of linear regression, including normality and independence of errors.

A subsequent article will describe how to use the diagnostic plots to identify influential observations.

The post An overview of regression diagnostic plots in SAS appeared first on The DO Loop.

3月 242021
 

When you fit a regression model, it is useful to check diagnostic plots to assess the quality of the fit. SAS, like most statistical software, makes it easy to generate regression diagnostics plots. Most SAS regression procedures support the PLOTS= option, which you can use to generate a panel of diagnostic plots. Some procedures (most notably PROC REG and PROC LOGISTIC) support dozens of graphs that help you to evaluate the fit of the model, to determine whether the data satisfy various assumptions, and to identify outliers and high-leverage points. Diagnostic plots are most useful when the size of the data is not too large, such as less than 5,000 observations.

This article shows how to interpret diagnostic plots for a model that does not fit the data. For an ill-fitting model, the diagnostic plots should indicate the lack of fit.

This article discusses the eight plots in the DiagnosticsPanel plot, which is produced by several SAS regression procedures. To make the discussion as simple as possible, this article uses PROC REG to fit an ordinary least squares model to the data.

The eight plots can be classified into five groups:

  1. A plot of the observed versus predicted responses. (Center of panel.)
  2. Two graphs of residual values versus the predicted responses. (Upper left of panel.)
  3. Two graphs that indicate outliers, high-leverage observations, and influential observations. (Upper right of panel.)
  4. Two graphs that assess the normality of the residuals. (Lower left of panel.)
  5. A plot that compares the cumulative distributions of the centered predicted values and the residuals. (Bottom of panel.)

This article also includes graphs of the residuals plotted against the explanatory variables.

Create a model that does not fit the data

This section creates a regression model that (intentionally) does NOT fit the data. The data are 75 locations and measurements of the thickness of a coal seam. A naive model might attempt to fit the thickness of the coal seam as a linear function of the East-West position (the East variable) and the South-North position (the North variable). However, even a cursory glance at the data (see Figure 2 in the documentation for the VARIOGRAM procedure) indicates that the thickness is not linear in the North-South direction, so the following DATA step creates a quadratic effect. The subsequent call to PROC REG fits the model to the data and uses the PLOT= option to create a panel of diagnostic plots. By default, PROC REG creates a diagnostic panel and a panel of residual plots. If you want to add a loess smoother to the residual plots, you can use the SMOOTH suboption to the RESIDUALPLOT option, as follows:

data Thick2;
set Sashelp.Thick;
North2 = North**2;   /* add quadratic effect */
run;
 
proc reg data=Thick2
         plots =(DiagnosticsPanel ResidualPlot(smooth));
   model Thick = North North2 East;
quit;

The panel of diagnostic plots is shown. The panel of residual plots is shown later in this article. To guide the discussion, I have overlaid colored boxes around certain graphs. You can look at the graphs in any order, but I tend to look at them in the order indicated by the numbers in the panel.

1. The predicted versus observed response

The graph in the center (orange box) shows the quality of the predictive model. The graph plots the observed response versus the predicted response for each observation. For a model that fits the data well, the markers will be close to the diagonal line. Markers that are far from the line indicate observations for which the predicted response is far from the observed response. That is, the residual is large.

For this model, the cloud of markers is not close to the diagonal, which indicates that the model does not fit the data. You can see several markers that are far below the diagonal. These observations will have large negative residuals, as shown in the next section.

2. The residual and studentized residual plots

Two residual plots in the first row (purple box) show the raw residuals and the (externally) studentized residuals for the observations.

The first graph is a plot of the raw residuals versus the predicted values. Ideally, the graph should not show any pattern. Rather, it should look like a set of independent and identically distributed random variables. You can use this graph for several purposes:

  1. If the residuals seem to follow a curve (as in this example), it indicates that the model is not capturing all the "signal" in the response variable. You can try to add additional effects to the model to see if you can eliminate the pattern.
  2. If the residuals are "fan shaped," it indicates that the residuals do not have constant variance. Fan-shaped residuals are an example of heteroscedasticity. The canonical example of a fan-shaped graph is when small (in magnitude) residuals are associated with observations that have small predicted values. Similarly, large residuals are associated with observations that have large predicted values. Heteroscedasticity does not invalidate the model's ability to make predictions, but it violates the assumptions behind inferential statistics such as p-values, confidence intervals, and significance tests.
  3. Detect autocorrelation. If the residuals are not randomly scattered, it might indicate that they are not independent. A time series can exhibit autocorrelation; spatial data can exhibit spatial correlations.

The second residual graph often looks similar to the plot of the raw residuals. The vertical coordinates in the second graph (labeled "RStudent") are externally studentized residuals. The PROC REG documentation includes the definition of a studentized residual. For a studentized residual, the variance for the i_th observation is estimated without including the i_th observation. If the magnitude of the studentized residual exceeds 2, you should examine the observation as a possible outlier. Thus, the graph includes horizontal lines at ±2. For these data, five observations have large negative residuals.

3. Influential observations and high-leverage points

The two graphs in the upper right box (green) enable you to investigate outliers, influential observations, and high-leverage points. One graph plots the studentized residuals versus the leverage value for each observation. As mentioned previously, the observations whose studentized residuals exceed ±2 can be considered outliers. The leverage statistic attempts to identify influential observations. The leverage statistic indicates how far an observation is from the centroid of the data in the space of the explanatory variables. Observations far from the centroid are potentially influential in fitting the regression model. "Influential" means that deleting that observation can result in a relatively large change in the parameter estimates.

Markers to the right of the vertical line are high-leverage points. Thus, the three lines in the graph separate the observations into four categories: (1) neither an outlier nor a high-leverage point; (2) an outlier, but not a high-leverage point; (3) a high-leverage point, but not an outlier; and (4) an outlier and a high-leverage point.

If there are high-leverage points in your data, you should be careful interpreting the model. Least squares models are not robust, so the parameter estimates are affected by high-leverage points.

The needle plot of Cook's D second graph (second row, last column) is another plot that indicates influential observations. Needles that are higher than the horizontal line are considered influential.

A subsequent article includes details about how to use the Cook's D plot to identify influential observations in your data.

4. Normality of residuals

The graphs in the lower left (red box) indicate whether the residuals for the model are normally distributed. Normally distributed residuals are one of the assumptions of regression that are used to derive inferential statistics. The first plot is a normal quantile-quantile plot (Q-Q plot) of the residuals. If the residuals are approximately normal, the markers should be close to the diagonal line. The following table gives some guidance about how to interpret deviations from a straight line:

Shape of Q-Q Plots
Point Pattern Interpretation
all but a few points fall on a line outliers in the data
left end of pattern is below the line long tail on the left
left end of pattern is above the line short tail on the left
right end of pattern is below the line short tail on the right
right end of pattern is above the line long tail on the right

For the Q-Q plot of these residuals, the markers are below the diagonal line at both ends of the pattern. That means that the distribution of residuals is negatively skewed: it has a long tail on the left and a short tail on the right. This is confirmed by looking at the histogram of residuals. For this model, the residuals are not normal; the overlay of the normal density curve does not fit the histogram well.

The spread plot

The last graph (blue box) is the residual-fit spread plot, which I have covered in a separate article. For these data, you can conclude that the three explanatory variables account for a significant portion of the variation in the model.

Residual plots with smoothers

Lastly, the PLOTS=RESIDUALPLOT(SMOOTH) option on the PROC REG statement creates a panel that includes the residuals plotted against the value of each explanatory variable. The SMOOTH suboption overlays a loess smoother, which is often helpful in determining whether the residuals contain a signal or are randomly distributed. The panel for these data and model is shown below:

The residuals plotted again the East variable shows a systematic pattern. It indicates that the model does not adequately capture the dependence of the response on the East variable. An analyst who looks at this plot might decide to add a quadratic effect in the East direction, or perhaps to model a fully quadratic response surface by including the North*East interaction term.

Summary

This article created a regression model that does not fit the data. The regression diagnostic panel detects the shortcomings in the regression model. The diagnostic panel also shows you important information about the data, such as outliers and high-leverage points. The diagnostic plot can help you evaluate whether the data and model satisfy the assumptions of linear regression, including normality and independence of errors.

A subsequent article will describe how to use the diagnostic plots to identify influential observations.

The post An overview of regression diagnostic plots in SAS appeared first on The DO Loop.

3月 222021
 

This article shows how to use PROC SGPLOT in SAS to create the scatter plot shown to the right. The scatter plot has the following features:

  • The colors of markers are determined by the value of a third variable.
  • The outline of each marker is the same color (such as black).
  • The X axis is reversed because the X coordinate increases from east to west.

The purpose of this article is twofold. The primary purpose is to show how you can use the FILLEDOUTLINEDMARKERS option to separately control the fill colors and outline colors of markers. A secondary purpose is to remind longtime SAS programmers that a program that was written 10-15 years ago can sometimes benefit from a "facelift" by using more recent features in SAS procedures.

Is your program taking advantage of recent features?

Sometimes, I see an old SAS program and notice that it can be written more compactly by using modern features of the SAS language. I've noticed that there are even examples in the SAS documentation that can be simplified! In most cases, the documentation examples were written for an earlier version of SAS, before the modern features were implemented. These examples persist because some SAS customers are still using older versions of SAS, and because of the maxim, "if it ain't broke, don't fix it!"

When I see an old SAS program that uses the Graph Template Language (GTL), I study it to determine whether it could be rewritten to use PROC SGPLOT. In the early days of ODS graphics, PROC SGPLOT did not support as many statements and options as it does today. It was not uncommon to want to create a graph that required using GTL. As SAS released new versions (especially the early releases of SAS 9.4), many features in GTL made their way into the syntax of PROC SGPLOT.

An example of a feature that once required GTL is coloring scatter plot markers by the values of a third variable. In modern versions of PROC SGPLOT, you can use the COLORRESPONSE= option to specify a variable whose values determine the color of markers. Furthermore, you can use the COLORMODEL= option to specify a palette of colors to use for the markers. In a 2016 article, I wrote that the COLORRESPONSE= option is "one of my favorite new features in PROC SGPLOT in SAS 9.4m2."

Another useful feature is the FILLEDOUTLINEDMARKERS option (introduced in SAS 9.4 [M0]), which enables you to control the fill and outline attributes of markers separately. In this article, the fill colors are determined by using the values of a third variable. The outline color of the markers is specified by using the MARKEROUTLINEATTRS= option.

A remake of a classic scatter plot

The LOESS procedure in SAS/STAT was released in SAS 8 and was one of the first SAS/STAT procedures to incorporate ODS graphics. An example in the PROC LOESS documentation uses GTL to create a scatter plot of 179 locations and measurements of sulfate (SO4). The SAS 9.3 documentation (circa 2011) was the first to use the red-blue three-color ramp to assign the fill colors of markers based on the S04 values. The filled markers are overlaid with a second scatter plot that uses outline-only markers to create a boundary outline for the filled markers. The graph looks similar to the graph at the top of the page. You can click the documentation link to see the GTL and the 9.3 scatter plot. In contrast, the following call to PROC SGPLOT requires only two statements. (The documentation shows how to generate the SO4 data set.)

title "Sulfate Measurements";
proc sgplot data=SO4;
   scatter x=longitude y=latitude / colorresponse = SO4
               colormodel  = ThreeColorRamp
               filledoutlinedmarkers
               markerattrs = (symbol=circleFilled size=9)
               markeroutlineattrs=(color=black); /* omit this option if you want the GraphData1 color */
    xaxis reverse label="West Longitude";
run;

The graph is shown at the top of the article. The locations of these points are in the 48 contiguous US states. From the scatter plot, you can make out New England, the Great Lakes region, Texas, Florida, and other geographic features. (With more work, you can overlay the points on a map of the 48 states.) The REVERSE option in the XAXIS statement is necessary because the locations of the measurements are "west longitude," which increases from east to west.

Summary

This article shows how to use the FILLEDOUTLINEDMARKERS option in PROC SGPLOT to control the fill color and outline color of markers in a scatter plot. In this example, the COLORRESPONSE= option specifies a variable to use for color-coding. The MARKEROUTLINEATTRS= option specifies the outline color of the markers.

These features are useful, but it is equally useful to recognize that ODS statistical graphics (in particular, PROC SGPLOT) has evolved a lot since SAS 9.2 and 9.3. If you see an old program that uses GTL to create a plot, think about whether you can create the graph by using newer features in PROC SGPLOT. Because PROC SGPLOT code is easier to read and to maintain, it might be worth the effort.

What do you think? Is it a waste of time to rewrite a program that produces the correct output? Or is it beneficial to simplify and modernize old programs? Let me know your opinion in the comments.

The post Control the fill and outline colors of scatter plot markers in SAS appeared first on The DO Loop.

3月 082021
 

I recently learned about a new feature in PROC QUANTREG that was added in SAS/STAT 15.1 (part of SAS 9.4M6). Recall that PROC QUANTREG enables you to perform quantile regression in SAS. (If you are not familiar with quantile regression, see an earlier article that describes quantile regression and provides introductory references.) Whereas ordinary least-squares regression predicts the conditional mean of a continuous response variable, quantile regression predicts one or more conditional quantiles, such as the median or deciles.

The CONDDIST statement in PROC QUANTREG enables you to estimate and visualize the conditional distribution of a response variable at any X value in regression data. Furthermore, it enables you to compare the conditional distribution to the "unconditional" distribution of the response (that is, to the marginal distribution). This article shows how to use the CONDDIST statement in PROC QUANTREG.

The main idea

Before discussing quantile regression, let's recall what we mean by a conditional and marginal distribution. The following SAS DATA step simulates 250 observations in a regression model where the response variable (Y) depends on an explanatory variable (X) and random noise. A scatter plot shows the data and overlays three sets of reference lines that will be used later in this article.

data  Have;
   call  streaminit(123);
   subj=1; x=2; y=12; output;
   subj=2; x=8; y=30; output;   
   do subj = 3 to 250;
      x =  rand ('uniform',  1, 10);           /* x in interval (1, 10) */
      err = rand ('normal',   0, 2+log(3*x));
      y = 1  + 4 * x + err;                    /* Y = response variable */
      output;
   end;
run;
 
proc sql;
   select mean(x) into :xavg from Have; 
quit;
%let xavg = %sysfunc(round(&xavg,0.1));  /* round to nearest tenth */
%put &=xavg;
 
title "Joint Distribution of X and Y";
proc sgplot data=Have;
   scatter x=x y=y;
   refline 2 &xavg 8 / axis=X lineattrs=(thickness=2) label;
   xaxis grid; yaxis grid;
run;

The scatter plot shows that the conditional mean of Y at any value X is close to 1 + 4*X, which is the form of the underlying model. The data are simulated, so we know that the exact conditional distributions are normal distributions. In a subsequent section, we will use the CONDDIST statement in PROC QUANTREG to estimate the conditional distribution at X=2, X=8, and at the average X value, which is X=5.7. We will compare those distributions to the overall (marginal) distribution of Y.

The graph overlays three sets of drop lines to guide the discussion of conditional distributions. The following are visual estimates:

  • From the line at X=2, it looks like the conditional distribution of Y is centered near Y=9 and most of the probability density is on the interval [2, 18].
  • From the line at X=5.7, it looks like the conditional distribution of Y is centered near Y=24 and most of the probability density is on [13, 33].
  • From the line at X=8, it looks like the conditional distribution of Y is centered near Y=33 and most of the probability density is on [23, 43].

Empirical estimates for the marginal and conditional distributions

Before using PROC QUANTREG to estimate the marginal (unconditional) and conditional densities, let's estimate by using a simpler method. The marginal distribution of Y is easy to estimate: Use the CDFPLOT statement in PROC UNIVARIATE to obtain the empirical cumulative distribution function (ECDF). You can use the HISTOGRAM statement and the KERNEL option to estimate the marginal density function (PDF).

/* The marginal distribution of Y does not consider values of X */
proc univariate data=Have;
   var y;
   cdfplot y / vscale=proportion    odstitle="Empirical Marginal CDF";
   histogram y / kernel outkernel=K odstitle="Empirical Marginal PDF";
   ods select CDFPlot Histogram;
run;

These graphs estimate the PDF and CDF of the Y variable by using the observed data, so they are empirical estimates. The kernel density estimate shows two small humps in the density near X=15 and X=30. The OUTKERNEL= option creates a SAS data set (K) that contains the kernel density estimate.

It is more difficult to estimate the conditional distribution of Y at some value of X. For definiteness, suppose we want to estimate the conditional distribution of Y at the mean of X, which is X=5.7. The simplest estimate is to choose all observations that are close to 5.7. There is no precise definition for "close to." In the following procedure, the WHERE clause limits the data to observations for which X is within 0.5 units of 5.7. I call this the "poor man's" estimation procedure because it doesn't use any analytics. As shown in the next section, quantile regression provides a more elegant estimate.

%let Threshold = 0.5;
proc univariate data=Have;
   WHERE abs(x-&xavg) < &Threshold;
   var y;
   cdfplot y / vscale=proportion       odstitle="Empirical Conditional CDF at X=5.7";
   histogram y / kernel outkernel=KAvg odstitle="Empirical Conditional PDF at X=5.7" ;
   ods select CDFPlot Histogram;
run;

Only the PDF is shown because it is easier to interpret. According to the "poor man's" estimate, the conditional distribution at X=5.7 looks approximately normal but perhaps with some positive skewness. The peak is near Y=23 and it looks like the ±3σ range is [13, 36]. These values are very close to the estimates that we guessed based on the scatter plot in the previous section.

Compare the conditional and marginal distributions

I used the OUTKERNEL= option to output the kernel density estimates to SAS data sets. To compare the marginal and conditional distributions, you can combine those data sets and overlay the estimates on a single graph, as follows:

/* Compare the marginal distribution of Y to the distribution of Y|(x is near average) */
data All;
label Dist = "Distribution";
set k(in=marginal) kAvg;
if marginal then Dist = "Marginal   ";
            else Dist = "Conditional";
run;
 
title "Empirical Marginal and ""Poor-Man's"" Conditional Distributions";
title2 "Approximation to the Conditional Distribution at x=&xavg";
proc sgplot data=All;
   series x=_Value_ y=_Density_ / group=Dist;
   xaxis grid; yaxis grid;
run;

This graph enables you to compare the overall distribution of the Y values (in blue) to the conditional distribution of the Y values near X=5.7 (in red). You can see that the conditional distribution is narrower and approximately normal.

To obtain this graph, I ran PROC UNIVARIATE twice, merged the outputs, and called PROC SGPLOT to overlay the curves. By using the CONDDIST statement in PROC QUANTREG, you can create a similar graph with one line of SAS code.

Estimate the conditional distribution at the mean of the explanatory variables

So far, the analyses have been empirical and heuristic. This section uses quantile regression to create a model of the conditional quantiles. The CONDDIST statement fits a quantile regression model for many quantiles You can invert the quantile estimates to obtain a model-based estimate of the conditional quantiles.

Let's start by visualizing several quantile regression models for these data. The following call to PROC QUANTREG fits nine models to the data. The models are for the quantiles 0.1, 0.2, ..., 0.9. The PLOT=FITPLOT option displays the nine quantile regression curves on a scatter plot of the data.

/* Quantile regression models */
proc quantreg data=Have ci=none;
   model y = x / plot=fitplot quantile=(0.1 to 0.9 by 0.1);
run;

The graph shows the nine quantile models. If you draw a vertical line at any X value (such as X=5.7) the places where the vertical line crosses the regression curves are the conditional quantiles for Y at X. You can think about plotting those Y values against the quantile values (0.1, 0.2, ..., 0.9) to obtain a model-based cumulative distribution curve.

But why stop at nine quantile models? Why not fit 100 models for the quantiles 0.005, 0.015, ..., 0.995? Then the scatter plot has 100 regression curves overlaid on it. If you draw a vertical line at X=5.7, that line will intersect the 100 regression lines. If the ordered pairs of the intersection points are (y1,0.005), (y2,0.015), ..., (y100, 0.995), then those 100 points estimate a model-based conditional distribution of Y at X=5.7.

That is essentially what the CONDDIST statement in PROC QUANTREG does. The documentation for the QUANTREG procedure contains a more mathematical description. The following call to the QUANTREG procedure uses the SHOWAVG option on the CONDDIST statement to create a CDF plot and a PDF plot. Each plot overlays the observed marginal distribution of Y and the conditional distribution at the average of the explanatory variables, which is X=5.7. To make the comparison easier, the marginal distribution is evaluated at the same quantiles as the conditional distribution.

/* The conditional distribution of response at the average of X is created 
   automatically by using the CONDIST statement in PROC QUANTREG. The SHOWAVG 
   option displays the conditional distribution of Y at the mean of X. */
%let HideAll = HideDropLines HideDropNumbers HideObsDots HideObsLabels;
proc quantreg data=Have ci=none;
   model y = x;
   conddist plot(&HideAll ShowGrids)=(cdfplot pdfplot) showavg; 
run;

The CONDDIST statement creates two plots, but only the PDF plot is shown here. Notice the similarity to the earlier graph that compared the kernel density estimates from PROC UNIVARIATE. In both graphs, the blue curve represents the marginal density estimate for Y. The red curve estimates the conditional density at X=5.7. In the previous graph, the estimate was crude and was obtained by throwing out a lot of the data and using only observations near X=5.7. In the current graph, the model-based estimate is formed by using all the data and by fitting 100 different quantile regression models. In both graphs, the conditional PDF has a peak near Y=23, is approximately normal, and the bulk of the density is supported on [13, 36].

Estimate the conditional distribution at a data value

The SHOWAVG option to the CONDDIST statement estimates the conditional distribution of Y at the average value of the explanatory variables. However, you can use the OBS= option to specify values in the data at which to estimate the conditional distribution. When I created the simulated data for this example, I intentionally hard-coded the first two observations as (2,9) and (8, 33). The following call to PROC QUANTREG estimates the conditional distribution at X=2 and X=8:

/* The conditional distribution of response at the average of X is created 
   automatically by using the CONDDIST stastement in PROC QUANTREG */
proc quantreg data=Have ci=none;
   model y = x;
   conddist plot(ShowGrids)=(pdfplot) obs=(1 2) ;
run;

Again, the marginal distribution of Y is overlaid as a comparison. To help interpret the curves, PROC QUANTSELECT adds drop lines to the graph (by default). The vertical drop line specifies the Y value for the observation (or the mean of the Ys for the marginal density, which is shown in blue).

  • The estimate of the conditional distribution at X=2 is shown in red. The Y value of the data point is Y=12, which is larger than the most likely value (which is approximately Y=10). The distribution appears to be normally distributed with the bulk of the density on [0,18].
  • The estimate of the conditional distribution at X=8 is shown in green. The Y value of the data point is Y=30, which is smaller than the most likely value (which is approximately Y=34). The distribution appears to approximately normal, but maybe there is some positive skewness? The bulk of the density is on [20,46].

Why is this useful?

Why is this useful? Suppose you are measuring blood pressure of patients. Some of the patients are overweight and others are not. If you set Y=BloodPressure and X=BMI (body-mass index), then a graph like this enables you to compare the distribution of blood pressures for all patients (the marginal distribution), for healthy-weight individuals (BMI=20), and for obese individuals (BMI=30). In fact, for ANY value of the BMI, you can compare the conditional distribution to the marginal distribution and/or to other conditional distributions.

And although I have presented an example with only one continuous variable, you can do the same thing for models that contain multiple explanatory variables, both continuous and CLASS variables.

I have left out a lot of details, but I hope I have whetted your appetite to learn more about the CONDDIST statement in PROC QUANTREG. The documentation for the QUANTREG procedure has additional information and additional examples. It also shows how you can fit the regression on one set of data (the training data) and apply the models to estimate "counterfactual" distributions for new data (the test data).

The post The conditional distribution of a response variable appeared first on The DO Loop.

3月 012021
 

I recently wrote about a simple statistical formula that approximates the wind chill temperature, which is the cumulative effect of air temperature and wind on the human body. The formula uses two independent variables (air temperature and wind speed) to predict the wind chill temperature. This article describes how to use SAS to create line charts and heat maps that visualize the formula. For the heat map, I show how to overlay numerical values and to choose colors for the numbers so that the text is more readable.

An example of a wind chill chart is shown below. You can use the techniques in this article to visualize any response surface.

A formula for wind chill

The wind chill calculations that are used currently in the US, Canada, and the UK are based on a 2001 revision of an earlier wind chill table. The numbers in the wind chill table are based on equations from physics and mathematical modeling. However, you can fit a simple linear regression model that includes the main effects and an interaction term.

The formula for the wind chill temperature in US units is
\(WCT_{US} = 35.74 + 0.6215*T_F - 35.75*V_{mph}^{0.16} + 0.4275*T_F*V_{mph}^{0.16}\)
where TF is the air temperature in degrees Fahrenheit, and Vmph is the wind speed in miles per hour.

The formula in SI units is
\(WCT_{SI} = 13.12 + 0.6215*T_C - 11.37*V_{kph}^{0.16} + 0.3965*T_C*V_{kph}^{0.16}\)
where TC is the air temperature in degrees Celsius, and Vkph is the wind speed in kilometers per hour.

Visualize slices of the wind chill surface

Given a function of two variables, f(x1, x2), I often visualize the graph of the function by using two techniques. The first technique is to use a sliced fit plot. To create a sliced fit plot, specify a sequence of values for one variable and then overlay the sequence of graphs as a function of the remaining variable. For example, if you choose the values x1=0, 10, 20, and 30, you would overlay the graphs of the functions of x2: f(0, x2), f(10, x2), f(20, x2), and f(30, x2).

The following SAS DATA step evaluates the wind chill function (in US units) at a regular grid of values for the air temperature and the wind speed. The subsequent call to PROC SGPLOT overlays four curves that show the wind chill temperature as a function of the wind speed:

/* Wind chill formula for US customary units.
   The formula gives the apparent temperature (F) from 
   ambient temperature (F) and wind speed (mph).
   Source NOAA: https://www.weather.gov/safety/cold-wind-chill-chart    
*/
data WindChillUS;
label T = "Ambient Temperature (F)"
      V = "Wind Speed (mph)"
      WCT = "Wind Chill Temperature (F)";
do T = 40 to -40 by -5;
   do V = 5 to 40 by 5;
      WCT = 35.74 + 0.6215*T - 35.75*V**0.16 + 0.4275*T*V**0.16;
      output;
   end;
end;
run;
 
ods graphics / width=640px height=480px;
title "Wind Chill Chart (US Units)";
title2 "Dependence on Wind Speed";
footnote J=L "Source NOAA: https://www.weather.gov/safety/cold-wind-chill-chart";
proc sgplot data=WindChillUS;
   label T = "Temperature (F)";
   where T in (30 20 10 0);
   series x=V y=WCT / group=T lineattrs=(thickness=2);
   keylegend / position=E ;
   xaxis grid values=(-40 to 40 by 5) valueshint; 
   yaxis grid values=(-80 to 40 by 5) valueshint;
run;

For any air temperature, the apparent temperature decreases as a function of the wind speed. Stronger winds make it seem colder. Notice that the curves are not parallel, which indicates an interaction between temperature and wind speed. For example, if the air temperature is 30° F, a 25-mph wind makes it seem about 13 degrees colder. However, if the air temperature is 0° F, a 25-mph wind makes it seem about 24 degrees colder.

Create a heat map of the wind chill surface

A second way to visualize a response surface is to create a heat map. To be more useful, you can overlay the wind chill temperatures on the heat map.

When I look at the wind-chill chart that is provided by the National Weather Service (NWS) (shown at the right), I notice three things:

  • The NWS chart contains only four colors. The colors correspond to the danger of frostbite for exposed skin.
  • The temperature axis (horizontal dimension) is reversed. The extremely cold temperatures are on the right side of the graph and the temperatures decrease from left to right. The wind speed axis is also reversed.
  • The NWS chart uses light colors for moderately cold temperatures and dark colors for extremely cold temperatures. Black text is overlaid on the light colors and white text is overlaid on the dark colors. This way, the color of the text contrasts with the color of the background, which makes the text more readable.

I decided to deviate from the NWS chart in two ways. First, instead of four colors, I decided to use a spectrum of colors that represent the wind chill temperature, not the risk of frostbite. Second, I decided to use the traditional orientation of the axes, where temperatures increase from left to right and wind speeds increase from bottom to top. However, I followed the NWS example for choosing the color of the text.

You can use a trick to get the text to display as either white or black. The following DATA step assigns a binary variable according to whether the temperature is less than some threshold value. If you use the TEXT statement in PROC SGPLOT to display the text, you can use the COLORRESPONSE= option and a white-to-black color ramp to generate text that is white or black according to whether the wind chill temperature is above or below the threshold value. (You could also use the GROUP= option and the STYLEATTRS statement.) To be effective, you also need to choose the color ramp so that lighter colors are used above the threshold and darker colors are used below the threshold. Like the NWS, I use dark purples and blues for extremely cold temperatures, but I added lighter colors (off-white, yellow, and orange) for the milder temperatures.

The following SAS program generates the wind chill chart in US units. Notice the trick that generates white/black text for different temperatures. The chart is shown at the top of this article.

title "Wind Chill Chart (US Units)";
footnote J=L "Source NOAA: https://www.weather.gov/safety/cold-wind-chill-chart";
 
/* use threshold to decide between white text and dark text */
data WindChillUS2;
set WindChillUS;
textColor = (WCT > -19.5);  /* use to color text */
run;
 
proc sgplot data=WindChillUS2;
   format WCT 3.0;
   heatmapparm x=T y=V colorresponse=WCT / name="w"
      colormodel=(DarkPurple Purple DarkBlue DarkCyan LightGray LightYellow LightOrange);
   /* trick: white text on a dark background and dark text on a light background */
   text x=T y=V text=WCT / colorresponse=textColor colormodel=(white black) textattrs=(weight=bold);
   gradlegend "w";
   xaxis values=(-40 to 40 by 5) valueshint; 
   yaxis values=(-100 to 40 by 5) valueshint;
run;

Creating a wind chill chart in SI units

Because most of the world uses SI (metric) units, the following program repeats the process for temperatures in Celsius and wind speeds in kilometers per hour (kph). The range of temperatures and wind speeds are approximately the same as for the US chart.

data WindChillSI;
label TC = "Ambient Temperature (C)"
      VK = "Wind Speed (kph)"
      WCT = "Wind Chill Temperature (C)";
do TC = 6 to -39 by -3;
   do VK = 5 to 65 by 5;
      WCT = 13.12 + 0.6215*TC - 11.37*VK**0.16 + 0.3965*TC*VK**0.16;
      output;
   end;
end;
 
/* use white text on a dark background and dark text on a light background */
data WindChillSI2;
set WindChillSI;
textColor = (WCT > -29.5);  /* use to color text */
run;
 
ods graphics / width=640px height=480px;
title "Wind Chill Chart (SI Units)";
proc sgplot data=WindChillSI2;
   format WCT 3.0;
   heatmapparm x=TC y=VK colorresponse=WCT / name="w"
      colormodel=(DarkPurple Purple DarkBlue DarkCyan LightGray LightYellow LightOrange);
   text x=TC y=VK text=WCT / colorresponse=textColor colormodel=(white black) textattrs=(weight=bold);
   gradlegend "w";
   xaxis values=(-39 to 6 by 3) valueshint; 
   yaxis values=(5 to 100 by 5) valueshint;
run;

Summary

This article shows how to create a wind chill chart in SAS. One technique slices the response surface and overlays the one-dimensional graph. Another creates a heat map of the wind chill temperature and overlays text. The colors of the heat map and the text are chosen so that the text is readable on both light and dark backgrounds. You can use the techniques in this article to visualize any response surface.

You can download the complete SAS program that creates these graphs.

The post Create a wind chill chart in SAS appeared first on The DO Loop.

2月 242021
 

In cold and blustery conditions, the weather forecast often includes two temperatures: the actual air temperature and the wind chill temperature. The wind chill temperature conveys the cumulative effect of air temperature and wind on the human body. The goal of the wind-chill scale is to communicate the effect of the wind by giving an equivalent hypothetical scenario in which the temperature is colder but there is no wind. For example, if the wind chill temperature is 0 degrees, it means that the outdoor air temperature and wind combine to make you feel as cold as if it was a calm, clear, night with a temperature of 0°.

This article briefly discusses wind chill calculations and presents a statistical formula for predicting wind chill temperatures. It then creates two wind chill charts by using the formula. The first chart (shown below; click to enlarge) is for US customary units (temperatures in Fahrenheit and wind speeds in miles per hour (mph)). A second chart is shown for SI units (temperatures in Celsius and wind speeds in kilometers per hour (kph)). The end of the article links to information about how scientists developed the wind chill chart. For readers who are interested in how the charts were made, I link to a SAS program.

What is wind chill?

A challenge for meteorologists is how to communicate the risk of severe weather. The wind chill table was developed to communicate the cumulative risk of extreme cold and windy conditions. The table displays the "wind chill equivalent temperature" (WCT or WCET), for a grid of values of air temperature and wind speed.

The wind chill table is based on sophisticated calculations of heat transfer from the human face in the presence of wind, which greatly increases the loss of heat due to convection. It is based on a lot of modeling assumptions (clear night, no sun, the human face is approximately a cylinder,...), and a lot of experimental data to estimate thermal properties of humans. The wind chill calculations that are used currently in the US, Canada, and the UK are based on a 2001 revision of an earlier wind chill table.

A formula for wind chill

The numbers in the official wind chill table are based on equations from physics and mathematical modeling. The equations are very complicated, and you cannot get the numbers in the table by using a simple formula. However, because a simple formula is very useful for understanding how the apparent temperature depends on wind speed, researchers constructed a linear regression model for which the apparent temperature (WCT) is regressed onto the air temperature and a power of the wind speed. The model includes the main effects and an interaction term.

The formula is accurate to within 1° over the range of physically relevant temperatures and wind speeds. The formula is a valid approximation only for temperatures below 10° C (50° F) and for wind speeds above 5 kph (3 mph). In particular, if you plug in V=0, you do not get the air temperature.

The formula for the wind chill temperature in US units is available on a web page for the National Weather Service:
\(WCT_{US} = 35.74 + 0.6215*T_F - 35.75*V_{mph}^{0.16} + 0.4275*T_F*V_{mph}^{0.16}\)
where TF is the air temperature in degrees Fahrenheit, and Vmph is the wind speed in miles per hour.

The previous formula is a conversion from the official metric formula:
\(WCT_{SI} = 13.12 + 0.6215*T_C - 11.37*V_{kph}^{0.16} + 0.3965*T_C*V_{kph}^{0.16}\)
where TC is the air temperature in degrees Celsius, and Vkph is the wind speed in kilometers per hour. You can get one formula from the other by using the equations TC = 5/9*TF – 32 and Vkph=Vmph / 1.609.

Notice that the exponent for the wind speed is rather strange (0.16), but presumably this is based on either a statistical transformation (such as a Box-Cox transformation) or by fitting the exponent as a parameter in the model. Despite my efforts, I was unable to find details about how the regression formula was fitted.

A little known fact is that the wind speeds in the formula are measured at 10 meters (standard anemometer height). The formula uses a wind-profile model to adjust the wind speed to approximate the speed at the height of an average person.

Visualizing wind chill for a given air temperature

Winds removes heat from a layer next to your skin. If you specify the air temperature but allow the wind speed to vary, a graph of the wind chill temperature (WTC) shows how much colder it feels based on the speed of the wind. The following graph shows the WTC (in US units) for four sub-freezing air temperatures: 30°, 20°, 10°, and 0° F.

In each case, you can see that stronger winds make it seem colder. The effect of the wind is nonlinear and is stronger for colder temperatures. For example, if the air temperature is 30° F, a 25-mph wind makes it seem about 13 degrees colder. However, if the air temperature is 0° F, a 25-mph wind makes it seem about 24 degrees colder.

Creating a wind chill chart in US units

You can evaluate the formula for the wind chill temperature on a regular grid to obtain a chart that enables you to estimate the wind chill from the current conditions. The chart for US units is shown at the top of this article. To use the chart, find the approximate outdoor temperature along the horizontal axis, then move up the chart into the row for the approximate wind speed. For example, an outdoor temperature of 5° F and a wind speed of 20 mph results in a wind chill temperature of -15° F.

Creating a wind chill chart in SI units

You can create a similar chart for SI (metric) units, as shown below. Because the Celsius degree is almost twice as large as the Fahrenheit degree, I used increments of 3° C along the horizontal axis. However, I continue to use 5 as an increment for wind speed, even though 5 kph is only about 3 mph. The ranges for the axes are chosen so that the range of temperatures and wind speeds are approximately the same as for the US chart.

Summary and further reading

This article discusses wind chill temperature. The computations that calculate the numbers in the official wind chill chart are very complicated. However, you can fit a linear regression to the numbers in the official chart to obtain an approximate formula that is easy to evaluate. This article presents several wind chill charts that are based on the regression formula.

The history of wind chill computations is fascinating. The Wikipedia article for "Wind Chill" has a few sentences about the history. For an interesting account of the development of the most recent wind chill chart (2001), I highly recommend reading Osczevski and Bluestein, "The New Wind Chill Equivalent Temperature Chart," 2005. They explain some of the complications and approximations used to construct the chart. They note that "the public seems to have a strong preference for the equivalent temperature" chart, which seems simple but hides a lot of complexity. They conclude that the chart is "a deceptive simplification" that only seems to be easy to understand.

A future article will discuss how I constructed the wind chill charts in SAS. You can download the complete SAS program that creates these graphs.

The post The wind chill chart appeared first on The DO Loop.

2月 222021
 

A previous article describes how to use the SGPANEL procedure to visualize subgroups of data. It focuses on using headers to display information about each graph. In the example, the data are time series for the price of several stocks, and the headers include information about whether the stock price increased, decreased, or stayed the same during a time period. The previous article discusses several advantages and disadvantages of using the SGPANEL procedure for this task.

An alternative approach is to use the BY statement in the SGPLOT procedure to process each subgroup separately. This article shows how to use the #BYVAR and #BYVAL keywords in SAS titles to display information about the data in each subgroup.

The example data

I will use the same data as for the previous article. In real life, you would use a separate analysis to determine whether each stock increased or decreased, but I will hard-code this information for the three stocks in the example data.

The following DATA step creates a subset of the Sashelp.Stocks data. The STOCK variable contains the name of three stocks: IBM, Intel, and Microsoft. The OPEN variable contains the opening stock price for these companies for each month. The DATA step restricts the data to the time period Jan 1998 – May 2000. The TREND variable indicates whether the stock price increased or decreased during the time period.

data Have;
   set Sashelp.Stocks;
   where '01Jan1998'd <= Date <= '30May2000'd;
   /* prepare data to display information */
   if      Stock='IBM'       then Trend='Neutral   ';
   else if Stock='Intel'     then Trend='Increasing';
   else if Stock='Microsoft' then Trend='Decreasing';
run;
/* NOTE: The Sashelp.Stock data set is already sorted by Stock and by Date. 
   Be sure to sort your data if you want to use the BY statement. For example:
proc sort data=Have;
   by Stock Date;
run;
*/

You must sort the data for BY-group processing. The example data are already sorted by the STOCK variable, which is the grouping variable. And within each stock, the data are sorted by the DATE variable, which is important for visualizing the stock prices versus time.

Titles for BY-group analysis

When you run a BY-group analysis, SAS automatically creates a title that indicates the name and value of the BY-group variable(s). (This occurs whenever the BYLINE option is on, and it is on by default.) SAS looks at how many titles you have specified and uses the next available title to display the BY-group information. For example, without doing anything special, you can use the standard BY-group analysis to graph the prices for all three stocks in the data set:

/* assume OPTION BYLINE is set */
title "Stock Price Jan 1998 - May 2000";  /* the BY-line will appear in the TITLE2 position */
proc sgplot data=Have;
   by Stock;
   series x=Date y=Open / lineattrs=(thickness=2);
   yaxis grid label="Stock Price"; /* optional: min=50 max=210 */
   xaxis display=(nolabel);
run;

To save space, I have truncated the output. Each graph shows only a subset of the data. The TITLE2 line displays the name of the BY-group variable (Stock) and the value of the variable. All this happens automatically.

Customize titles: The #BYVAL substitution

The TITLE and TITLEn statements in SAS support substituting the values of a BY-group variable. You can insert the name of a BY-group variable by using the #BYVARn keyword. You can insert the name of a BY-group value by using the #BYVALn keyword. When using these text substitutions, you should specify OPTIONS NOBYLINE to suppress the automatic generation of subtitles.

By default, the BY statement generates the plots one after another, as shown in the previous example. However, you can use the ODS LAYOUT GRIDDED statement to arrange the graphs in a lattice. Essentially, you are using ODS to replicate the layout that PROC SGPANEL handles automatically. In the following example, I let the vertical scale of the axes vary according to the values for each BY group. If you prefer, you can use the MIN= and MAX= options on the YAXIS statement to specify a range of values for each axis.

/* layout the graphs. Use the #BYVALn values to build the titles */
ods graphics / width=300px height=250px;         /* make small to fit on page */
options nobyline;                                /* suppress Stock=Value title */
ods layout gridded columns=3 advance=table;      /* layout in three columns */
title "BY-Group Analysis of the #byvar1 Variable";/* substitute variable name for #BYVAR */
title2 "Time Series for #byval1";                /* substitute name of stock for #BYVAL */
 
proc sgplot data=Have;
   by Stock;
   series x=Date y=Open / lineattrs=(thickness=2);
   yaxis grid label="Stock Price"; /* optional: min=50 max=210 */
   xaxis display=(nolabel);
run;
 
ods layout end;                                  /* end the gridded layout */
options byline;                                  /* turn the option on again */

Now you can see the power of the #BYVAL keyword. (Click the graph to enlarge it.) It gives you great flexibility in creating a custom subtitle that contains the value of the BY-group variable. The keywords #BYVAR and #BYVAL are an alias for #BYVAR1 and #BYVAL1, just like TITLE is an alias for TITLE1. The next example uses a second BY-group variable.

Because the BY statement supports multiple BY-group variables and because you can specify the NOTSORTED option for variables that are not sorted, you can include the TREND variable as a second BY=group variable. You can then use both the #BYVAL1 and #BYVAL2 keywords to further customize the titles:

options nobyline;                                /* suppress Stock=Value title */
ods layout gridded columns=3 advance=table;      /* layout in three columns */
title2 "The Time Series for #byval1 Is #byval2"; /* substitute stock name and trend value */
 
proc sgplot data=Have;
   by Stock TREND notsorted;
   series x=Date y=Open / lineattrs=(thickness=2);
   yaxis grid label="Stock Price"; /* optional: min=50 max=210 */
   xaxis display=(nolabel);
run;
 
ods layout end;                                  /* end the gridded layout */
options byline;                                  /* turn the option on again */

Controlling attributes in a BY-group

I have one more tip. You can use a discrete attribute map to link the attributes of markers and lines to the value of a variable in the data. For example, suppose you want to color the lines in these plots according to whether the stock price increased, decreased, or stayed the same. The following DATA step creates a discrete attribute map that assigns the line colors based on the value in the TREND variable. On the PROC SGPLOT statement, you can use the DATTRMAP= option, which makes the data map available to the procedure. You can add the ATTRID= option to the SERIES statement. Because the colors are determined by the GROUP=TREND option, the procedure will look at the attribute map to determine which color to use for each line.

/* Create a discrete attribute map. Line color is determined by the TREND value. */
data Attrs;
length Value $20 LineColor $20;
ID = "StockTrend";
Value='Neutral';    LineColor = "DarkBlue    "; output;
Value='Increasing'; LineColor = "DarkGreen   "; output;
Value='Decreasing'; LineColor = "DarkRed     "; output;
run;
 
options nobyline;                                /* suppress Stock=Value title */
ods layout gridded columns=3 advance=table;      /* layout in three columns */
 
proc sgplot data=Have noautolegend DATTRMAP=Attrs;
   by Stock Trend notsorted;
   series x=Date y=Open / group=Trend ATTRID=StockTrend lineattrs=(thickness=2);
   yaxis grid label="Stock Price";
   xaxis display=(nolabel);
run;
 
ods layout end;                                  /* end the gridded layout */
options byline;                                  /* turn the option on again */

Summary and further reading

This article shows how to customize the title and attributes in graphs that are generated as part of a BY-group analysis. You can use the #BYVARn and #BYVALn keywords to insert information about the BY groups into titles. You can use a discrete attribute map to link attributes in a graph (such as line color) to the values of a variable in the data. Although creating a sequence of graphs by using a BY-group analysis is a powerful technique, I often prefer to use PROC SGPANEL, for the reasons discussed in a previous article. PROC SGPANEL provides support for controlling many features of the graphs and the layout of the graphs.

If you are interested in the SAS Macro solution to creating titles, you can read the original thread and solution on the SAS Support Communities. For more information about using the #BYVAR and #BYVAL keywords in SAS titles, see

The post How to use the #BYVAR and #BYVAL keywords to customize graph titles in SAS appeared first on The DO Loop.