Statistical Graphics

June 20, 2022
 

For a linear regression model, a useful but underutilized diagnostic tool is the partial regression leverage plot. Also called the partial regression plot, this plot visualizes the parameter estimates table for the regression. For each effect in the model, you can visualize the following statistics:

  • The estimate for each regression coefficient in the model.
  • The hypothesis tests β0=0, β1=0, ..., where βi is the regression coefficient for the i_th effect in the model.
  • Outliers and high-leverage points.

This article discusses partial regression plots, how to interpret them, and how to create them in SAS. If you are performing a regression that uses k effects and an intercept term, you will get k+1 partial regression plots.

Example data for partial regression leverage plots

The following SAS DATA step uses Fisher's iris data. To make it easier to discuss the roles of various variables, the DATA step renames the variables. The variable Y is the response, and the explanatory variables are x1, x2, and x3. (Explanatory variables are also called regressors.)

In SAS, you can create a panel of partial regression plots automatically in PROC REG. Make sure to enable ODS GRAPHICS. Then you can use the PARTIAL option on the MODEL statement in PROC REG to create the panel. To reduce the number of graphs that are produced, the following call to PROC REG uses the PLOTS(ONLY)=(PARTIALPLOT) option to display only the partial regression leverage plots.

data Have;
set sashelp.iris(rename=(PetalLength = Y
                 PetalWidth  = x1
                 SepalLength = x2
                 SepalWidth  = x3)
                 where=(Species="Versicolor"));
ID = _N_;
label Y= x1= x2= x3=;
run;
 
ods graphics on;
title "Basic Partial Regression Leverage Plots";
proc reg data=Have plots(only)=(PartialPlot);
   model Y = x1 x2 x3 / clb partial;
   ods select ParameterEstimates PartialPlot;
quit;

Let's call the parameter for the Intercept term the 0th coefficient, the parameter for x1 the 1st coefficient, and so on. Accordingly, we'll call the upper left plot the 0th plot, the upper right plot the 1st plot, and so on.

A partial regression leverage plot is a scatter plot that shows the residuals for two specific regression models. In the i_th plot (i=0,1,2,3), the vertical axis plots the residuals for the regression model where Y is regressed onto the explanatory variables but omits the i_th variable. The horizontal axis plots the residuals for the regression model where the i_th variable is regressed onto the other explanatory variables. For example:

  • The scatter plot with the "Intercept" title is the 0th plot. The vertical axis plots the residual values for the model that regresses Y onto the no-intercept model with regressors x1, x2, and x3. The horizontal axis plots the residual values for the model that regresses the Intercept column in the design matrix onto the regressors x1, x2, and x3. Thus, the regressors in this plot omit the Intercept variable.
  • The scatter plot with the "x1" title is the 1st plot. The vertical axis plots the residual values for the model that regresses Y onto the model with regressors Intercept, x2, and x3. The horizontal axis plots the residual values for the model that regresses x1 onto the regressors Intercept, x2, and x3. Thus, the regressors in this plot omit the x1 variable.

These plots are called "partial" regression plots because each plot is based on a regression model that contains only part of the full set of regressors. The i_th plot omits the i_th variable from the set of regressors.
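The construction described above can be sketched by hand for the x1 plot. This is only an illustration, not the code that PROC REG uses internally; the data set names (StepY, PartialX1) and the residual variable names (ResidY, ResidX1) are hypothetical. The last step fits a line to the two sets of residuals so that its slope can be compared with the x1 estimate in the full model.

```sas
/* Step 1: residuals of Y regressed onto Intercept, x2, x3 (x1 is omitted) */
proc reg data=Have noprint;
   model Y = x2 x3;
   output out=StepY r=ResidY;
quit;

/* Step 2: residuals of x1 regressed onto the same regressors */
proc reg data=StepY noprint;
   model x1 = x2 x3;
   output out=PartialX1 r=ResidX1;
quit;

/* Step 3: scatter plot of the two sets of residuals, with a fitted line */
title "Hand-Built Partial Regression Plot for x1";
proc sgplot data=PartialX1;
   refline 0 / axis=y;
   scatter x=ResidX1 y=ResidY;
   reg     x=ResidX1 y=ResidY / nomarkers;
run;

/* Step 4: fit the line numerically; compare the slope with the
   x1 estimate in the ParameterEstimates table of the full model */
proc reg data=PartialX1 plots=none;
   model ResidY = ResidX1;
   ods select ParameterEstimates;
quit;
```

The same steps apply to x2 and x3 if you swap the roles of the variables. For the Intercept plot, you would need the NOINT option and a column of ones, which this sketch does not show.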

Interpretation of a partial regression leverage plot

Each partial regression plot includes a regression line. It is this line that makes the plot useful for visualizing the parameter estimates. The line passes through the point (0, 0) in each plot. The slope of the regression line in the i_th plot is the parameter estimate for the i_th regression coefficient (βi) in the full model. If the regression line is close to being horizontal, that is evidence for the null hypothesis βi=0.

To demonstrate these facts, look at the partial regression plot for the Intercept. The partial plot has a regression line that is very flat (almost horizontal). This is because the parameter estimate for the Intercept is 1.65 with a standard error of 4. The 95% confidence interval is [-6.4, 9.7]. Notice that this interval contains 0, which means that the flatness of the regression line in the partial regression plot supports "accepting" (failing to reject) the null hypothesis that the Intercept parameter is 0.

In a similar way, look at the partial regression plot for the x1 variable. The partial plot has a regression line that is not flat. This is because the parameter estimate for x1 is 1.36 with a standard error of 0.24. The 95% confidence interval is [0.89, 1.83], which does not contain 0. Consequently, the steepness of the slope of the regression line in the partial regression plot visualizes the fact that we would reject the null hypothesis that the x1 coefficient is 0.

The partial regression plots for the x2 and x3 variables are similar. The regression line in the x2 plot has a steep slope, so the confidence interval for the x2 parameter does not contain 0. The regression line for the x3 variable has a negative slope because the parameter estimate for x3 is negative. The line is very flat, which indicates that the confidence interval for the x3 parameter contains 0.

John Sall ("Leverage Plots for General Linear Hypotheses", The American Statistician, 1990) showed that you can add a confidence band to the partial regression plot such that if the line segment for Y=0 is not completely inside the band, then you reject the null hypothesis that the regression coefficient is 0.

Identify outliers by using partial regression leverage plots

Here is a remarkable mathematical fact about the regression line in a partial regression plot: the residual for each observation in the scatter plot is identical to the residual for the same observation in the full regression model! Think about what this means. The full regression model in this example is a set of 50 points in four-dimensional space. The regression surface is a 3-D hyperplane over the {x1, x2, x3} variables. Each observation has a residual, which is obtained by subtracting the predicted value from the observed value of Y. The remarkable fact is that in each partial regression plot, the residuals between the regression lines and the 2-D scatter points are exactly the same as the residuals in the full regression model. Amazing!

One implication of this fact is that you can identify points where the residual value is very small or very large. The small residuals indicate that the model fits these points well; the large residuals are outliers for the model.
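One way to find such points is to write the residuals and leverage values for the full model to a data set and then sort them. The following is a minimal sketch. It assumes that the raw residual and the diagonal of the hat matrix are the statistics of interest, which might not be the exact criteria used in this article. The data set and variable names (Diag, ByResid, ByLev, Resid, Leverage, AbsResid) are hypothetical, and the cutoffs of four and two observations are arbitrary.

```sas
/* write residuals (R=) and leverage (H=) for the full model to a data set */
proc reg data=Have noprint;
   model Y = x1 x2 x3;
   output out=Diag r=Resid h=Leverage;
quit;

data Diag;
   set Diag;
   AbsResid = abs(Resid);      /* magnitude of the residual */
run;

proc sort data=Diag out=ByResid;  by descending AbsResid;  run;
proc sort data=Diag out=ByLev;    by descending Leverage;  run;

title "Largest Residuals (by Magnitude)";
proc print data=ByResid(obs=4) noobs;
   var ID Resid Leverage;
run;

title "Largest Leverage Values";
proc print data=ByLev(obs=2) noobs;
   var ID Resid Leverage;
run;
```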

Let's identify some outliers for the full model and then locate those observations in each of the partial regression plots. If you run the full regression model and analyze the residual values, you can determine that the observations that have the largest (magnitude of) residuals are ID=15, ID=39, ID=41, and ID=44. Furthermore, the next section will look at high-leverage points, which are ID=2 and ID=3. Unfortunately, the PLOTS=PARTIALPLOT option does not support the LABEL suboption, so we need to output the partial regression data and create the plots manually. The following DATA step adds a label variable to the data and reruns the regression model. The PARTIALDATA option on the MODEL statement creates an ODS table (PartialPlotData) that you can write to a SAS data set by using the ODS OUTPUT statement. That data set contains the coordinates that you need to create all of the partial regression plots manually:

/* add indicator variable for outliers and high-leverage points */
data Have2;
set Have;
if ID in (2,3) then Special="Leverage";
else if ID in (15,39,41,44) then Special="Outlier ";
else Special = "None    ";
if Special="None" then Label=.;
else Label=ID;
run;
 
proc reg data=Have2 plots(only)=PartialPlot;
   model Y = x1 x2 x3 / clb partial partialdata;
   ID ID Special Label;
   ods exclude PartialPlotData;
   ods output PartialPlotData=PD;
quit;

You can use the PD data set to create the partial regression plot. The variables for the horizontal axis have names such as Part_Intercept, Part_x1, and so forth. The variables for the vertical axis have names such as Part_Y_Intercept, Part_Y_x1, and so forth. Therefore, it is easy to write a macro that creates the partial regression plot for any variable.

%macro MakePartialPlot(VarName);
   title "Partial Leverage Plot for &VarName";
   proc sgplot data=PD ;
      styleattrs datasymbols=(CircleFilled X Plus);
      refline 0 / axis=y;
      scatter y=part_Y_&VarName x=part_&VarName / datalabel=Label group=Special;
      reg     y=part_Y_&VarName x=part_&VarName / nomarkers;
   run;
%mend;

The following ODS GRAPHICS statement uses the PUSH and POP options to temporarily set the ATTRPRIORITY option to NONE so that the labeled points appear in different colors and symbols. The program then creates all four partial regression plots and restores the default options:

ods graphics / PUSH AttrPriority=None width=360px height=270px;
%MakePartialPlot(Intercept);
%MakePartialPlot(x1);
%MakePartialPlot(x2);
%MakePartialPlot(x3);
ods graphics / POP;      /* restore ODS GRAPHICS options */

The first two plots are shown. The outliers (ID in (15,39,41,44)) will have large residuals in EVERY partial regression plot. In each scatter plot, you can see that the green markers with the "plus" (+) symbol are far from the regression line. Therefore, they are outliers in every scatter plot.
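You can also check numerically that the residuals about the line in a partial regression plot match the residuals in the full model. The following sketch uses the PD data set and the x1 plot. It assumes that the rows of PD are in the same order as the rows of Have2, so that a one-to-one merge lines up the observations. The names PartResid, FullResid, CompareResid, PartialResid, and Diff are hypothetical.

```sas
/* residuals about the regression line in the x1 partial plot */
proc reg data=PD noprint;
   model part_Y_x1 = part_x1;
   output out=PartResid r=PartialResid;
quit;

/* residuals for the full model */
proc reg data=Have2 noprint;
   model Y = x1 x2 x3;
   output out=FullResid r=FullResid;
quit;

/* one-to-one merge: ASSUMES both data sets list observations in the same order */
data CompareResid;
   merge PartResid(keep=PartialResid) FullResid(keep=ID FullResid);
   Diff = PartialResid - FullResid;
run;

title "Difference Between Partial-Plot and Full-Model Residuals";
proc means data=CompareResid n min max;
   var Diff;
run;
```

If the two sets of residuals agree, the minimum and maximum of Diff should be zero up to rounding error.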

Identify high-leverage points by using partial regression leverage plots

In a similar way, the partial regression plots enable you to see whether a high-leverage point has extreme values in any partial coordinate. For this example, two high-leverage points are ID=2 and ID=3, and they are displayed as red X-shaped markers.

The previous section showed two partial regression plots. In the partial plot for the Intercept, the two high-leverage observations do not look unusual. However, in the x1 partial plot, you can see that both observations have extreme values (both positive) for the "partial x1" variable.

Let's look at the other two partial regression plots:


The partial regression plot for x2 shows that the "partial x2" coordinate is extreme (very negative) for ID=3. The partial regression plot for x3 shows that the "partial x3" coordinate is extreme (very negative) for ID=2. Remember that these extreme values are not the coordinates of the variables themselves, but of the residuals when you regress each variable onto the other regressors.

It is not easy to explain in a few sentences how the high-leverage points appear in the partial regression plots. I think the details are described in Belsley, Kuh, and Welsch (Regression Diagnostics, 1980), but I cannot check that book right now because I am working from home. But these extreme values are why the Wikipedia article about partial regression plots states, "the influences of individual data values on the estimation of a coefficient are easy to see in this plot."

Summary

In summary, the partial regression leverage plots provide a way to visualize several important features of a high-dimensional regression problem. The PROC REG documentation includes a brief description of how to create partial regression leverage plots in SAS. As shown in this article, the slope of a partial regression line equals the parameter estimate, and the relative flatness of the line enables you to visualize the null hypothesis βi=0.

This article describes how to interpret the plots to learn more about the regression model. For example, outliers in the full model are also outliers in every partial regression plot. Observations that have small residuals in the full model also have small residuals in every partial regression plot. The high-leverage points will often show up as extreme values in one or more partial regression plots. To examine outliers and high-leverage points in SAS, you can use the PARTIALDATA option to write the partial regression coordinates to a data set, then add labels for the observations of interest.

Partial regression leverage plots are a useful tool for analyzing the fit of a regression model. They are most useful when the number of observations is not too big. I recommend them when the sample size is 500 or less.

The post Partial leverage plots appeared first on The DO Loop.

June 15, 2022
 

The ODS GRAPHICS statement in SAS supports more than 30 options that enable you to configure the attributes of graphs that you create in SAS. Did you know that you can display the current set of graphical options? Furthermore, did you know that you can temporarily set certain options and then restore the options to their previous values? This article shows how to use the SHOW, PUSH, POP, and RESET options on the ODS GRAPHICS statement.

SHOW the ODS GRAPHICS options

ODS graphics have a default set of characteristics, but you can use the ODS GRAPHICS statement to override the default characteristics for graphs. Probably the most familiar options are the WIDTH= and HEIGHT= options, which set a graph's horizontal and vertical dimensions, respectively. For example, the following ODS GRAPHICS statement resets all options to their default values (RESET) and then sets the dimensions of future graphs. You can use the SHOW option on the ODS GRAPHICS statement to list the current value of several graphical options:

ods graphics / reset width=480px height=360px;
ods graphics / SHOW;

The SAS log shows the value of many options. The first few lines are shown below:

ODS Graphics Settings
---------------------
Output format:                     STATIC
By line:                           NOBYLINE
Antialias:                         ON
Maximum Loess observations:        5000
Image width:                       480px
Image height:                      360px
Maximum stack depth:               1024
Stack depth:                       0
... other options omitted ...

The log reports that the width and height of future graphs are set to 480 pixels and 360 pixels, respectively. Note that the "stack depth" is set to 0, which means that no options have been pushed onto the stack. I will say more about the stack depth in a later section.

To demonstrate that the new options are in effect, the following call to PROC SGPLOT creates a scatter plot of two variables in the SasHelp.Iris data set and uses the GROUP= option to visualize the three different species of flowers in the data:

title "Indicate Groups by Using GROUP= Option";
title2 "HTML Destination Uses AttrPriority=COLOR";
proc sgplot data=sashelp.iris;
   scatter x=PetalWidth y=SepalWidth/ group=Species;
run;

This image shows what the graph looks like in the HTML destination, which uses the ATTRPRIORITY=COLOR option by default. When you use the ATTRPRIORITY=COLOR option, groups are visualized by changing the color of markers (or lines). Thus, some markers are blue, others are reddish, and others are green. Other ODS destinations use the ATTRPRIORITY=NONE option, which visualizes groups by changing both the color and symbol of markers (and both the color and line pattern of lines). For more information about the ATTRPRIORITY= option, see "Getting started with SGPLOT: Style Attributes."

PUSH new options onto the stack

If you override the default value of an option, it will affect all future graphs until you reset the option or end the SAS session. Often, this is exactly what you want to happen. However, sometimes you want to temporarily override options, create some graphs, and then restore the existing set of options. For example, if you are producing code for others to use (such as writing a SAS macro), it is a good programming practice to not change any options that the user has set. For this situation, you can use the PUSH and POP options on the ODS GRAPHICS statement.

Let's see how the PUSH option works. The following statement temporarily overrides the ATTRPRIORITY= option by pushing the option NONE onto the current stack of options. (If you are not familiar with the stack data structure, read the Wikipedia article about stacks.) The call to PROC SGPLOT then creates a scatter plot. The scatter plot uses both colors and symbols to visualize the species of flowers.

ods graphics / PUSH AttrPriority=None;    /* temporarily change option to NONE */
 
title "Indicate Groups by Using Symbols";
title2 "Use AttrPriority=NONE";
proc sgplot data=sashelp.iris;
   scatter x=PetalWidth y=SepalWidth/ group=Species;
run;

You can confirm that the options have changed by using the SHOW option to see the new values of the ODS GRAPHICS options:

ods graphics / SHOW;
ODS Graphics Settings
---------------------
Output format:                     STATIC
By line:                           NOBYLINE
Antialias:                         ON
Maximum Loess observations:        5000
Image width:                       480px
Image height:                      360px
Attribute priority:                NONE
Maximum stack depth:               1024
Stack depth:                       1
... other options omitted ...

The SAS log confirms that the ATTRPRIORITY= option is set to NONE. In addition, the "stack depth" value is now 1, which means that you have pushed options onto the stack. Additional ODS GRAPHICS statements will affect the state of this stack, but they are "temporary" in the sense that you can use the POP option to restore the previous options, as shown in the next section.

POP the stack to restore old options

If you use the PUSH option to create a new set of options, you can use the POP option to restore the previous options. For example, the following statement pops the stack and displays the current options to the SAS log:

ods graphics / POP;           /* restore previous options */
ods graphics / SHOW;
ODS Graphics Settings
---------------------
Output format:                     STATIC
By line:                           NOBYLINE
Antialias:                         ON
Maximum Loess observations:        5000
Image width:                       480px
Image height:                      360px
Maximum stack depth:               1024
Stack depth:                       0
... other options omitted ...

The log shows that the ATTRPRIORITY= option is no longer set, which means that each destination will use its default behavior. In addition, the stack depth is back to 0, which means that the options that are displayed are the "base" options at the bottom of the stack. Notice that the image width and height are not the default values because those options were set before the first PUSH option was set.
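The PUSH/POP pattern is easy to wrap inside a macro so that the caller's options are left untouched. The following is a minimal sketch; the macro name (SmallScatter) and its parameters are hypothetical and chosen only for illustration.

```sas
/* hypothetical macro: temporarily change graphics options, then restore them */
%macro SmallScatter(dsname, xvar, yvar, groupvar);
   ods graphics / PUSH AttrPriority=None width=360px height=270px;
   proc sgplot data=&dsname;
      scatter x=&xvar y=&yvar / group=&groupvar;
   run;
   ods graphics / POP;      /* restore the caller's options */
%mend;

%SmallScatter(sashelp.iris, PetalWidth, SepalWidth, Species);
ods graphics / SHOW;        /* check the log: the caller's settings should be back */
```

After the macro runs, the SHOW option lets you confirm in the log that the image dimensions and the stack depth are the same as they were before the call.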

RESET all or some options

Even though the width and height options were set before the first PUSH operation, you can still restore the width and height to their default values. The RESET= option can reset individual options, as follows:

ods graphics / reset=width reset=height;    /* restore width and height to default values */

You can restore all options to their default values by using the RESET=ALL option. Alternatively, you can use the RESET option without an equal sign.

/* restore all options to default values */
ods graphics / reset=all;    /* same as ODS GRAPHICS / RESET; */

Summary

The ODS GRAPHICS statement enables you to set options that affect the characteristics of future graphs. If you want to temporarily change the options, you can use the PUSH option to add options to the stack. All future graphs will use those options. You can restore the previous set of options by using the POP option to pop the stack. If you are writing code for others to use (for example, a SAS macro), it is good programming practice to push your options onto a stack and then pop the options after your program completes. In this way, you will not overwrite options that the user of your program has explicitly set.

If you set an option but later want to restore it to its default value, you can use the RESET= option.

The post PUSH, POP, and reset options for ODS graphics appeared first on The DO Loop.

April 20, 2022
 

Recently, I showed how to use a heat map to visualize measurements over time for a set of patients in a longitudinal study. The visualization is sometimes called a lasagna plot because it presents an alternative to the usual spaghetti plot. A reader asked whether a similar visualization can be created for subjects if the response is an ordinal variable, such as a count. Yes! And the heat map approach is a substantial improvement over a spaghetti plot in this situation.

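This article pulls together several techniques from previous articles: using a SAS format to bin numerical values into ordinal categories, using a discrete attribute map to associate colors with those categories, and using a heat map to display the result.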

This article uses the following data, which represent the counts of malaria cases at five clinics over a 14-week time period:

data Clinical;
input SiteID @;
do Week = 1 to 14;
   input Count @;
   output;
end;
/* ID Wk1  Wk2  Wk3  Wk4 ... Wk14 */
datalines;
001  1 0 0 0 0 0 0 3 1 3 3 0 3 0
002  0 0 0 1 1 2 1 2 2 1 1 0 2 2
003  1 . . 1 0 1 0 3 . 1 0 3 2 1
004  1 1 . 1 0 1 2 2 3 2 1 0 . 0
005  1 1 1 . 0 0 0 1 0 1 2 4 5 1
;
 
title "Spaghetti Plot of Counts for Five Clinics";
proc sgplot data=Clinical;
   series x=Week y=Count / group=SiteID; 
   xaxis integer values=(1 to 14) valueshint;
run;

The line plot is not an effective way to visualize these data. In fact, it is almost useless. Because the counts are discrete integer values, and most counts are in the range [0, 3], the graph cannot clearly show the weekly values for any one clinic. The following sections develop a heat map that visualizes these data better.

Format the raw values

The following call to PROC FORMAT defines a format that associates character strings with values of the COUNT variable:

proc format;
value CountFmt
      . = "Not Counted"
      0 = "None"
      1 = "1"
      2 = "2"
      3 = "3" 
      4 - high = "4+";
run;

You can use this format to encode values and display them in a legend. Notice that you could also use a format to combine counts, such as using the word "Few" to describe 2 or 3 counts.
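For example, a format that combines counts might look like the following sketch. The format name (CountCatFmt) and the category boundaries are hypothetical; choose boundaries that make sense for your data. The call to PROC FREQ applies the format to the Clinical data so that you can see how many observations fall into each category.

```sas
proc format;
value CountCatFmt              /* hypothetical format that combines counts */
      . = "Not Counted"
      0 = "None"
      1 = "One"
      2 - 3 = "Few"
      4 - high = "Many";
run;

title "Counts Binned by Using a Format";
proc freq data=Clinical;
   format Count CountCatFmt.;
   tables Count / missing;     /* include the missing-value category */
run;
```

If you use a format like this in a heat map, the discrete attribute map must contain the formatted values of this format instead of the values of CountFmt.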

Associate colors to formatted values

A discrete attribute map ensures that the colors will not change if the data change. For ordinal data, it also ensures that the legend will be in ordinal order, as opposed to "data order" or alphabetical order.

You can use a discrete attribute map to associate graphical attributes with the formatted value of a variable. Examples of graphical attributes include marker colors, marker symbols, line colors, and line patterns. For a heat map, you want to associate the "fill color" of each cell with a formatted value. The following DATA step creates the mapping between values and colors. Notice that I use the PUTN function to apply the format to raw data values. This ensures that the mapping correctly associates formatted values with colors. The raw values are stored in an array (VAL) as are the colors (COLOR). This makes it easy to modify the map in the future or to adapt it to other situations.

data MyAttrs;
length Value $11 FillColor $20;
retain ID 'MalariaCount'               /* name of map */
     Show 'AttrMap';                   /* always show all groups in legend */
 
/* output the formatted value and color for a missing value */
Value = putn(., "CountFmt.");          /* formatted value */
FillColor = "LightCyan";               /* color for missing value */
output;
/* output the formatted values and colors for nonmissing values */
array   val[5]     _temporary_ (0 1 2 3 4);
array color[5] $20 _temporary_ ('White' 'CXFFFFB2' 'CXFECC5C' 'CXFD8D3C' 'CXE31A1C');
do i = 1 to dim(val);
   Value = putn(val[i], "CountFmt.");  /* formatted value for this raw value */
   FillColor = color[i];               /* color for this formatted value */
   output;
end;
drop i;
run;

Create a discrete heat map

Now you can create a heat map that uses the format and discrete attribute map from the previous sections. To use the map, you must specify two pieces of information:

  • Use the DATTRMAP= option on the PROC SGPLOT statement to specify the name of the data set that contains the map.
  • Because a data set can contain multiple maps, use the ATTRID= option on the HEATMAPPARM statement to specify the value of the ID variable that contains the attributes for these data.

title "Heat Map of Malaria Data";
proc sgplot data=Clinical DATTRMAP=MyAttrs; /* <== the data set that contains attributes */
   format Count CountFmt.;                  /* <== apply the format to bin data */
   heatmapparm x=Week y=SiteID colorgroup=Count / outline outlineattrs=(color=gray)
               ATTRID=MalariaCount;         /* <== the discrete attribute map for these data */
   discretelegend;
   refline (1.5 to 5.5) / axis=Y lineattrs=(color=black thickness=2);
   xaxis integer values=(1 to 14) valueshint;
run;

The heat map makes the data much clearer than the spaghetti plot. For each clinic and each week, you can determine how many cases of malaria were seen. As mentioned earlier, you can use this same technique to classify counts into categories such as None, Few, Moderate, Many, and so forth.

Summary

You can use SAS formats to bin numerical data into ordinal categories. You can use a discrete attribute map to associate colors and other graphical attributes with categories. By combining these techniques, you can create a heat map that enables you to visualize ordinal responses for subjects over time. The heat map is much better at visualizing the data than a spaghetti plot is.

The post Use a heat map to visualize an ordinal response in longitudinal data appeared first on The DO Loop.

April 4, 2022
 

Oh, no! Your boss just told you to change the way that SAS displays certain features in graphs, such as missing values. But you have a library of hundreds of SAS programs! Do you need to modify all of your previous programs? Fortunately, the answer is no. SAS provides ODS styles that control the colors and attributes that appear in graphs and tables. By using styles, you can make one change that affects all graphs that use the style.

This article provides a simple introduction to modifying a graph style in SAS. To keep things simple, the article shows how to modify the GraphMissing style element, which is used to render markers, lines, and bars that represent missing values.

Style modification is not hard, but I do not recommend it for beginning programmers because it requires familiarity with several aspects of SAS software. The appendix describes simpler ways to modify the attributes of graph elements.

An example of style modification: Change the missing-value color

A previous article about SAS graphs shows how to change the missing-value color by using a range attribute map. You can use the range attribute map on a case-by-case basis to modify the missing-value color in graphs.

The range attribute map is a good solution when you are creating a new graph. However, it will not affect existing graphs. For example, if your company runs a SAS report every morning that creates hundreds of graphs, the colors in those graphs will not change unless you rewrite the programs to explicitly use the range attribute map. In contrast, if you create a new ODS style and render the graphs in that style, existing programs will reflect the change.

This article shows how to change the missing-value color by modifying the GraphMissing style element. By default, my ODS style uses gray to display a tile or marker that represents a missing value as shown in the scatter plot to the right. In this graph, the gray-colored markers represent patients that have missing values for the cholesterol variable. In this example, I will replace gray with a bright color (cyan) so that the change is obvious.

This article solves the following subtasks:

  1. What is the current ODS style in your SAS session?
  2. Which style template defines the style element that you want to modify?
  3. How do you modify the style element?
  4. How do you use the modified style?

Creating a new ODS style requires knowledge about the way SAS works. The appendix provides references that contain more details about how to modify an ODS style.

What is the current ODS style?

You can discover which ODS destinations are open by using the Dictionary.Destinations dictionary table or the Sashelp.Vdest view. For each open destination, the output also shows the ODS style for the destination, as follows:

proc print DATA=SASHELP.VDEST;
run;

In the SAS 9 windowing environment, you might see the following output:

This output tells you that the HTML destination is open and is using the HTMLBlue style. Other SAS interfaces might give different results. For example, running SAS in SAS Studio might give the following output:

This output tells you that the HTML5 destination is open and is using the Illuminate style.

The remainder of this article uses the HTMLBlue style, but the methods apply to any other style.
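The same information is available from the Dictionary.Destinations table that was mentioned at the start of this section. The following sketch uses PROC SQL and selects all columns so that it does not rely on any particular column names.

```sas
/* list the open ODS destinations and their styles */
proc sql;
   select *
   from dictionary.destinations;
quit;
```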

Which style template defines the style element?

The information in this section is not strictly necessary, but I think it is useful for understanding how the GraphMissing style element is defined. This section shows how to find the definition of a style element that is currently being used. The subsequent section shows how to override the element.

Styles inherit properties from other styles, so it isn't always easy to find which template defines an element. I usually start by displaying the definition of the active ODS style. If necessary, I display the parent style, the parent-of-the-parent style, and so forth until I find the definition of the element.

For example, the following statements display the definition of the HTMLBlue style in the log:

proc template;
   source styles.HTMLBlue;
run;
define style Styles.HTMLBlue;
   parent = styles.statistical;
   class GraphColors /
      'gndata12' = cxECE8C4
      'gndata11' = cxDBD8F8
      'gndata10' = cxC6E4BF
...
end;

If you look at the HTMLBlue template, you will see that it does not define the GraphMissing style element. No problem; let's look at the parent style next. The HTMLBlue style inherits properties from the Statistical style, so let's display that template definition:

proc template;
   source styles.Statistical;
run;
define style Styles.Statistical;
   parent = styles.default;
...
   class GraphColors /
      'gncdata12' = cxF9DA04
      'gncdata11' = cxB38EF3
      'gncdata10' = cx47A82A
...
      'gcmiss' = cx979797
      'gmiss' = cx848484
...
end;

The Statistical style does not define the GraphMissing element, but it does define the shades of gray that are used to graph missing values. The 'gmiss' color is used for the fill color of tiles and bars, and the 'gcmiss' color is used for lines and markers. Although you can override these colors, I prefer to override the GraphMissing style element, so let's keep looking. The Statistical style inherits properties from the Default style, so let's display that template:

proc template;
   source styles.Default;
run;
define style Styles.Default;
...
   class GraphMissing /
      fillpattern = "X5"
      markersize = 7px
      markersymbol = "hash"
      linethickness = 1px
      linestyle = 2
      contrastcolor = GraphColors('gcmiss')
      color = GraphColors('gmiss');
...
end;

Here, at last, is the definition of the GraphMissing style element. You can see that the ContrastColor and Color properties are based on the 'gcmiss' and 'gmiss' colors, respectively, in the current ODS style. The next section shows how to redefine the GraphMissing element.

How to modify the style element

We want to redefine the ContrastColor and Color properties of the GraphMissing element. The following statements define a new style called "MyMissing". In this style, I choose cyan (bright blue) to be the color for missing values. I probably wouldn't choose this color for serious work, but it will make it easy to see the new missing-value color for this example.

proc template;
 define style MyMissing;   /* name of new style */
   parent = Styles.HTMLBlue; /* inherit attributes from HTMLBlue */
   class GraphMissing / 
        contrastcolor = cyan     /* color used for markers and lines */
        color = cyan;            /* fill color used for bars and cells */
 end;
run;

That's it. The MyMissing style inherits properties from the HTMLBlue style and overrides the ContrastColor and Color attributes of the GraphMissing element. If you use the MyMissing style, all graphical features that correspond to a missing value will be displayed in cyan.
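
For comparison, here is a minimal sketch of the alternative mentioned earlier: overriding the 'gcmiss' and 'gmiss' colors in the GraphColors element instead of redefining GraphMissing. The style name MyMissingColors is hypothetical, and CX00FFFF is the hexadecimal value for cyan. Because GraphMissing in the Default style refers to these two colors, this sketch assumes that no intermediate style replaces those references.

```sas
proc template;
 define style MyMissingColors;   /* hypothetical name for this sketch */
   parent = Styles.HTMLBlue;     /* inherit attributes from HTMLBlue */
   class GraphColors /
        'gcmiss' = cx00FFFF      /* color that GraphMissing uses for markers and lines */
        'gmiss'  = cx00FFFF;     /* color that GraphMissing uses for fills */
 end;
run;
```

Redefining GraphMissing directly, as in the MyMissing style, keeps the change local to one style element, which is why it is the approach used in the rest of this article.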

How to use the modified style

Let's use the new style. Because I am using the HTML destination, I will specify the name of the style on the ODS HTML statement. The following statements use the style to visualize the blood pressure and cholesterol values of 200 patients in a clinical study:

ods html style=MyMissing;   /* <== use the new style */
 
%let WhiteYeOrRed = (CXFFFFFF CXFFFFB2 CXFECC5C CXFD8D3C CXE31A1C);
title "Missing Values Displayed in GraphMissing Color";
proc sgplot data=Sashelp.Heart(obs=200);
   scatter x=Diastolic y=Systolic / colorresponse=Cholesterol 
        markerattrs=(symbol=CircleFilled size=9) colormodel=&WhiteYeOrRed; 
   gradlegend;
   legenditem type=marker name='Miss' / label="Missing" markerattrs=GraphMissing(symbol=CircleFilled);
   keylegend 'Miss';
run;
 
ods html style=HTMLBlue;    /* <== restore the previous style */

The new scatter plot is shown. It is a scatter plot of the blood pressure measurements for 200 patients. The color of a marker indicates the patient's cholesterol level. You can see that some patients have a missing value for their cholesterol level. Those markers are displayed in the GraphMissing color, which is cyan for this new ODS style. In the previous scatter plot, those markers were colored gray.

Summary

SAS graphs use ODS styles to determine the attributes of elements in a graph, such as markers, lines, and bars. This article shows a simple example of how to modify an ODS style for a graph in SAS. The example shows how to redefine the color used to represent a missing value.

Further reading

Modifying an ODS style is not the easiest way to change the attributes of graphs. If you are new to SAS, the SGPLOT procedure supports several simpler ways to change attributes:

  • Use the "ATTRS" statements, such as MARKERATTRS and LINEATTRS, to set the attributes for a specific plot.
  • Use the STYLEATTRS statement to control attributes that are shared across several plots that you overlay in a single graph.
  • Use a range attribute map to control attributes that are not supported on the STYLEATTRS statement.
  • For modifying graphs that are created automatically by SAS procedures, see the section "ODS Graphics Template Modification" in the SAS/STAT User's Guide. The documentation describes useful techniques and macros such as the %ModStyle and %GrTitle macros, which simplify common tasks.

If you decide that you need to modify an ODS style, read the following articles:

The post Modify an ODS style to change the missing-value color appeared first on The DO Loop.

March 30, 2022
 

In an article about how to visualize missing data in a heat map, I noted that the SAS SG procedures (such as PROC SGPLOT) use the GraphMissing style element to color a bar or tile that represents a missing value. In the HTMLBlue ODS style, the color for missing values is gray. This article shows how to override the GraphMissing color by using a range attribute map in SAS. The appendix of this article includes links to articles that discuss range attribute maps in more detail.

A range attribute map is usually used to define a color model (also called a color ramp) and to associate each color with a value for a variable. However, a range attribute map also supports assigning the color of the missing category, as shown in this article.

Creating a range attribute map enables you to specify the missing-value color for any graph that uses the map. A future article shows how to override the GraphMissing color by modifying an ODS style. Modifying an ODS style enables you to change the missing-value color for all graphs.

Example Data

The following data and heat map are from a previous article. The data are for five patients in a clinical study. After the initial baseline measurement (Week=0), the patients were supposed to be measured weekly for 10 weeks. Only one patient kept all 10 appointments. The remaining patients missed at least two appointments. The following heat map (sometimes called a lasagna plot) shows the clinical measurement for each patient and for each week of the study.

data Clinical;
input patientID @;
do Week = 0 to 10;
   input Value @;
   output;
end;
/* ID Wk0  Wk1  Wk2  Wk3 ... Wk10*/
datalines;
1001  12.0 13.0 13.0   .   .   .  13.0 14.0 14.5 15.0 13.5 
1002  11.5 12.5   .  11.0  .   .    .    .    .   9.5  8.0 
1003  12.0   .    .  11.0  . 10.5 11.0   .    .  10.5  9.0 
1004  11.0 11.0 11.0   .  7.5 6.5   .   7.0  7.5  5.5  4.0 
1005  10.0 10.5 11.0  9.0 7.0 7.5  7.0  7.5  4.0  6.5  5.5 
;
 
%let WhiteYeOrRed = (CXFFFFFF CXFFFFB2 CXFECC5C CXFD8D3C CXE31A1C);
 
title "Missing Values Displayed in GraphMissing Color";
proc sgplot data=Clinical;
   heatmapparm x=Week y=PatientID colorresponse=Value / outline outlineattrs=(color=gray)
        colormodel=&WhiteYeOrRed; 
   gradlegend;
   refline (1000.5 to 1005.5) / axis=Y lineattrs=(color=black thickness=2);
   xaxis integer values=(0 to 10) valueshint;
   legenditem type=fill name='missItem' / fillattrs=GraphMissing label="Missing Data";
   keylegend 'missItem';
run;

The value of the clinical measurement is indicated by using a white-yellow-orange-red color model. Missed appointments are displayed in gray, which is the color of the GraphMissing style element in the ODS style that I am using. Suppose you want to use a color other than gray. You can override the color used in the ODS style, which will affect all graphs, or you can create a range attribute map and use it for only this one graph. The next section shows how to define a range attribute map.

Define a range attribute map

The references in the appendix provide details, but the primary purpose of a range attribute map is to map a set of continuous values onto a spectrum of colors. In short, a range attribute map is a special SAS data set that enables you to define the colors in a custom color ramp and the values that the ramp represents.

The data set must contain variables named MIN and MAX, which you use to associate a range of values to colors. But there are special values that you can use in the MIN or MAX columns:

  1. MIN = _MIN_ specifies the smallest data value in a variable.
  2. MAX = _MAX_ specifies the largest data value in a variable.
  3. MIN = _MISSING_ specifies how to assign attributes to missing values for the variable.

To make sure we can clearly see the missing values, let's choose a bright and obnoxious color, such as cyan (bright blue). I wouldn't choose this color for serious work, but it will make it easy to see the missing values in this example.

/* create a range attribute data set */
data MyRangeAttrs;
retain ID "MapMissing";
length min $10 max $10 
       color altcolor colormodel1 colormodel2 colormodel3 colormodel4 colormodel5 $15;
input min max color altcolor colormodel1 colormodel2 colormodel3 colormodel4 colormodel5;
datalines;
_MISSING_  .     CYAN CYAN .        .        .        .        .
_MIN_      _MAX_ .    .   CXFFFFFF CXFFFFB2 CXFECC5C CXFD8D3C CXE31A1C
;

The variables in the data set must have certain names, as specified in the documentation. The first observation specifies the colors for the missing values (MIN=_MISSING_). The COLOR variable specifies the color for areas and bars. The ALTCOLOR variable specifies the color for markers and lines. The second observation specifies a color model to use for nonmissing observations. For this example, I've used the same white-yellow-orange-red color model.

To use the range attribute map, specify the name of the data set by using the RATTRMAP= option on the PROC SGPLOT statement. A data set can include many different mappings, each defined by a unique ID. In this case, the data set contains only one mapping, and the ID value is 'MapMissing.' Use the RATTRID=MapMissing option to specify the ID value for the map. The following statements use the range attribute data set to assign colors to the heat map tiles. I also had to modify the LEGENDITEM statement so that a cyan-colored swatch appears in the legend.

title "Missing Values Displayed in Custom Color";
proc sgplot data=Clinical RATTRMAP=MyRangeAttrs;  /* <== HERE */
   heatmapparm x=Week y=PatientID colorresponse=Value / outline outlineattrs=(color=gray)
        RATTRID=MapMissing;                       /* <== AND HERE */
   gradlegend;
   refline (1000.5 to 1005.5) / axis=Y lineattrs=(color=black thickness=2);
   xaxis integer values=(0 to 10) valueshint;
   /* if you use FILLATTRS=GraphMissing, you will get gray */
   legenditem type=fill name='missItem' / fillattrs=(color=CYAN) label="Missing Data";
   keylegend 'missItem';
run;

Success! The missed appointments are now displayed by using the (very bright!) cyan color. By using the range attribute map, I have complete control over the colors of the tiles, including the tiles that show missing values.

Summary

In summary, this article shows how to create a range attribute map for a heat map. The primary purpose of a range attribute map is to map a set of continuous values onto a spectrum of colors. However, by using the special keyword "_MISSING_" as a value for the MIN variable, you can control the color that is used to display missing values.

Further reading

The post Change the missing-value color by using a range attribute map appeared first on The DO Loop.

March 28, 2022
 

Longitudinal data are measurements for a set of subjects at multiple points in time. Also called "panel data" or "repeated measures data," this kind of data is common in clinical trials in which patients are tracked over time. Recently, a SAS programmer asked how to visualize missing values in a set of longitudinal data. In his data, the patient data were recorded every week during a 10-week trial, but some patients missed one or more appointments. The programmer wanted to visualize the missed appointments.

This article shows three related topics:

  • A DATA step technique that enables you to input longitudinal data when each subject has a different number of repeated measurements.
  • Two ways to visualize longitudinal data: a spaghetti plot and a heat map (sometimes called a "lasagna plot").
  • How to add missing values to a set of longitudinal data so that you can perform analysis on the missing values and visualize them.

Input longitudinal data: A DATA step technique

The SAS programmer shared a table that showed the structure of his data. To create the data in a SAS data set, I will demonstrate a DATA step technique that deserves to be better known. It involves using a trailing @ to create multiple observations from a single line of input. You can read about the trailing @ in the documentation of the INPUT statement. Basically, it enables you to use multiple INPUT statements to read a single line of data. The following DATA step illustrates the trailing @ technique:

/************************/
/* The trailing @ is useful for reading repeated measurements when each 
   subject has a different number of measurements (or is measured at different times) */
data Have;
input patientID NumVisits @;   /* note trailing @ */
do i = 1 to NumVisits;
   input Week Value @;
   output;
end;
drop i;
/* ID  N  Wk Val Wk Val Wk Val Wk Val Wk Val Wk Val ... */
datalines;
 1001 8   0 12   1 13    2 13   6 13   7 14   8 14.5 9 15  10 13.5 
 1002 5   0 11.5 1 12.5  3 11   9 9.5  10 8   
 1003 6   0 12   3 11    5 10.5 6 11   9 10.5 10 9   
 1004 9   0 11   1 11    2 11   4 7.5  5 6.5  7 7    8 7.5 9 5.5  10 4 
 1005 11  0 10   1 10.5  2 11   3 9    4 7    5 7.5  6 7   7 7.5   8 4  9 6.5  10 5.5
;

The DATA step creates data for five patients in a study. Each record starts with the patient ID and the number of visits for that patient. The first INPUT statement reads that information and uses the trailing @ to hold the pointer until the next INPUT statement. Because we know the number of visits for the patient, we can use a DO loop to read the (Week, Value) pairs for each visit, and use the OUTPUT statement to create one observation for each visit.

All patients have the initial baseline measurement (Week=0) but some patients did not show up for a subsequent weekly visit. Only one patient has measurements for all 11 weeks. One patient has only five measurements, and they are widely spaced apart in time. Overall, these patients attended 39 appointments and missed 16.
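
If you want to check these counts yourself, a one-way frequency table of the PatientID variable is a quick way to do it. This is a minimal sketch that assumes the Have data set was created by the previous DATA step. Each row of the table shows the number of visits for one patient, and the total should equal the number of attended appointments.

```sas
/* count the number of recorded visits for each patient */
proc freq data=Have;
   tables PatientID / nocum nopercent;
run;
```

For these data, the frequencies are 8, 5, 6, 9, and 11, which sum to 39. Because 5 patients x 11 scheduled weeks = 55 appointments, the number of missed appointments is 55 - 39 = 16.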

This is the form of the data that the SAS programmer was given. Notice that there are no SAS missing values (.) in the data. Instead, the missing appointments are implicitly defined by their absence. As we will see later, it is sometimes useful to explicitly represent the missed appointments by using missing values. But first, let's see how to visualize these data in their current form.

Spaghetti and lasagna plots

The traditional way to visualize longitudinal data is to create a spaghetti plot, as follows:

title "Spaghetti Plot of Patient Data";
proc sgplot data=Have;
   series x=Week y=Value / group=PatientID markers curvelabel;
   xaxis grid integer values=(0 to 10) valueshint;
   yaxis grid;
run;

As usual, it is difficult to visualize missing data. The locations of the missing appointments are not easy to see in this line plot. You have to look for line segments that span multiple weeks and do not have markers for one or more weeks, as seen in the lines for PatientID=1001 and PatientID=1002. However, when two lines are near each other (for example, PatientID=1004 and PatientID=1005), it is more difficult to see the missing markers.

Now imagine a study that has 20 or 50 patients. Many lines will overlap, and the missed appointments will be very difficult to see. Although the spaghetti plot is a good way to visualize nonmissing data, it is not good at showing patterns of missing values.

The lasagna plot is an alternative way to visualize longitudinal data. A lasagna plot is a heat map. Each subject is a row. Each time point is a column. Whereas spaghetti plots can be used for unevenly spaced time points, the lasagna plot is most useful when the measurements are taken at evenly spaced time intervals.

You can use the HEATMAPPARM statement in PROC SGPLOT to create a lasagna plot for these data. The following example creates the lasagna plot and uses a white-yellow-orange-red color map to visualize the measurements for each time point.

%let WhiteYeOrRed = (CXFFFFFF CXFFFFB2 CXFECC5C CXFD8D3C CXE31A1C);
/* lasagna plot: https://blogs.sas.com/content/iml/2016/06/08/lasagna-plot-in-sas.html */
title "Lasagna Plot of Patient Data";
proc sgplot data=Have;
heatmapparm x=Week y=PatientID colorresponse=Value / outline outlineattrs=(color=gray)
     colormodel=&WhiteYeOrRed; 
refline (1000.5 to 1005.5) / axis=Y lineattrs=(color=black thickness=2);
xaxis integer values=(0 to 10) valueshint;
run;

The lasagna plot does an excellent job of visualizing the data for each patient. You can see gaps for PatientID=1001 and PatientID=1002. These gaps are missing data.

However, there is a potential problem with this graph. I intentionally used white as one of the colors in the color model because I want to emphasize a problem that can occur when you create a heat map that has missing cells. In this graph, the color white is used for two different purposes! It is the color of the background, and it is the color for the lowest value of the measurement (Value=4.0). When a value is missing, the heat map does not display a cell, and the background color shows through as for PatientID=1001. On the other hand, PatientID=1005 does not have any missing values, but the patient does have a measurement (Week=8) for which Value=4.0. That cell is also white!

There are a few ways to handle this situation:

  • Use a color model that does not include white.
  • Change the color of the background to a color that is not in the color model.
  • Add the missing values to the data and display cells that contain missing values in a different color.

The first option (change the color model) is the easiest but might not be possible if your boss or client insists on a color model that uses white. The subsequent sections explore the other two options.
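
For completeness, the first option can be sketched as follows. The macro variable name YeOrRed is hypothetical; it holds the same color ramp with the white color removed, so the lowest data value is displayed in pale yellow and cannot be confused with a white background.

```sas
/* hypothetical macro variable: the same ramp without white */
%let YeOrRed = (CXFFFFB2 CXFECC5C CXFD8D3C CXE31A1C);
title "Lasagna Plot with a Color Model That Excludes White";
proc sgplot data=Have;
   heatmapparm x=Week y=PatientID colorresponse=Value / outline outlineattrs=(color=gray)
        colormodel=&YeOrRed; 
   refline (1000.5 to 1005.5) / axis=Y lineattrs=(color=black thickness=2);
   xaxis integer values=(0 to 10) valueshint;
run;
```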

Change the background color

The STYLEATTRS statement in PROC SGPLOT enables you to control the colors of various graphical elements. You can specify the background color by using the WALLCOLOR= option. The following call sets the background color to gray:

/* quick and easy way: change the background color of the "wall" in the plot */
title "Change Background Color to Gray";
proc sgplot data=Have;
   styleattrs wallcolor=Gray;
   heatmapparm x=Week y=PatientID colorresponse=Value / outline outlineattrs=(color=gray)
        colormodel=&WhiteYeOrRed; 
   refline (1000.5 to 1005.5) / axis=Y lineattrs=(color=black thickness=2);
   xaxis integer values=(0 to 10) valueshint;
run;

Now the gaps in the heat map are not white. The missing appointments are "visible" via the gray background. You can use the WALLCOLOR= option for all graphs. Notice, however, that the background color shows around the edges of the heat map. If you want to get rid of that edge effect, you can use the options OFFSETMIN=0 and OFFSETMAX=0 on the XAXIS and YAXIS statements. For example:

   yaxis offsetmin=0 offsetmax=0;

Display missing values as cells

The structure of the data makes it difficult to perform an analysis on the number and pattern of missed appointments. In its current form, the data set has 39 observations to represent the 39 visits from among the 55 scheduled appointments. If you want to analyze the missing values, it is better to restructure the data to include SAS missing values in the data.

An alternative structure for these data is for all patients to have 11 observations, but for the Value variable to be missing for any week in which the patient missed an appointment. The following DATA steps create a new data set that has this alternative structure. The first DATA step creates 11 weeks of missing values for each patient. The values of the PatientID variable are read by using a BY statement and the values of the FIRST.PatientID automatic variable. The second DATA step merges this larger data set with the observed data. Any observed values overwrite the missing values.

/* create sequence of missing values for each patient */
data AllMissing;
set Have;
by PatientID; /* assume sorted by PatientID */
if first.PatientID then do;
   do Week = 0 to 10;   
      Value = .;  output;
   end;
end;
run;
/* merge with observed data */
data Want;
   merge AllMissing Have;
   by PatientID Week;
run;

If you create a heat map for the data in this new format, every cell will be plotted. The cells that contain missing values will be displayed by using the color for the GraphMissing style element in the current ODS style. For my ODS style (HTMLBlue), the color for missing cells is gray.

/* the color for missing cells is the GraphMissing color */
title "Missing Values Displayed in GraphMissing Color";
proc sgplot data=Want;
   heatmapparm x=Week y=PatientID colorresponse=Value / outline outlineattrs=(color=gray)
        colormodel=&WhiteYeOrRed; 
   refline (1000.5 to 1005.5) / axis=Y lineattrs=(color=black thickness=2);
   xaxis integer values=(0 to 10) valueshint;
   legenditem type=fill name='missItem' / fillattrs=GraphMissing label="Missing Data";
   keylegend 'missItem';
run;

I like this graph because it uses SAS missing values to visualize the patterns of missed appointments. If you look carefully, you can see that there are 55 cells, one for each appointment. The gray cells represent the 16 missed appointments whereas the other colors represent the 39 clinical measurements. This visualization makes it easy to answer questions such as "which patients missed two or more consecutive appointments?"
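
The explicit missing values also make it easy to analyze the missed appointments numerically. The following sketch assumes the Want data set from the previous section, sorted by PatientID and Week. The data set name MissRuns and the variable names Run and MaxRun are hypothetical names chosen for this illustration. The first step counts the missing values for each patient; the second step finds the longest run of consecutive missed appointments.

```sas
/* number of nonmissing (N) and missing (NMISS) measurements for each patient */
proc means data=Want n nmiss;
   class PatientID;
   var Value;
run;

/* longest run of consecutive missed appointments for each patient */
data MissRuns;
   set Want;
   by PatientID;
   retain Run MaxRun;
   if first.PatientID then do;
      Run = 0;  MaxRun = 0;         /* reset counters for each patient */
   end;
   if missing(Value) then Run + 1;  /* extend the current run of misses */
   else Run = 0;                    /* an attended visit ends the run */
   MaxRun = max(MaxRun, Run);
   if last.PatientID then output;   /* one observation per patient */
   keep PatientID MaxRun;
run;

proc print data=MissRuns noobs; run;
```

A patient for whom MaxRun is 2 or greater missed two or more consecutive appointments.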

Summary

This article discusses some ways to visualize missing values in longitudinal data. The traditional spaghetti plot does not do a good job of visualizing missing values. A heat map (sometimes called a lasagna plot) is a better choice. Depending on the structure of your data, you might need to add missing values to the data. By explicitly including missing values, the patterns of missing values can be visualized more easily.

This article also demonstrates how to use a trailing @ in a DATA step. This enables you to create multiple observations from a single line of input. Thanks to Warren Kuhfeld, who showed me how to read data by using this useful technique many years ago.

The post Visualize missing values in longitudinal data appeared first on The DO Loop.

February 16, 2022
 

A SAS programmer asked an interesting question: If data in a time series has missing values, can you plot a dashed line to indicate that the response is missing at some times?

A simple way to achieve this is by overlaying two lines. The first line (the "bottom" line in the overlay) is dashed. The second line (the "top" line in the overlay) is solid and uses the BREAK option to add gaps where the series has missing data. The result is shown to the left.

Plotting gaps in the data

An important thing to remember about the SG graphics procedures in SAS is that points and lines are displayed in the same order as the statements that you specify. So, if you use two SERIES statements, the first line is plotted "on the bottom," and the second line is plotted "on top."

A second important point is that you can use the BREAK option on the SERIES statement to force a break in the line for each missing value for the Y variable. The BREAK option causes the line to appear as multiple line segments that do not "connect through" missing data. If you do not use the BREAK option, the SERIES statement will connect each valid data point to the next valid data point.

You can therefore plot two lines. The bottom line is dashed and connects through missing values. The top line is solid and breaks at missing values. This is shown in the following call to PROC SGPLOT:

/* create data that has some missing Y values */
data Have;
do x = 0 to 6.2 by 0.2;
   cnt + 1;
   y = 3 + x/4 + 0.5*sin(x**1.5);
   if cnt=10 | cnt=11 | cnt=20 | cnt=21 | 
      cnt=30 | cnt=40 | cnt=41 then 
      y = .;
   output;
end;
run;
 
title "Series with Gaps Due to Missing Values";
proc sgplot data=Have noautolegend;
   series x=x y=y / lineattrs=GraphData1(pattern=dash);
   series x=x y=y / BREAK lineattrs=GraphData1(pattern=solid thickness=2);
run;

The graph is shown at the top of this article.

Display more information about nonmissing data

There might be times when you want to enhance the series plot by showing more information about the location of the nonmissing data. An easy way to do that is to use the MARKERS option to add markers to the graph. The markers are displayed only at locations for which both X and Y are nonmissing. A second way to visualize the locations of the nonmissing values is to add a fringe plot along the bottom of the line plot, as follows:

/* append the "fringe" data: the X value of points that have nonmissing Y value */
data Want;
set Have Have(rename=(x=t) where=(y ^= .));
keep x y t;
run;
 
title "Series and Fringe Plot";
proc sgplot data=Want noautolegend;
   series x=x y=y / lineattrs=GraphData1(pattern=dash);
   series x=x y=y / markers BREAK 
                    lineattrs=GraphData1(pattern=solid thickness=2);
   fringe t;
run;

This graph makes it easier to see the nonmissing values and the locations of the gaps in the data.

Summary

This article shows a cool trick for using a dashed line to indicate that a time series has missing values. The trick is to overlay a dashed line and a solid line. By using the BREAK option on the solid line, the underlying dashed line shows through and visually indicates that missing values are present in the data.

The post A trick to plot a time series that has missing values appeared first on The DO Loop.

January 12, 2022
 

Some colors have names, such as "Red," "Magenta," and "Dark Olive Green." But the most common way to specify a color is to use a hexadecimal value such as CX556B2F. It is not obvious that "Dark Olive Green" and CX556B2F represent the same color, but they do! I like to use color names (when possible) instead of hexadecimal values because the names make the program more readable than the hexadecimal values. For example, a color ramp that is defined by using the names ("DarkSeaGreen" "SandyBrown" "Tomato" "Sienna") is easier to interpret than the equivalent color ramp that is defined by using the hexadecimal values (CX8FBC8F CXF4A460 CXFF6347 CXA0522D).

This article shows how to find a "named color" that is close to any color that you specify. Shakespeare asked, "What's in a name?" To paraphrase his response, this article shows that the name of "Rose" looks just as sweet as CXFF6060 but is easier to use!

Colors in SAS

When you create a graph in SAS, there are three ways to specify colors: use a pre-defined color name from the SAS registry, use the SAS-naming convention to specify hues, or use hexadecimal values to specify an 8-bit color for the RGB color model. An example of a pre-defined color name is "DarkOliveGreen," an example of a hue-based color is "Dark Moderate Green," and an example of a hexadecimal value is CX556B2F. Each hexadecimal value encodes the three RGB values for the color. For example, the hexadecimal values 55, 6B, and 2F correspond to the decimal integers 85, 107, and 47, so CX556B2F can be thought of as the RGB triplet (85, 107, 47).
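
As a quick check of that arithmetic, the following DATA step is a minimal sketch that decodes the example color into its RGB coordinates. It uses the HEX2. informat to convert each pair of hexadecimal digits to a decimal integer. The variable names are chosen for illustration.

```sas
/* decode a CXrrggbb color into decimal R, G, and B coordinates */
data _null_;
   hex = "CX556B2F";                       /* the example color from the text */
   R = input(substr(hex, 3, 2), hex2.);    /* characters 3-4 are the red digits   */
   G = input(substr(hex, 5, 2), hex2.);    /* characters 5-6 are the green digits */
   B = input(substr(hex, 7, 2), hex2.);    /* characters 7-8 are the blue digits  */
   put hex= R= G= B=;                      /* write the values to the log */
run;
```

The log should show R=85, G=107, and B=47, which agrees with the triplet in the previous paragraph.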

In my SAS registry, there are 151 pre-defined color names, whereas there are 256^3 ≈ 16.7 million 8-bit RGB colors. Clearly, there are many RGB colors that do not have names! I thought it would be interesting to write a program that finds the closest pre-defined name to any RGB color that you specify. You can think of each color as a three-dimensional point (R, G, B), where R, G, and B are integers and 0 ≤ R,G,B ≤ 255. Thus, the space of all 8-bit colors is a three-dimensional integer lattice. Colors that are close to each other (in the Euclidean metric) have similar shades. Consequently, you can find the named color that is closest to another color by using the following steps:

  1. Load the pre-defined color names and their RGB values.
  2. For any specified hexadecimal value, convert it to an RGB value.
  3. In RGB coordinates, find the pre-defined color name that is closest (in the Euclidean metric) to the specified value.

For example, if you specify an unnamed color such as CXE99F62=RGB(233, 159, 98), the program can tell you that "SandyBrown"=RGB(244, 164, 96) is the closest pre-defined color to CXE99F62. If you want to make your program more readable (and don't mind modifying the hues a little), you can replace CXE99F62 with "SandyBrown" in your program.

Read colors from the SAS registry

For the reference set, I will use the pre-defined colors in the SAS registry, but you could use any other set of names and RGB values. The SAS documentation shows how to use PROC REGISTRY to list the colors in your SAS registry.

The following program modifies the documentation example and writes the registry colors to a temporary text file:

filename _colors TEMP;    /* create a temporary text file */
 
/* write text file with colors from the registry */
proc registry startat='HKEY_SYSTEM_ROOT\COLORNAMES' list export=_colors;
run; 
 
/* In the flat file, the colors look like this:
"AliceBlue"= hex: F0,F8,FF
"AntiqueWhite"= hex: FA,EB,D7
"Aqua"= hex: 00,FF,FF
"Aquamarine"= hex: 7F,FD,D4
"Azure"= hex: F0,FF,FF
"Beige"= hex: F5,F5,DC
"Bisque"= hex: FF,E4,C4
"Black"= hex: 00,00,00
...
*/

You can use a DATA step to read the registry values and create a SAS data set that contains the names, the hexadecimal representation, and the RGB coordinates for each pre-defined color:

data RegistryRGB;
   infile _colors end=eof;         /* read from text file; last line sets EOF flag to true */
   input;                          /* read one line at a time into _infile_ */
 
   length ColorName $32 hex $8; 
   retain hex "CX000000";
   s = _infile_;
   k = findw(s, 'hex:');          /* does the string 'hex:' appear? */
   if k then do;                  /* this line contains a color */
      i = findc(s, '=', 2);       /* find the equal sign that follows the closing quotation mark (") */
      ColorName = substr(s, 2, i-3);            /* name is between the quotes */
      /* build up the hex value from a comma-delimited value like 'FA,EB,D7' */
      substr(hex, 3, 2) = substr(s, k+5 , 2);
      substr(hex, 5, 2) = substr(s, k+8 , 2);
      substr(hex, 7, 2) = substr(s, k+11, 2);
 
      R = inputn(substr(hex, 3, 2), "HEX2."); /* get RGB coordinates from hex */
      G = inputn(substr(hex, 5, 2), "HEX2.");
      B = inputn(substr(hex, 7, 2), "HEX2.");
   end;
   if k;
   drop k i s;
run;
 
proc print data=RegistryRGB(obs=8); run;

The above program works in SAS 9 and also in SAS Viya if you submit the program through SAS Studio. My versions of SAS each have 151 pre-defined colors. The output from PROC PRINT shows that the RegistryRGB data set contains the ColorName, Hex, R, G, and B variables, which describe each pre-defined color.

Find the closest "named color"

The RegistryRGB data set enables you to answer the following question: Given an RGB color, what "named color" is it closest to?

For example, in the article "A statistical palette of Christmas colors," I created a palette of colors that had the values {CX545733, CX498B60, CX94AF77, CXE99F62, CXF46B35, CXAA471D}. These colors are shown to the right, but it would be challenging to look solely at the hexadecimal values and know what colors they represent. However, if I told you that the colors were close to other colors such as "DarkOliveGreen," "SandyBrown," and "Tomato," you would have a clue about what colors are represented by the hexadecimal values.

The following program uses two functions from previous articles:

proc iml;
/* function to convert an array of colors from hexadecimal to RGB 
   https://blogs.sas.com/content/iml/2014/10/06/hexadecimal-to-rgb.html */
start Hex2RGB(_hex);
   hex = colvec(_hex);        /* convert to column vector */
   rgb = j(nrow(hex),3);      /* allocate three-column matrix for results */
   do i = 1 to nrow(hex);     /* for each color, translate hex to decimal */
      rgb[i,] = inputn(substr(hex[i], {3 5 7}, 2), "HEX2.");
   end;
   return( rgb);
finish;
 
/* Compute indices (row numbers) of k nearest neighbors.
   INPUT:  S    an (n x d) data matrix
           R    an (m x d) matrix of reference points
           k    specifies the number of nearest neighbors (k>=1) 
   OUTPUT: idx  an (n x k) matrix of row numbers. idx[,j] contains the
                row numbers (in R) of the j_th closest elements to S
           dist an (n x k) matrix. dist[,j] contains the distances
                between S and the j_th closest elements in R
   https://blogs.sas.com/content/iml/2016/09/28/distance-between-two-group.html
*/
start PairwiseNearestNbr(idx, dist, S, R, k=1);
   n = nrow(S);
   idx = j(n, k, .);
   dist = j(n, k, .);
   D = distance(S, R);          /* n x m */
   do j = 1 to k;
      dist[,j] = D[ ,><];       /* smallest distance in each row */
      idx[,j] = D[ ,>:<];       /* column of smallest distance in each row */
      if j < k then do;         /* prepare for next closest neighbors */
         ndx = sub2ndx(dimension(D), T(1:n)||idx[,j]);
         D[ndx] = .;            /* set elements to missing */
      end;      
   end;
finish;

With those two functions defined, the remainder of the program is easy: read the reference RGB colors, define a palette of colors, and find the closest reference color to each specified color.

/* read the set of reference colors, which have names */
use RegistryRGB;
read all var {ColorName hex};
close;
RegRGB = Hex2RGB(hex);         /* RGB values for the colors in the SAS registry */
 
/* define the hex values that you want to test */
HaveHex = {CX545733, CX498B60, CX94AF77, CXE99F62, CXF46B35, CXAA471D};
HaveRGB = Hex2RGB(HaveHex);    /* convert test values to RGB coordinates */
 
run PairwiseNearestNbr(ClosestIdx, Dist, HaveRGB, RegRGB);
ClosestName = ColorName[ClosestIdx];  /* names of closest reference colors */
ClosestHex = hex[ClosestIdx];         /* hex values for closest reference colors */
print HaveHex ClosestHex ClosestName Dist;

The table shows the closest reference color to each specified color. For example, the color CX545733 is closest to the reference color "DarkOliveGreen." How close are they? In three-dimensional RGB coordinates, they are about 20.4 units apart. If you want to see the difference in each coordinate direction, you can print the difference between the RGB values:

/* how different is each coordinate? */
Diff = HaveRGB - RegRGB[Closestidx,];
print Diff[c={R G B}];

You can see that the red and blue coordinates of CX545733 and "DarkOliveGreen" are almost identical. The green coordinates differ by 20 units or about 8%. The "SandyBrown" color is a very good approximation to CXE99F62 because the distance between those colors is about 12.2 units. Every RGB coordinate of "SandyBrown" is within 11 units of the corresponding coordinate of CXE99F62.
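
The values in the Dist column can be recovered from the coordinate differences, because the Euclidean distance is the square root of the sum of the squared differences in each row. The following statements are a sketch that assumes the Diff and Dist matrices from the previous steps are still defined in the PROC IML session. The names CheckDist and PctDiff are introduced here for illustration.

```sas
/* recompute the Euclidean distance from the coordinate differences */
CheckDist = sqrt( Diff[ ,##] );      /* ## = sum of squares across each row */
/* express each coordinate difference as a percentage of the 0-255 range */
PctDiff = 100 * abs(Diff) / 255;
print Dist CheckDist, PctDiff[c={R G B} format=5.1];
```

If the two distance columns agree, the reported distances are ordinary Euclidean distances in RGB space. The percentage table puts the differences on a common scale: a difference of 20 units is 20/255, or about 8% of the range of a coordinate.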

You can display both palettes adjacent to each other to compare how well the reference colors approximate the test colors:

/* visualize the palettes */
ods graphics / width=640px height=160px;
k = nrow(HaveHex);
run HeatmapDisc(1:k, HaveHex) title="Original Palette" ShowLegend=0 
               xvalues=HaveHex yvalues="Colors";
run HeatmapDisc(1:k, ClosestHex) title="Nearby Palette of Registry Colors" ShowLegend=0 
               xvalues=ClosestName yvalues="Colors";

The eye can detect small differences in shades, but the overall impression is that the palette of named colors is very similar to the original palette. The palette of named colors is more informative in the sense that people can visualize "SeaGreen" and "Tomato" without seeing the palette.

Summary

This article discusses how to create a SAS data set that contains the names and RGB values of a set of "named colors." For this article, I used the named colors in the SAS registry. You can use these named colors as reference colors. Given any other color, you can find the reference color that is closest to the specified color. This enables you to describe the color as being "close to SeaGreen" or "close to SandyBrown," which might help you when you discuss colors with your colleagues.

This article is about approximating colors by using a set of reference colors. If you want to visualize the reference colors themselves, Robert Allison has shown how to display a color swatch for each color in a SAS data set.

The post How to assign a name to a color appeared first on The DO Loop.

January 3, 2022
 

Last year, I wrote almost 100 posts for The DO Loop blog. My most popular articles were about data visualization, statistics and data analysis, and simulation and bootstrapping. If you missed any of these gems when they were first published, here are some of the most popular articles from 2021:

SAS programming and data visualization

  • Display the first or last observations in data: Whether your data are in a SAS data set or a SAS/IML matrix, this article describes how to display the top rows (and bottom rows!) of a rectangular data set.
  • Customize titles in a visualization of BY groups: Have you ever used the BY statement to graph data across groups, such as Males/Females or Control/Experimental groups? If so, you might want to learn how to use the #BYVAR and #BYVAL keywords to customize the titles that appear on each graph.
  • Reasons to prefer a horizontal bar chart: Bar charts are used to visualize the counts of categorical data. Vertical charts are used more often, but there are advantages to using a horizontal bar chart, especially if you are displaying many categories or categories that have long labels. This article shows how to create a horizontal bar chart in SAS and gives examples for which a horizontal chart is preferable.
  • Why you should visualize distributions: It is common to report the means (or the difference between means) for different groups in a study. However, means and standard deviations only tell part of the story. This article shows four examples where the mean difference between group scores is five points. However, when you plot the data for each example, you discover additional information about how the groups differ.

Simulation and Bootstrapping

Since I am the guy who wrote the book on statistical simulation in SAS, it is not surprising that my simulation articles are popular. Simulation helps analysts understand expected values, sampling variation, and standard errors for statistics.

Did you resolve to learn something new in the New Year? Reading these articles requires some effort, but they provide tips and techniques that make the effort worthwhile. So, read (or re-read!) these popular articles from 2021. To ensure you don't miss a single article in 2022, consider subscribing to The DO Loop.

The post Top 10 posts from The DO Loop in 2021 appeared first on The DO Loop.

December 13, 2021
 

In a previous article, I visualized seven Christmas-themed palettes of colors, as shown to the right. You can see that the palettes include many red, green, and golden colors. Clearly, the colors in the Christmas palettes are not a random sample from the space of RGB colors. Rather, they represent a specific subset of Christmas colors. I thought it would be fun to use principal component analysis to compute and visualize the linear subspace that best captures the set of Christmas colors!

Convert hexadecimal colors to RGB

Each color can be represented as an ordered triplet of unsigned integers in RGB space. Colors in SAS use 8 bits per channel, which means that each coordinate in RGB space is an integer between 0 and 255. The following DATA step reads the colors in the Christmas palettes and uses the HEX2. informat to convert the hexadecimal values to their equivalent RGB values. As explained in the previous article, the DATA step also creates an ID value for each color and creates a macro variable (AllColors) that contains the list of all colors.

/* read Christmas palettes. Convert hex to RGB */
data RGB;
length Name $30 color $8;
length palette $450; /* must be big enough to hold all colors */
retain ID 1 palette;
input Name 1-22 n @;
do col = 1 to n;
   input color @;
   R = inputn(substr(color, 3, 2), "HEX2."); /* get RGB colors from hex */
   G = inputn(substr(color, 5, 2), "HEX2.");
   B = inputn(substr(color, 7, 2), "HEX2.");
   palette = catx(' ', palette, color);
   output;
   ID + 1;                             /* assign a unique ID to each color */
end;
call symput('AllColors', palette);   /* concatenate colors; store in macro */
drop n col palette;
/* Palette Name      |n| color1 | color2 | color3 | color4 | ... */
datalines;
Jingle Bell           6 CX44690D CX698308 CXF3C543 CXFFEDC7 CXCA2405 CX9E1007
Holiday Red and Gold  6 CXCF1931 CXAD132D CXD9991A CXEAA61E CXF2BC13 CX216A1B
Green and Gold        6 CX03744B CX15885A CX1E9965 CXFBE646 CXFBC34D CXF69F44
Unwrapping My Gifts   5 CX2A5B53 CX5EB69D CXECEBF1 CXD34250 CX5F1326
Christmas Wrapping    6 CX237249 CX3B885C CXE5D6B5 CXE3CD8E CXDA111E CXC00214
Christmas Wedding     6 CX325C39 CX9C9549 CXDBAA46 CXFFE9D9 CXFF4A4A CXDB2121
Real Christmas Tree   6 CX779645 CX497542 CX274530 CX6E3C3B CXBF403B CXEDB361
;

The result of this DATA step is a data set that contains 41 rows. The R, G, and B variables contain the red, green, and blue components, respectively, which are in the range 0 to 255. Thus, the data contains 41 triplets of RGB values.
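
One way to confirm these claims is to ask SAS for the number of observations and the range of each coordinate. The following is only a sketch; it assumes that the RGB data set and the AllColors macro variable were created by the DATA step above.

```sas
/* confirm 41 observations and that R, G, and B are within [0, 255] */
proc means data=RGB n min max;
   var R G B;
run;

/* display the list of colors stored in the macro variable */
%put &=AllColors;
```

The N column should show 41 for each variable, and the log should list the 41 hexadecimal colors in the order in which they were read.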

Let's use PROC PRINCOMP in SAS to run a principal component analysis of these 41 observations:

proc princomp data=RGB N=2 out=PCOut plots(only)=(Score Pattern);
   var R G B;
   ID color;
run;

Recall that each principal component is a linear combination of the original variables (R, G, and B). The table shows that the first principal component (PC) is the linear combination
PC1 = 0.39*R + 0.66*G + 0.64*B
The first PC is a weighted sum of the amount of color across all three variables. More weight is given to the green and blue components than to the red component. Along the axis of the first PC, black is on the extreme left since black has the RGB value (0, 0, 0). Similarly, white is on the extreme right since white has the RGB value (255, 255, 255).

The second principal component is the linear combination
PC2 = 0.91*R - 0.21*G - 0.35*B
The second PC is a contrast between the red coordinate and the green and blue coordinates. Along the axis of the second PC, colors that have a lot of green and blue (such as cyan, which is (0, 255, 255)) have extreme negative values whereas colors that are mostly red have extreme positive values.

An additional table (not shown) indicates that 89% of the variation in the data is explained by using the first two principal components. You could add a third principal component if you want to have a way to separate the green and blue colors.
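
If you want to see that table, or to keep the third component, a sketch along the following lines should work. It assumes the RGB data set from the DATA step above. The output data set name PCOut3 is a hypothetical name, chosen so that the PCOut data set is not overwritten.

```sas
/* display the eigenvalue and eigenvector tables; keep all three PCs */
proc princomp data=RGB N=3 out=PCOut3;
   var R G B;
   ods select Eigenvalues Eigenvectors;
run;
```

The Eigenvalues table reports the proportion and cumulative proportion of variance for each component. The PCOut3 data set contains the scores Prin1, Prin2, and Prin3 for each color.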

The PRINCOMP procedure creates a graph that projects each RGB value onto the principal component axes. Because I put the COLOR variable on the ID statement, each marker is labeled by using its hexadecimal value:

This graph shows how the colors in the Christmas palettes are distributed in the space of the first two PCs. However, wouldn't it be cool if the markers in this plot were the colors that the markers represent? The next section adds color to the scatter plot by using PROC SGPLOT.

Adding color to the score plot

The previous call to PROC PRINCOMP includes the OUT= option, which writes the principal component scores to a SAS data set. The names of the scores are PRIN1 and PRIN2. If you use the PRIN1 and PRIN2 variables to create a scatter plot, you can re-create the score plot from PROC PRINCOMP. However, I am going to add a twist. As shown in my previous post, I will use the GROUP= option to specify that the marker colors be assigned according to the value of the ID variable, whose values are the integers 1, 2, ..., 41. I will use the STYLEATTRS statement and the DATACONTRASTCOLORS= option to specify the colors that should be used for each group. The result is a scatter plot in which each marker is colored according to a specified list of colors.

title "Principal Component Scores for Christmas Colors";
proc sgplot data=PCOut noautolegend aspect=1;
   label Prin1="Component 1 (60.37%)" Prin2="Component 2 (28.62%)";
   styleattrs datacontrastcolors=(&AllColors);
   scatter x=Prin1 y=Prin2 / markerattrs=(symbol=CircleFilled size=16) group=ID;
   refline 0 / axis=x;
   refline 0 / axis=y;
run;

The graph shows the distribution of the colors, using the colors themselves as a visual cue. The graph shows that the first PC differentiates between dark colors (on the left) and light colors (on the right). The second PC differentiates between blue-green colors (on the bottom) and red colors (on the top).

Summary

I almost titled this article, "A Statistician Looks at Christmas Colors." I started with a set of Christmas-themed palettes with 41 colors that were primarily red, green, and gold. By performing a principal component analysis, you can project the RGB coordinates of these colors onto the plane that captures most of the variation in the data. If you color the markers in this projection, you obtain a scatter plot that shows green colors in one region, red colors in another region, and gold and silver colors in yet another.

The graph shows the various Christmas colors in one image. To me, the image seems more coherent and organized than the original strips of colors. I like that similar colors are grouped together, and that the three groups of colors (red, green, and gold) are visibly distinct.

The post A principal component analysis of color palettes appeared first on The DO Loop.