Statistical Graphics

1月 122022
 

Some colors have names, such as "Red," "Magenta," and "Dark Olive Green." But the most common way to specify a color is to use a hexadecimal value such as CX556B2F. It is not obvious that "Dark Olive Green" and CX556B2F represent the same color, but they do! I like to use color names (when possible) instead of hexadecimal values because the names make the program more readable than the hexadecimal values. For example, a color ramp that is defined by using the names ("DarkSeaGreen" "SandyBrown" "Tomato" "Sienna") is easier to interpret than the equivalent color ramp that is defined by using the hexadecimal values (CX8FBC8F CXF4A460 CXFF6347, CXA0522D).

This article shows how to find a "named color" that is close to any color that you specify. Shakespeare asked, "What's in a name?" To paraphrase his response, this article shows that the name of "Rose" looks just as sweet as CXFF6060 but is easier to use!

Colors in SAS

When you create a graph in SAS, there are three ways to specify colors: use a pre-defined color name from the SAS registry, use the SAS-naming convention to specify hues, or use hexadecimal values to specify an 8-bit color for the RGB color model. An example of a pre-defined color name is "DarkOliveGreen," an example of a hue-based color is "Dark Moderate Green," and an example of a hexadecimal value is CX556B2F. Each hexadecimal value encodes the three RGB values for the color. For example, the hexadecimal values 55, 6B, and 2F correspond to the decimal integers 85, 107, and 47, so CX556B2F can be thought of as the RGB triplet (85, 107, 47).

In my SAS registry, there are 151 pre-defined color names, whereas there are 2563 = 16.7 million 8-bit RGB colors. Clearly, there are many RGB colors that do not have names! I thought it would be interesting to write a program that finds the closest pre-defined name to any RGB color that you specify. You can think of each color as a three-dimensional point (R, G, B), where R, G, and B are integers and 0 ≤ R,G,B ≤ 255. Thus, the space of all 8-bit colors is a three-dimensional integer lattice. Colors that are close to each other (in the Euclidean metric) have similar shades. Consequently, you can find named color that is closest to another color by using the following steps:

  1. Load the pre-defined color names and their RGB values.
  2. For any specified hexadecimal value, convert it to an RGB value.
  3. In RGB coordinates, find the pre-defined color name that is closest (in the Euclidean metric) to the specified value.

For example, if you specify an unnamed color such as CXE99F62=RGB(244, 107, 53), the program can tell you that "SandyBrown"=RGB(244, 164, 96) is the closest pre-defined color to CXE99F62. If you want to make your program more readable (and don't mind modifying the hues a little), you can replace CXE99F62 with "SandyBrown" in your program.

Read colors from the SAS registry

For the reference set, I will use the pre-defined colors in the SAS registry, but you could use any other set of names and RGB values. The SAS documentation shows how to use PROC REGISTRY to list the colors in your SAS registry.

The following program modifies the documentation example and writes the registry colors to a temporary text file:

filename _colors TEMP;    /* create a temporary text file */
 
/* write text file with colors from the registry */
proc registry startat='HKEY_SYSTEM_ROOT\COLORNAMES' list export=_colors;
run; 
 
/* In the flat file, the colors look like this:
"AliceBlue"= hex: F0,F8,FF
"AntiqueWhite"= hex: FA,EB,D7
"Aqua"= hex: 00,FF,FF
"Aquamarine"= hex: 7F,FD,D4
"Azure"= hex: F0,FF,FF
"Beige"= hex: F5,F5,DC
"Bisque"= hex: FF,E4,C4
"Black"= hex: 00,00,00
...
*/

You can use a DATA step to read the registry values and create a SAS data set that contains the names, the hexadecimal representation, and the RGB coordinates for each pre-defined color:

data RegistryRGB;
   infile _colors end=eof;         /* read from text file; last line sets EOF flag to true */
   input;                          /* read one line at a time into _infile_ */
 
   length ColorName $32 hex $8; 
   retain hex "CX000000";
   s = _infile_;
   k = findw(s, 'hex:');          /* does the string 'hex' appear? */
   if k then do;                  /* this line contains a color */
      i = findc(s, '=', 2);       /* find the second quotation mark (") */
      ColorName = substr(s, 2, i-3);            /* name is between the quotes */
      /* build up the hex value from a comma-delimited value like 'FA,EB,D7' */
      substr(hex, 3, 2) = substr(s, k+5 , 2);
      substr(hex, 5, 2) = substr(s, k+8 , 2);
      substr(hex, 7, 2) = substr(s, k+11, 2);
 
      R = inputn(substr(hex, 3, 2), "HEX2."); /* get RGB coordinates from hex */
      G = inputn(substr(hex, 5, 2), "HEX2.");
      B = inputn(substr(hex, 7, 2), "HEX2.");
   end;
   if k;
   drop k i s;
run;
 
proc print data=RegistryRGB(obs=8); run;

The above program works in SAS 9 and also in SAS Viya if you submit the program through SAS Studio. My versions of SAS each have 151 pre-defined colors. The output from PROC PRINT shows that the RegistryRGB data set contains the ColorName, Hex, R, G, and B variables, which describe each pre-defined color.

Find the closest "named color"

The RegistryRGB data set enables you to answer the following question: Given an RGB color, what "named color" is it closest to?

For example, in the article "A statistical palette of Christmas colors," I created a palette of colors that had the values {CX545733, CX498B60, CX94AF77, CXE99F62, CXF46B35, CXAA471D}. These colors are shown to the right, but it would be challenging to look solely at the hexadecimal values and know what colors they represent. However, if told you that the colors were close to other colors such as "DarkOliveGreen," "SandyBrown," and "Tomato," you would have a clue about what colors are represented by the hexadecimal values.

The following program uses two functions from previous articles:

proc iml;
/* function to convert an array of colors from hexadecimal to RGB 
   https://blogs.sas.com/content/iml/2014/10/06/hexadecimal-to-rgb.html */
start Hex2RGB(_hex);
   hex = colvec(_hex);        /* convert to column vector */
   rgb = j(nrow(hex),3);      /* allocate three-column matrix for results */
   do i = 1 to nrow(hex);     /* for each color, translate hex to decimal */
      rgb[i,] = inputn(substr(hex[i], {3 5 7}, 2), "HEX2.");
   end;
   return( rgb);
finish;
 
/* Compute indices (row numbers) of k nearest neighbors.
   INPUT:  S    an (n x d) data matrix
           R    an (m x d) matrix of reference points
           k    specifies the number of nearest neighbors (k>=1) 
   OUTPUT: idx  an (n x k) matrix of row numbers. idx[,j] contains the
                row numbers (in R) of the j_th closest elements to S
           dist an (n x k) matrix. dist[,j] contains the distances
                between S and the j_th closest elements in R
   https://blogs.sas.com/content/iml/2016/09/28/distance-between-two-group.html
*/
start PairwiseNearestNbr(idx, dist, S, R, k=1);
   n = nrow(S);
   idx = j(n, k, .);
   dist = j(n, k, .);
   D = distance(S, R);          /* n x m */
   do j = 1 to k;
      dist[,j] = D[ ,><];       /* smallest distance in each row */
      idx[,j] = D[ ,>:<];       /* column of smallest distance in each row */
      if j < k then do;         /* prepare for next closest neighbors */
         ndx = sub2ndx(dimension(D), T(1:n)||idx[,j]);
         D[ndx] = .;            /* set elements to missing */
      end;      
   end;
finish;

With those two functions defined, the remainder of the program is easy: read the reference RGB colors, define a palette of colors, and find the closest reference color to each specified color.

/* read the set of reference colors, which have names */
use RegistryRGB;
read all var {ColorName hex};
close;
RegRGB = Hex2RGB(hex);         /* RGB values for the colors in the SAS registry */
 
/* define the hex values that you want to test */
HaveHex = {CX545733, CX498B60, CX94AF77, CXE99F62, CXF46B35, CXAA471D};
HaveRGB = Hex2RGB(HaveHex);    /* convert test values to RGB coordinates */
 
run PairwiseNearestNbr(ClosestIdx, Dist, HaveRGB, RegRGB);
ClosestName = ColorName[ClosestIdx];  /* names of closest reference colors */
ClosestHex = hex[ClosestIdx];         /* hex values for closest reference colors */
print HaveHex ClosestHex ClosestName Dist;

The table shows the closest reference color to each specified color. For example, the color CX545733 is closest to the reference color "DarkOliveGreen." How close are they? In three-dimensional RGB coordinates, they are about 20.4 units apart. If you want to see the difference in each coordinate direction, you can print the difference between the RGB values:

/* how different is each coordinate? */
Diff = HaveRGB - RegRGB[Closestidx,];
print Diff[c={R G B}];

You can see that the red and blue coordinates of CX545733 and "DarkOliveGreen" are almost identical. The green coordinates differ by 20 units or about 8%. The "SandyBrown" color is a very good approximation to CXE99F62 because the distance between those colors is about 12.2 units. Every RGB coordinate of "SandyBrown" is within 11 units of the corresponding coordinate of CXE99F62.

You can display both palettes adjacent to each other to compare how well the reference colors approximate the test colors:

/* visualize the palettes */
ods graphics / width=640px height=160px;
k = nrow(HaveHex);
run HeatmapDisc(1:k, HaveHex) title="Original Palette" ShowLegend=0 
               xvalues=HaveHex yvalues="Colors";
run HeatmapDisc(1:k, ClosestHex) title="Nearby Palette of Registry Colors" ShowLegend=0 
               xvalues=ClosestName yvalues="Colors";

The eye can detect small differences in shades, but the overall impression is that the palette of named colors is very similar to the original palette. The palette of named colors is more informative in the sense that people have can visualize "SeaGreen" and "Tomato" without seeing the palette.

Summary

This article discusses how to create a SAS data set that contains the names and RGB values of a set of "named colors." For this article, I used the named colors in the SAS registry. You can use these named colors as reference colors. Given any other color, you can find the reference color that is closest to the specified color. This enables you to describe the color as being "close to SeaGreen" or "close to SandyBrown," which might help you when you discuss colors with your colleagues.

This article is about approximating colors by using a set of reference colors. If you want to visualize the reference colors themselves, Robert Allison has shown how to display a color swatch for each color in a SAS data set.

The post How to assign a name to a color appeared first on The DO Loop.

1月 032022
 

Last year, I wrote almost 100 posts for The DO Loop blog. My most popular articles were about data visualization, statistics and data analysis, and simulation and bootstrapping. If you missed any of these gems when they were first published, here are some of the most popular articles from 2021:

SAS programming and data visualization

Panel of Regression Diagnostic Plots in SAS

SAS programming and data visualization

  • Display the first or last observations in data: Whether your data are in a SAS data set or a SAS/IML matrix, this article describes how to display to print the top rows (and bottom rows!) of a rectangular data set.
  • Customize titles in a visualization of BY groups: Have you ever use the BY statement to graph data across groups, such as Males/Females or Control/Experimental groups? If so, you might want to learn how to use the #BYVAR and #BYVAL keywords to customize the titles that appear on each graph.
  • Horizontal Bar Chart in SAS

  • Reasons to prefer a horizontal bar chart: Bar charts are used to visualize the counts of categorical data. Vertical charts are used more often, but there are advantages to using a horizontal bar chart, especially if you are displaying many categories or categories that have long labels. This article shows how to create a horizontal bar chart in SAS and gives examples for which a horizontal chart is preferable.
  • Why you should visualize distributions: It is common to report the means (of difference between means) for different groups in a study. However, means and standard deviations only tell part of the story. This article shows four examples where the mean difference between group scores is five points. However, when you plot the data for each example, you discover additional information about how the groups differ.

Simulation and Bootstrapping

Since I am the guy who wrote the book on statistical simulation in SAS, it is not surprising that my simulation articles are popular. Simulation helps analysts understand expected values, sampling variation, and standard errors for statistics.

Bootstrapping Residuals for a Time Series Regression Model

Did you resolve to learn something new in the New Year? Reading these articles requires some effort, but they provide tips and techniques that make the effort worthwhile. So, read (or re-read!) these popular articles from 2021. To ensure you don't miss a single article in 2022, consider subscribing to The DO Loop.

The post Top 10 posts from <em>The DO Loop</em> in 2021 appeared first on The DO Loop.

12月 132021
 

In a previous article, I visualized seven Christmas-themed palettes of colors, as shown to the right. You can see that the palettes include many red, green, and golden colors. Clearly, the colors in the Christmas palettes are not a random sample from the space of RGB colors. Rather, they represent a specific subset of Christmas colors. I thought it would be fun to use principal component analysis to compute and visualize the linear subspace that best captures the set of Christmas colors!

Convert hexadecimal colors to RGB

Each color can be represented as an ordered triplet of unsigned integers in RGB space. The colors in SAS are 8-bit colors, which means that each coordinate in RGB space has a value between 0 and 255. The following DATA step reads the colors in the Christmas palettes and uses the HEX2. informat to convert the hexadecimal values to their equivalent RGB values. As explained in the previous article, the DATA step also creates an ID value for each color and creates a macro variable (AllColors) that contains the list of all colors.

/* read Christmas palettes. Convert hex to RGB */
data RGB;
length Name $30 color $8;
length palette $450; /* must be big enough to hold all colors */
retain ID 1 palette;
input Name 1-22 n @;
do col = 1 to n;
   input color @;
   R = inputn(substr(color, 3, 2), "HEX2."); /* get RGB colors from hex */
   G = inputn(substr(color, 5, 2), "HEX2.");
   B = inputn(substr(color, 7, 2), "HEX2.");
   palette = catx(' ', palette, color);
   output;
   ID + 1;                             /* assign a unique ID to each color */
end;
call symput('AllColors', palette);   /* concatenate colors; store in macro */
drop n col palette;
/* Palette Name      |n| color1 | color2 | color2 | color4 | ... */
datalines;
Jingle Bell           6 CX44690D CX698308 CXF3C543 CXFFEDC7 CXCA2405 CX9E1007
Holiday Red and Gold  6 CXCF1931 CXAD132D CXD9991A CXEAA61E CXF2BC13 CX216A1B
Green and Gold        6 CX03744B CX15885A CX1E9965 CXFBE646 CXFBC34D CXF69F44
Unwrapping My Gifts   5 CX2A5B53 CX5EB69D CXECEBF1 CXD34250 CX5F1326
Christmas Wrapping    6 CX237249 CX3B885C CXE5D6B5 CXE3CD8E CXDA111E CXC00214
Christmas Wedding     6 CX325C39 CX9C9549 CXDBAA46 CXFFE9D9 CXFF4A4A CXDB2121
Real Christmas Tree   6 CX779645 CX497542 CX274530 CX6E3C3B CXBF403B CXEDB361
;

The result of this DATA step is a data set that contains 41 rows. The R, G, and B variables contain the red, green, and blue components, respectively, which are in the range 0 to 255. Thus, the data contains 41 triplets of RGB values.

Let's use PROC PRINCOMP in SAS to run a principal component analysis of these 41 observations:

proc princomp data=RGB N=2 out=PCOut plots(only)=(Score Pattern);
   var R G B;
   ID color;
run;

Recall that each principal component is a linear combination of the original variables (R, G, and B). The table shows that the first principal component (PC) is the linear combination
PC1 = 0.39*R + 0.66*G + 0.64*B
The first PC is a weighted sum of the amount of color across all three variables. More weight is given to the green and blue components than to the red component. Along the axis of the first PC, black is on the extreme left since black has the RGB value (0, 0, 0). Similarly, white is on the extreme right since white has the RGB value (255, 255, 255).

The second principal component is the linear combination
PC2 = 0.91*R - 0.21*G - 0.35*B
The second PC is a contrast between the red coordinate and the green and blue coordinates. Along the axis of the second PC, colors that have a lot of green and blue (such as cyan, which is (0, 255, 255)) have extreme negative values whereas colors that are mostly red have extreme positive values.

An additional table (not shown) indicates that 89% of the variation in the data is explained by using the first two principal components. You could add a third principal component if you want to have a way to separate the green and blue colors.

The PRINCOMP procedure creates a graph that projects each RGB value onto the principal component axes. Because I put the COLOR variable on the ID statement, each marker is labeled by using its hexadecimal value:

This graph shows how the colors in the Christmas palettes are distributed in the space of the first two PCs. However, wouldn't it be cool if the markers in this plot were the colors that the markers represent? The next section adds color to the scatter plot by using PROC SGPLOT.

Adding color to the score plot

The previous call to PROC PRINCOMP includes the OUT= option, which writes the principal component scores to a SAS data set. The names of the scores are PRIN1 and PRIN2. If you use the PRIN1 and PRIN2 variables to create a scatter plot, you can re-create the score plot from PROC PRINCOMP. However, I am going to add a twist. As shown in my previous post, I will use the GROUP= option to specify that the marker colors be assigned according to the value of the ID variables, which are the integers 1, 2, ..., 41. I will use the STYLEATTRS statement and the DATACONTRASTCOLORS= option to specify that the colors that should be used for each group. The result is a scatter plot in which each marker is colored according to a specified list of colors.

title "Principal Component Scores for Christmas Colors";
proc sgplot data=PCOut noautolegend aspect=1;
   label Prin1="Component 1 (60.37%)" Prin2="Component 2 (28.62%)";
   styleattrs datacontrastcolors=(&AllColors);
   scatter x=Prin1 y=Prin2 / markerattrs=(symbol=CircleFilled size=16) group=ID;
   refline 0 / axis=x;
   refline 0 / axis=y;
run;

The graph shows the distribution of the colors, using the colors themselves as a visual cue. The graph shows that the first PC differentiates between dark colors (on the left) and light colors (on the right). The second PC differentiates between blue-green colors (on the bottom) and red colors (on the top).

Summary

I almost titled this article, "A Statistician Looks at Christmas Colors." I started with a set of Christmas-themed palettes with 41 colors that were primarily red, green, and gold. By performing a principal component analysis, you can project the RGB coordinates of these colors onto the plane that captures most of the variation in the data. If you color the markers in this projection, you obtain a scatter plot that shows green colors in one region, red colors in another region, and gold and silver colors in yet another.

The graph shows the various Christmas colors in one image. To me, the image seems more coherent and organized than the original strips of colors. I like that similar colors are grouped together, and that the three groups of colors (red, green, and blue) are visibly distinct.

The post A principal component analysis of color palettes appeared first on The DO Loop.

12月 082021
 

In data visualization, colors can represent the values of a variable in a choropleth map, a heatmap, or a scatter plot. But how do you visualize a palette of colors from the RGB or hexadecimal values of the colors? One way is to use the HEATMAPDISC subroutine in SAS/IML, which enables you to specify a color ramp as part of the syntax. If you do not have SAS/IML software, you can use the HEATMAPPARM statement in PROC SGPLOT and use a discrete attribute map to define which color is used for each cell in the heat map. If you do not yet know about discrete attribute maps, I highly recommend that you take a moment to learn about them; they are very useful!

If you do not want to use a discrete attribute map and you do not have access to SAS/IML, there is a third way to visualize color palettes: use the POLYGON statement in PROC SGPLOT. As discussed in this article, each polygon has a unique value of an ID variable. You can use the GROUP= option to associate a color with each value of the ID variable. You can use the DATACOLORS= option on the STYLEATTRS statement to specify the colors for each polygon. This article uses rectangular polygons and arranges them so that you can visualize a palette of colors. Because I am writing this article in the Christmas season, I will use a set of Christmas-themed red-and-green palettes to demonstrate the technique. The final visualization is shown to the right.

How to use the POLYGON statement to visualize a palette

The following list outlines the main steps for using the POLYGON statement in PROC SGPLOT to visualize a set of palettes:

  1. Use the DATA step to read the colors in each palette.
  2. Assign a unique ID to each color.
  3. Concatenate all the colors and store the set of colors in a macro variable. This macro variable will be used to assign a color to each polygon.
  4. Create a polygon for each color. This example uses rectangles. The i_th color for the j_th palette is represented by a rectangle that is centered at the coordinate (i,j).
  5. Optionally, use the TEXT statement in PROC SGPLOT to overlay the hexadecimal values of the colors on the graph.
  6. Optionally, use the TEXT statement to display the name of each palette.
  7. Use the POLYGON and TEXT statements to display the palettes of colors.

Read the colors in each palette

I found several Christmas-themed palettes at www.schemecolor.com. The specific palettes are cited in the comments of the following program, which implements the first three steps of the outline:

ods graphics/reset;
/* Acknowledgements:
   Jingle Bell palette:  https://www.schemecolor.com/jingle-bells.php
   Holiday Red and Gold: https://www.schemecolor.com/holiday-red-gold.php
   Green and Gold:       https://www.schemecolor.com/green-and-gold.php
   Unwrapping My Gifts:  https://www.schemecolor.com/unwrapping-my-gifts.php
   Christmas Wrapping:   https://www.schemecolor.com/christmas-wrapping.php
   Christmas Wedding:    https://www.schemecolor.com/christmas-wedding.php
   Real Christmas Tree:  https://www.schemecolor.com/real-christmas-tree.php
*/
/* 1. Use the DATA step to read the colors in each palette. */
data Palettes;
length Name $30 color $8;
length palette $450; /* must be big enough to hold all colors */
retain ID 1 row 1 palette;
input Name 1-22 n @;
do col = 1 to n;
   input color @;
   palette = catx(' ', palette, color);
   output;
   ID + 1;    /* Assign a unique ID to each color. */
end;
row + 1;
call symput('AllPalettes', palette);  /* 3. Concatenate colors; store in macro. */
drop palette;
/* Palette Name      |n| color1 | color2 | color2 | color4 | ... */
datalines;
Jingle Bell           6 CX44690D CX698308 CXF3C543 CXFFEDC7 CXCA2405 CX9E1007
Holiday Red and Gold  6 CXCF1931 CXAD132D CXD9991A CXEAA61E CXF2BC13 CX216A1B
Green and Gold        6 CX03744B CX15885A CX1E9965 CXFBE646 CXFBC34D CXF69F44
Unwrapping My Gifts   5 CX2A5B53 CX5EB69D CXECEBF1 CXD34250 CX5F1326
Christmas Wrapping    6 CX237249 CX3B885C CXE5D6B5 CXE3CD8E CXDA111E CXC00214
Christmas Wedding     6 CX325C39 CX9C9549 CXDBAA46 CXFFE9D9 CXFF4A4A CXDB2121
Real Christmas Tree   6 CX779645 CX497542 CX274530 CX6E3C3B CXBF403B CXEDB361
;
 
%put &AllPalettes;   /* check that all colors are here */

The following sections implement the visualization in three stages: an initial visualization that uses a scatter plot, an improved visualization that uses rectangular polygons, and a final visualization.

An initial visualization

After running this DATA step, the Palettes data set contains all the information you need to visualize the colors in each palette. In fact, you can use a scatter plot to visualize the data and make sure it was read correctly. The following call to PROC SGPLOT creates a graph for which each row represents a palette of colors. Recall that the STYLEATTRS statement supports two different options for specifying how colors are assigned to groups when you use the GROUP= option:

  • The DATACOLORS= option specifies the fill color for bars and polygons.
  • The DATACONTRASTCOLORS= option specifies the colors for lines and markers.

For a scatter plot, you can use the DATACONTRASTCOLORS= option to assign group colors, as follows:

%let gray   = CXF8F8F8;
title "Initial Visualization of Christmas Palettes";
proc sgplot data=palettes noautolegend noborder;
   styleattrs backcolor=&gray wallcolor=&gray datacontrastcolors=(&AllPalettes);
   scatter x=col y=row / group=ID markerattrs=(symbol=SquareFilled size=45);
   text x=col y=row text=color;
run;

This is not a professional-looking graph, but it does provide a quick-and-easy way to visualize the colors. Each row is a palette of colors. The colors are labeled by using their hexadecimal values. In the next section, the markers are replaced by rectangular polygons.

A better visualization of palettes

In the data set, the (col, row) coordinates are the center of each polygon. The following DATA step creates new variables (x, y) for the vertices of rectangles:

/* 4. Create a polygon for each color. 
      The i_th color for the j_th palette is represented by a rectangle 
      that is centered at the coordinate (i,j). */
data Poly;
set Palettes;
w = 0.5; h = 0.48;                /* width and height of rectangle */
x = col - w; y = row - h; output; /* four vertices of square */
x = col + w; y = row - h; output;
x = col + w; y = row + h; output;
x = col - w; y = row + h; output;
drop w h;
run;

With that minor change, you can use a POLYGON statement to display colored rectangles. (Be sure to use the DATACOLORS= option on the STYLEATTRS statement.) The resulting visualization looks much better:

%let gray   = CXF8F8F8;
title "Second Visualization of Christmas Palettes";
title2 "A Polygon Plot";
proc sgplot data=Poly noautolegend;
      styleattrs wallcolor=&gray datacolors=(&AllPalettes);
   polygon x=x y=y ID=ID / group=ID fill;
   text x=col y=row text=color;
run;

The final visualization

Although both previous graphs show the palettes of colors, you can improve the graphs by making minor changes:

  • Because each rectangular polygon is defined by using four rows in the data, the hexadecimal value of the colors is plotted four times. You can see in the previous graphs that the color values appear to be boldfaced. You only need to plot the value once for each value of the ID variable.
  • Add the name of the palette to the graph.
  • Do not display the axes. They are not necessary.
/* 5. Optionally, use the TEXT statement to display the name of each palette. */
data Temp;
set Poly;
by ID;
if ^first.ID then  color=" ";
run;
 
/* 6. Optionally, use the TEXT statement to overlay the hex values of the colors */
data PlotPalettes;
set Temp;
by Name notsorted;
label = Name;
if first.Name then do;
   lx = col - 0.6;    ly = row;
end;
else do;
   label=" ";  lx = .;  ly = .;
end;
run;
 
/* 7. Use the POLYGON and TEXT statements to display the palettes of colors. */
%let gray   = CXF8F8F8;
title "Visualization of Christmas Palettes";
proc sgplot data=PlotPalettes noautolegend noborder;
   styleattrs backcolor=&gray wallcolor=&gray datacolors=(&AllPalettes);
   polygon x=x y=y ID=ID / group=ID fill;
   text x=col y=row text=color;
   text x=lx y=ly text=label / position=left textattrs=(size=12);
   xaxis display=none offsetmax=0;
   yaxis display=none thresholdmin=0 thresholdmax=0;
run;

The final graph is shown at the top of this article. It enables you to see that most of the palettes are green-to-red and include either a golden or creamy-white color. Three palettes are different from the others:

  • The "Real Christmas Tree" palette ends with gold instead of red.
  • The "Green and Gold" palette does not contain any red.
  • The "Holiday Red and Gold" palette starts with red and ends with green.

Summary

This article shows how to use a polygon plot to visualize a set of Christmas-themed color palettes. By using the STYLEATTRS statement in PROC SGPLOT, you can specify how colors are assigned to groups when you use the GROUP= option. You can use the DATACOLORS= option to specify the fill color for polygons. You can use the POLYGON statement and the GROUP=ID option to assign each color to a rectangular polygon, thus visualizing each palette as a strip of colors.

The post Visualize palettes of colors in SAS appeared first on The DO Loop.

11月 152021
 

Given a cloud of points in the plane, it can be useful to identify the convex hull of the points. The convex hull is the smallest convex set that contains the observations. For a finite set of points, it is a convex polygon that has some of the points as its vertices. An example of a convex hull is shown to the right. The convex hull is the polygon that encloses the points.

This article describes the CVEXHULL function in the SAS/IML language, which computes the convex hull for a set of planar points.

How to compute a 2-D convex hull in SAS

The CVEXHULL function takes a N x 2 data matrix. Each row of the matrix is the (x,y) coordinates for a point. The CVEXHULL function returns a column vector of length N.

  • The first few elements are positive integers. They represent the rows of the data matrix that form the convex hull.
  • The positive integers are sorted so that you can visualize the convex hull by connecting the points in order.
  • The remaining elements are negative integers. The absolute values of these integers represent the rows of the data matrix that are contained in the convex hull or are on the boundary of the convex hull but are not vertices.

The following example comes from the SAS/IML documentation. The data matrix contains 18 points. Of those, six are the vertices of the convex hull. The output of the CVEXHULL function is shown below:

proc iml; 
points = {0  2, 0.5 2, 1 2, 0.5 1, 0 0, 0.5 0, 1  0, 
          2 -1,   2 0, 2 1,   3 0, 4 1,   4 0, 4 -1, 
          5  2,   5 1, 5 0,   6 0 }; 
 
/* Find the convex hull:
   - indices on the convex hull are positive
   - the indices for the convex hull are listed first, in sequential order
   - interior indices are negative
 */
Indices = cvexhull( points ); 
reset wide;
print (Indices`)[L="Indices"];

I have highlighted the first six elements. The indices tell you that the convex hull is formed by using the 1st, 5th, 8th, 14th, 18th, and 15th points of the data matrix. You can use the LOC statement to find the positive values in the Indices vector. You can use those values to extract the points on the convex hull, as follows:

hullIdx = indices[loc(indices>0)];   /* the positive indices */
convexHull = points[hullIdx, ];      /* extract rows */
print hullIdx convexHull[c={'cx' 'cy'} L=""];

The output shows that the convex hull is formed by the six points (0,2), (0,0), ..., (5,2).

Visualize the convex hull

The graph at the beginning of this article shows the convex hull as a shaded polygon. The original points are overlaid on the polygon and labeled by the observation number. The six points that form the convex hull are colored red. This section shows how to create the graph.

The graph uses the POLYGON statement to visualize the convex hull. This enables you to shade the interior of the convex hull. If you do not need the shading, you could use a SERIES statement, but to get a closed polygon you would need to add the first point to the end of the list of vertices.

To create the graph, you must write the relevant information to a SAS data set so that you can use PROC SGPLOT to create the graph. The following statements write the (x,y) coordinates of the point, the observation numbers (for the data labels), the coordinates of the convex hull vertices (cx, cy), and an ID variable, which is required to use the POLYGON statement. It also creates a binary indicator variable that is used to color-code the markers in the scatter plot:

x = points[,1]; y = points[,2]; 
obsNum = t(1:nrow(points));  /* optional: use observation numbers for labels */
/* The points on the convex hull are sorted in counterclockwise order.
   If you use a series plot, you must repeat the first point 
   so that the polygon is closed. For example, use
   convexHull = convexHull // convexHull[1,]; */
cx = convexHull[,1]; cy = convexHull[,2]; 
ID = j(nrow(cx),1,1);        /* create ID variable for POLYGON statement */
/* create a binary (0/1) indicator variable */ 
OnHull = j(nrow(x), 1, 0);   /* most points NOT vertices of the convex hull */
OnHull[hullIdx] = 1;         /* these points are the vertices */
 
create CHull var {'x' 'y' 'cx' 'cy' 'ID' 'obsNum' 'OnHull'};
append;
close;
QUIT;

In the graph at the top of this article, vertices of the convex hull are colored red and the other points are blue. When you use the GROUP= option in PROC SGPLOT statements, the group colors might depend on the order of the observations in the data. To ensure that the colors are consistent regardless of the order of the data set, you can use a discrete attribute map to associate colors and values of the grouping variable. For details about using a discrete attribute maps, see Kuhfeld's 2016 article.

To use a discrete attribute map, you need to define it in a SAS data set, read it by using the DATTRMAP= option on the PROC SGPLOT statement, and specify it by using the ATTRID= statement on the SCATTER statement, as follows:

data DAttrs;                             /* use DATTRMAP=<data set name> */
length MarkerStyleElement $11.;
ID = "HullAttr";                         /* use ATTRID=<ID value> */
Value = 0; MarkerStyleElement = "GraphData1"; output; /* 0 ==> 1st color */
Value = 1; MarkerStyleElement = "GraphData2"; output; /* 1 ==> 2nd color */
run;
 
title "Points and Convex Hull";
proc sgplot data=CHull DATTRMAP=DAttrs;
   polygon x=cx y=cy ID=ID / fill outline lineattrs=GraphData2;
   scatter x=x y=y / datalabel=obsNum group=OnHull 
                     markerattrs=(symbol=CircleFilled) ATTRID=HullAttr;
run;

The graph is shown at the top of this article. Notice that the points (0.5, 2) and (1, 2) are on the boundary of the convex hull, but they are drawn in blue because they are not vertices of the polygon.

Summary

In summary, you can compute a 2-D convex hull by using the CVEXHULL function in SAS/IML software. The output is a set of indices, which you can use to extract the vertices of the convex hull and to color-code markers in a scatter plot.

By the way, there is a hidden message in the graph of the convex hull. Can you see it? It has been hiding in the SAS/IML documentation for more than 20 years.

In closing, I'll mention that a 2-D convex hull is one computation in the general field of computational geometry. The SAS/IML group is working to add additional functionality to the language, including convex hulls in higher dimensions. In your work, do you have specific needs for results in computational geometry? If so, let me know the details in the comments.

The post Two-dimensional convex hulls in SAS appeared first on The DO Loop.

11月 102021
 

I was recently asked how to create a frequency polygon in SAS. A frequency polygon is an alternative to a histogram that shows similar information about the distribution of univariate data. It is the piecewise linear curve formed by connecting the midpoints of the tops of the bins. The graph to the right shows a histogram and a frequency polygon for the same data. This article shows how to create a frequency polygon in SAS.

In practice, frequency polygons are not used as often as histograms are, but they are useful pedagogical tools for teaching the fundamentals of density estimation. The histogram is an estimate of the density of univariate data, but it is a bar chart. Accordingly, it looks different from density estimate curves, such as parametric densities and kernel density estimates. The frequency polygon shows the same information as a histogram but displays the information as a line plot. Therefore, you can more easily compare the frequency polygon curve and other density estimate curves.

A frequency polygon is also a good way to introduce the ideas behind a cumulative distribution. An ogive is a graph of the cumulative sum of the vertical coordinates of the frequency polygon. The ogive approximates the cumulative distribution in the same way that the frequency polygon approximates the density.

Create a frequency polygon in SAS

You can use the UNIVARIATE procedure in SAS to generate the points for a frequency polygon. You can use the OUTHIST= option to specify a data set that contains the counts for each bar in the histogram. The midpoints of the histogram bins are contained in the _MIDPT_ variable. The count in each bin is contained in the _COUNT_ variable.

If you do not like the default width of the histogram bins, you can use the MIDPOINTS= option to specify your own set of midpoints. For example, the following statements create a histogram for the EngineSize variable in the Sashelp.Cars data set. You can use the SERIES statement in PROC SGPLOT to create a line plot that displays the vertical height of each histogram bar, as follows:

proc univariate data=sashelp.cars(keep=EngineSize);
   var EngineSize;
   histogram / outhist=OutHist grid vscale=count
               midpoints=(1.4 to 8.4 by 0.4); /* use midpoints= option to specify midpoints */
run;
 
/* optionally, print the OutHist data */
/* proc print data=OutHist; run;      */
 
title "Frequency Polygon";
proc sgplot data=OutHist;
   series x=_MIDPT_ y=_COUNT_ / markers;
   yaxis grid values=(0 to 80 by 20) label="Count" offsetmin=0;
   xaxis grid values=(1.4 to 8.4 by 0.4) label="Engine Size (L)";
run;

The frequency polygon is shown. Like a histogram, the shape of the frequency polygon depends on the bin width and anchor position. You can change those values by using the MIDPOINTS= option.

Overlay a frequency polygon and a kernel density estimate

As I mentioned earlier, an advantage of the frequency polygon is that it is a curve, not a bar chart. As such, it is easier to compare to other density estimates curves. In PROC UNIVARIATE, you can use the KERNEL option to overlay a kernel density curve on a histogram. You can use the OUTKERNEL= option to write the kernel density estimate to a data set. You can then overlay and compare the frequency curve (a crude histogram-based estimate) and the kernel density estimate, as follows:

proc univariate data=sashelp.cars(keep=EngineSize);
   var EngineSize;
   histogram / outhist=OutHist grid vscale=count
               kernel outkernel=OutKer
               midpoints=(1.4 to 8.4 by 0.4); /* use midpoints= option to specify midpoints */
   ods select Moments Histogram;
run;
 
data Density;  /* combine the estimates */
set OutHist OutKer(rename=(_Count_=KerCount));
run;
 
title "Frequency Polygon and Kernel Density Estimate";
proc sgplot data=Density;
   series x=_MIDPT_ y=_COUNT_ / legendlabel="Frequency Polygon";
   series x=_VALUE_ y=KerCount / legendlabel="Kernel Density Estimate";
   yaxis offsetmin=0 grid values=(0 to 80 by 20) label="Estimated Count";
   xaxis label="Engine Size (L)";
run;

As shown in the graph, a kernel density estimate is a smoother version of the frequency polygon.

Summary

This article shows how to create a graph of the frequency polygon in SAS. A frequency polygon is a piecewise linear curve formed by connecting the midpoints of the tops of the bars in a histogram. The frequency polygon is a curve, so it is easier to compare it with other parametric or nonparametric density estimates.

One final remark: I don't like the name "frequency polygon." A polygon is a closed planar region formed by connecting a set of points and then connecting the first and last points. The density estimate in this article is not closed. I would prefer a term such as "frequency polyline" or "frequency curve," but "polygon" seems to be the standard term that appears in introductory statistics textbooks.

The post Create a frequency polygon in SAS appeared first on The DO Loop.

11月 032021
 

A SAS programmer asked whether it is possible to add reference lines to the categorical axis of a bar chart. The answer is yes. You can use the VBAR statement, but I prefer to use the VBARBASIC (or VBARPARM) statement, which enables you to overlay a wide variety of graphs on a bar chart. I have previously written about using the VBARBASIC statement to overlay graphs on bar charts. The VBARBASIC chart is compatible with more graphs than the VBAR chart. See the documentation for a complete discussion of "compatible" plot types.

This article shows two ways to overlay a reference line on the categorical axis of a bar chart. But the SAS programmer wanted more. He wanted to create a bar for each day of the year. That is a lot of bars! For bar charts that have many bars, I recommend using the NEEDLE statement to create a needle plot. The second part of this article demonstrates a needle plot and overlays reference lines for certain holidays.

For simplicity, this article discusses only vertical bar charts, but all programs can be adapted to display horizontal bar charts.

Reference lines and bar charts that use the VBAR statement

First, to be clear, you can easily add horizontal reference lines to a vertical bar chart. This is straightforward. The programmer wanted to add vertical reference lines to the categorical axis, as shown in the graph to the right. In this graph, reference lines are added behind the bars for Age=12 and Age=14. I made the bars semi-transparent so that the full reference lines are visible.

As the SAS programmer discovered, the following attempt to add reference lines does not display any reference lines:

title "Bar Chart with Reference Line on Categorical Axis";
proc sgplot data=Sashelp.Class;
  refline 12 14 / axis=x lineattrs=(color=red); /* DOES NOT WORK */
  vbar Age / response=Weight transparency=0.2;
run;

Why don't the reference lines appear? As I have previously written, you must specify the formatted values for a categorical axis. This is mentioned in the documentation for the REFLINE statement, which states that "unformatted numeric values do not map to a formatted discrete axis. For example, if reference lines are drawn at points on a discrete X axis, the REFLINE values must be the formatted value that appears on the X axis." In other words, you must change the REFLINE values to be "the formatted values," which are '12' and '14'. The following call to PROC SGPLOT displays the vertical reference lines:

proc sgplot data=Sashelp.Class;
  refline '12' '14' / axis=x lineattrs=(color=red); /* YES! THIS WORKS! */
  vbar Age / response=Weight transparency=0.2;
run;

The reference lines are shown in the graph at the beginning of this section.

Reference lines and the VBARBASIC statement

I prefer to use the VBARBASIC statement for most bar charts. If you use the VBARBASIC statement, you can specify the raw reference values. To be honest, I am not sure why it works, but, in general, the VBARBASIC statement is better when you need to overlay a bar chart and other graphical elements. If you use the VBARBASIC statement, the natural syntax works as expected:

proc sgplot data=All;
  refline 12 14 / axis=x lineattrs=(color=red);   /* THIS WORKS, TOO! */
  vbarbasic Age / response=Weight transparency=0.2;
run;

The graph is the same as shown in the previous section.

Reference lines for holidays on a graph of sales by date

This section discusses an example that has hundreds of bars. Suppose you want to display a bar chart for sales by date for an entire year. For data like these, I have two recommendations:

  1. Do not use a vertical bar chart. Even if each bar requires only three pixels, the chart will be more than 3*365 ≈ 1,100 pixels wide. On a monitor that displays 72 pixels per inch, this graph would be about 40 cm (15.3 inches) wide. A better choice is to use a needle plot, which is essentially a bar chart where each bar is represented as a vertical line.
  2. The horizontal axis cannot be discrete. If it is, you will get 365 dates printed along the axis. Instead, you want to use the XAXIS TYPE=TIME option to display the bars along an axis where tick marks are placed according to months, not days. (If the categories are not dates but are "days since the beginning," you can use the XAXIS TYPE=LINEAR option instead.)

Recall that the SAS programmer wanted to display holidays on the graph of sales for each day. Rather than specify the holidays on the REFLINE statement (for example, '01JAN2003'd '25DEC2003'd), it is more convenient to put the reference line values into a SAS data set and specify the name of the You can use the HOLIDAY function in SAS to get the date associated with major government holidays.

The following SAS DATA step extracts a year's worth of data for the sale of potato chips (in 2003) from the Sashelp.Snacks data set. These data are concatenated with a separate data set that contains the holidays that you want to display by using reference lines. A needle plot shows the daily sales and the reference lines.

data Snacks;       /* sales of potato chips for each date in 2003 */
set Sashelp.Snacks;
where '01JAN2003'd <= Date <= '31DEC2003'd AND Product="Classic potato chips";
run;
 
data Reflines;     /* holidays to overlay as reference lines */
format RefDate DATE9.;
RefDate = holiday("Christmas", 2003);       output;
RefDate = holiday("Halloween", 2003);       output;
RefDate = holiday("Memorial", 2003);        output;
RefDate = holiday("NewYear", 2003);         output;
RefDate = holiday("Thanksgiving", 2003);    output;
RefDate = holiday("USIndependence", 2003);  output;
RefDate = holiday("Valentines", 2003);      output;
run;
 
data All;          /* concatentate the data and reference lines */
set Snacks RefLines;
run;
 
title "Sales and US Holidays";
title2 "Needle Plot";
proc sgplot data=All;
  refline RefDate / axis=x lineattrs=(color=red);
  needle x=Date y=QtySold;
run;

Notice that you do not have to use the XAXIS TYPE=TIME option with the NEEDLE statement. The SGPLOT procedure uses TYPE=TIME option by default when the X variable has a time, date, or datetime format. If you decide to use the VBARBASIC statement, you should include the XAXIS TYPE=TIME statement.

Summary

In summary, this article shows how to add vertical reference lines to a vertical bar chart. You can use the VBAR statement and specify the formatted reference values, but I prefer to use the VBARBASIC statement whenever I want to overlay a bar chart and other graphical elements. You can also use a needle plot, which is especially helpful when you need to display 100 or more bars.

The post Add reference lines to a bar chart in SAS appeared first on The DO Loop.

9月 272021
 

Graphing data is almost always more informative than displaying a table of summary statistics. In a recent article about "dynamite plots," I briefly mentioned that graphs such as box plots and strip plots are better at showing data than graphs that merely show the mean and standard deviation. This article expands on that idea.

It is well known that summary statistics (means, standard deviations, ...) do not uniquely define a data distribution. You might be familiar with Anscombe Quartet, which is a collection of four bivariate data sets for which the variables have the same means, standard deviations, regression lines, and correlation. However, when you graph the data, you see that the four data sets are radically different from each other.

This article shows a similar (but simpler) idea. The article presents multiple examples of two groups. In each case, the mean of one group of 5 units higher than the mean of the other group. By graphing the data, you can see HOW the groups differ. Knowing HOW two groups differ can be important to decision-makers who base their decisions on data.

Example: An enrichment program leads to higher test scores

Suppose that a school district is considering a new educational enrichment program. Advocates for the program claim that it increases the mean standardized test scores of students by five points. Sounds great, right? However, the mean statistic uses one number to summarize a distribution. If the school board looks only at the mean scores, they might miss important details in the data. For example:

  • Do most or all students benefit from the program?
  • Is the program biased against gender? Do boys and girls benefit equally?
  • Is the program biased against socioeconomic factors? Do students in high-performing schools benefit more than students in low-performing schools?

This article uses an artificial set of hypothetical scores (N=20) to discuss and visualize these ideas. In each example, the mean scores after the enrichment are five points higher than before. But the way that the scores increase is different for each example.

The data

Let's create some fake data. Assume that a set of students take a test, then take the enrichment program, then retake the test. The following SAS DATA step defines the data. A call to PROC SGPLOT visualizes the distribution of the test scores before the enrichment program. I use a strip plot overlayed on a box plot to visualize these data.

data Pre(drop=i);
Test = "Pre-Test ";
retain ID;
input School Sex $ @;
do i = 1 to 5;
   input Score @;
   ID + 1;
   output;
end;
datalines;
1 M  510 510 500 500 450 
1 F  530 520 490 480 480 
2 M  470 470 400 390 390 
2 F  470 460 440 410 380 
;
 
/* see https://blogs.sas.com/content/iml/2016/06/13/overlay-box-plot-in-sas-discrete.html */
ods graphics / width=240px height=360px;
title "Initial Scores";
proc sgplot data=Pre noautolegend ;
   vbox Score / category=Test;
   scatter y=Score x=Test / jitter transparency=0.5 group=Sex
            markerattrs=(symbol=CircleFilled size=12);
   xaxis discreteorder=data display=(nolabel);
run;

These example data are not realistic. In practice, there would probably be a control group and an experimental group. Also, N=20 is a very small sample. Nevertheless, these data suffice to illustrate some important facts about visualizing the differences between two groups.

Scenario 1: All students benefit

One possible scenario is that all students benefit equally by participating in the enrichment program. In reality, there will be individual variations among the students, with some benefitting more than others. However, let's assume that every student scores exactly 5 points higher on the post-program test:

title "Distribution of Scores";
title2 "All Students Benefit";
data Post;
set Pre;
Test = "Post-Test";
Score = Score + 5;
run;

The following SAS statements combine the pre- and post-program scores and use a box plot to visualize the distribution of the scores. Because this is a matched-pair study, you can draw lines that connect the scores of the same student. I encapsulate the visualization code into a macro so that I can easily reuse it for other scenarios.

%macro CombineAndViz();
data PrePost;
   set Pre Post;
run;
 
proc sgplot data=PrePost noautolegend ;
   vbox Score / category=Test;
   series y=Score x=Test / group=ID break lineattrs=GraphData1 grouplc=Sex transparency=0.7;
   scatter y=Score x=Test / jitter transparency=0.5 group=Sex
            markerattrs=(symbol=CircleFilled size=12);
   xaxis discreteorder=data display=(nolabel);
run;
%mend;
 
ods graphics / width=360px height=360px;
%CombineAndViz;

The graph shows how student scores change after the enrichment program. In this case, every student benefits. Light red lines connect the scores of girls; light blue lines connect the scores of boys.

Statisticians know that "an average increase of five points" does not mean that all students increase their scores by five points. However, news articles sometimes forget to use the words "on average," especially in headlines. Consciously or unconsciously, many people think of this scenario when they read that "a program increases tests scores by five points."

Scenario 2: Only some schools benefit

Another mathematical way to obtain an average increase of five points is if half the scores increase by 10 points and the other half do not change. In this example, half the students are from a high-performing school. Perhaps that school has more resources and more college-bound students. Does the enrichment program benefit only those students? If so, the pre/post scores could look like the following:

title2 "Only Some Students Benefit";
data Post;
set Pre;
Test = "Post-Test";
if School=1 then 
   Score = Score + 10;
run;
 
%CombineAndViz;

The box plot alone does not do an adequate job of showing the bias in the program. Only by connecting the pre- and post-program scores is it apparent that half the students benefit. And you can see in the graph that the students who benefit are those who already had high test scores before the program.

Scenario 3: Only boys benefit

In a similar way, half the students are boys and half are girls. Does the enrichment program only benefit one gender and not the other? If so, the pre/post scores could look like the following:

title2 "Only Boys Benefit";
data Post;
set Pre;
Test = "Post-Test";
if Sex='M' then 
   Score = Score + 10;
run;
 
%CombineAndViz;

In the graph, the light red lines (girls) are flat whereas the light blue lines (boys) show a positive slope. The post-program scores are, on average, five points higher, but not all students benefit from the program.

Scenario 4: Some students benefit, others fall behind

An even worse scenario is a program that actually hinders student learning for a group of students even as it benefits others. For example, a program that relies on technology might work well in a school that has the infrastructure (laptops, wireless connectivity) to support the program, but might fail in a school that lacks the infrastructure. Such a scenario might look like the following:

title2 "Some Students Benefit, Others Fall Further Behind";
data Post;
set Pre;
Test = "Post-Test";
if School=1 then 
   Score = Score + 15;
else 
   Score = Score - 5;
run;
 
%CombineAndViz;

In this scenario, students in the first school benefit from the program whereas students in the second school fall further behind their peers.

Summary

A graph of the distribution of data can illustrate nuances in the data that a table of statistics cannot. This article shows four examples where the mean difference between group scores is five points. However, that statistic does not indicate HOW the groups differ. This article shows how to overlay a strip plot on a box plot in order to visualize the differences between the groups. For another strip plot example, see "Create a strip plot in SAS."

The post Why you should visualize distributions instead of report means appeared first on The DO Loop.

9月 092021
 

A statistical programmer asked how to simulate event-trials data for groups. The subjects in each group have a different probability of experiencing the event. This article describes one way to simulate this scenario. The simulation is similar to simulating from a mixture distribution. This article also shows three different ways to visualize the results of the simulation.

Modeling the data-generation process

A simulation is supposed to reproduce a data-generation process. Typically, a simulation is based on some real data. You (the programmer) need to ensure that the simulation reflects how the real data are generated. For example, there are often differences between simulating a designed experiment or simulating an observational study:

  • In a designed experiment, the number of subjects in each group is determined by the researcher. For example, a researcher might choose 100 subjects, 50 men and 50 women. If so, each simulated data sample should also contain 50 males and 50 females.
  • In an observational study, the number of subjects in each group is a random variable that is based on the proportion of the population in each group. For example, a researcher might choose 100 subjects at random. In the data, there might be 53 men and 47 women, but that split is the result of random chance. Each simulated data sample should generate the number of males and females according to a random binomial variable. Some samples might have a 52:48 split, others a 49:51 split, and so on. In general, the group sizes are simulated from a multinomial distribution.

For other ways to choose the number of subjects in categories, see the article, "Simulate contingency tables with fixed row and column sums in SAS."

Data that specify events and trials

Suppose you have the following data from a pilot observational study. Subjects are classified into six groups based on two factors. One factor has three levels (for example, political party) and another has two levels (for example, sex). For each group, you know the number of subjects (Ni) and the number of events (Ei). You can therefore estimate the proportion of events in the i_th group as Ei / Ni. Because you know the total number of subjects (Sum(N)), you can also estimate the probability that an individual belongs to each group (πi = Ni / Sum(N)).

The following SAS DATA step defines the proportions for each group:

%let NTotal = 107;
data Events;
length Group $ 3;
input Cat1 $ Cat2 $ Events N;
Group = catx(" ", Cat1, Cat2); /* combine two factors into group ID */
p  = Events / N;               /* estimate P(event | Group[i]) */
pi = N / &NTotal;              /* estimate P(subject in Group[i]) */
drop Cat1 Cat2;
format p pi Best5.;
datalines;
A  F  1  10
A  M  8  24
B  F  4  13
B  M  12 36
C  F  12 16
C  M  6  8
;
 
proc print data=Events; run;

This pilot study has 107 subjects. The first group ("A F") contained 10 subjects, which is π1 = 9.3% of the total. In the first group, one person experienced the event, which is p1 = 10% of the group. The other rows of the table are similar. Some people call this "event-trials" data because the quantity of interest is the proportion of events that occurred in the subjects. The next section shows how to use these estimates to simulate new data samples.

Simulate group sizes and proportions in groups

One of the advantages of simulation studies is that you can take preliminary data from a pilot study and "scale it up" to create larger samples, assuming that the statistics from the pilot study are representative of the parameters in the population. For example, the pilot study involved 107 subjects, but you can easily simulate samples of size 250 or 1000 or more.

The following SAS/IML program simulates samples of size 250. The statistics from the pilot study are used as parameters in the simulation. For example, the simulation assumes that 0.093 of the population belongs to the first group, and that 10% of the subjects in the first group experience the event of interest. Notice that some groups will have a small number of subjects (such as Group 1 and Group 5, which have small values for π). Among the groups, some will have a small proportion of events (Group 1) whereas others will have a large proportion (Group 5 and Group 6).

The following simulation uses the following features:

  • The RandMultinomial function in SAS/IML generates a random set of samples sizes based on the π vector, which estimates the proportion of the population that belongs to each group.
  • For each group, you can use the binomial distribution to randomly generate the number of events in the group.
  • The proportion of events is the ratio (Number of Events) / (Group Size).
/* simulate proportions of events in groups */
proc iml;
use Events;
read all var _ALL_;  /* creates vectors Groups, Events, N, p, pi, etc */
close;
numGroups = nrow(Group);
 
call randseed(12345);
nTrials = 250;       /* simulate data for a study that contains 250 subjects */
 
Size = T( RandMultinomial(1, nTrials, pi) ); /* new random group sizes */
nEvents = j(numGroups, 1, .);                /* vector for number of events */
do i = 1 to nrow(nEvents);
   nEvents[i] = randfun(1, "Binomial", p[i], Size[i]);
end;
Proportion = nEvents / Size;                 /* proportion of events in each group */
 
results = Size || nEvents || Proportion;
print results[r=Group c={'Size' 'nEvents' 'Proportion'} F=Best5.];

The table shows one random sample of size 250 based on the statistics in the smaller pilot study. The group size and the number of events are both random variables. If you rerun this simulation, the number of subjects in the groups will change, as will the proportion of subjects in each group that experience the event. It would be trivial to adapt the program to handle a designed experiment in which the Size vector is constant.

Simulating multiple samples

An important property of a Monte Carlo simulation study is that it reveals the distribution of statistics that can possibly result in a future data sample. The table in the previous section shows one possible set of statistics, but we can run the simulation additional times to generate hundreds or thousands of additional variations. The following program generates 1,000 possible random samples and outputs the results to a SAS data set. You can then graph the distribution of the statistics. This section uses box plots to visualize the distribution of the proportion of events in each group:

/* simulate 1000 random draws under this model */
SampleID = j(numGroups, 1, .);
nEvents = j(numGroups, 1, .);                /* vector for number of events */
create Sim var {'SampleID' 'Group' 'Size' 'nEvents' 'Proportion'};
do ID = 1 to 1000;
   /* assign all elements the same value:
      https://blogs.sas.com/content/iml/2013/02/18/empty-subscript.html */
   SampleID[,] = ID;
   Size = T( RandMultinomial(1, nTrials, pi) );  /* new random group sizes */
   do i = 1 to nrow(nEvents);
      nEvents[i] = randfun(1, "Binomial", p[i], Size[i]);
   end;
   Proportion = nEvents / Size;
   append;
end;
close Sim;
QUIT;
 
data PlotAll;
set Sim Events(keep=Group p);
run;
 
title "Proportion of Events in Each Group";
title2 "Simulated N=250; NumSim=1000";
/* box plot */
proc sgplot data=PlotAll noautolegend;
   hbox Proportion / category=Group;
   scatter y=Group x=p / markerattrs=GraphData2(symbol=CircleFilled size=11);
   yaxis type=discrete display=(nolabel); /* create categorical axis */
   xaxis grid; 
run;

The box plot shows the distribution of the proportion of events in each group. For example, the proportion in the first group ("A F"), ranges from a low of 0% to a high of 33%. The proportion in the sixth group ("C M"), ranges from a low of 22% to a high of 100%. The red markers are the parameter values used in the simulation. These are the estimates from the pilot study.

Alternative ways to visualize the distribution within groups

The box plot shows a schematic representation of the distribution of proportions within each group. However, there are alternative ways to visualize the distributions. One way is to use a strip plot, as follows:

/* strip plot */
proc sgplot data=PlotAll noautolegend;
   scatter y=Group x=Proportion / 
           jitter transparency=0.85       /* handle overplotting */
           markerattrs=(symbol=CircleFilled);
   scatter y=Group x=p / markerattrs=(symbol=CircleFilled size=11);
   yaxis type=discrete reverse display=(nolabel); /* create categorical axis */
   xaxis grid; 
run;

The strip plot uses a semi-transparent marker to display each statistic. (Again, the red markers indicate the parameters for the simulation.) The density of the distribution is visualized by the width of the strips and by the darkness of the plot, which is due to overplotting the semi-transparent markers.

This plot makes it apparent that the variation between groups is based on the relative size of the groups in the pilot study. Large groups such as Group 4 ("B M") have less variation than small groups such as Group 6 ("C M"). That's because the standard error of the proportion is inversely proportional to the square root of the sample size. Thus, smaller groups have a wider distribution than the larger groups.

You can see the same features in a panel of histograms. In the following visualization, the distribution of the proportions is shown by using a histogram. Red vertical lines indicate the parameters for the simulation. This graph might be easier to explain to non-statistician.
/* plot of histograms */
proc sgpanel data=PlotAll;
   panelby Group / onepanel layout=rowlattice novarname;
   histogram Proportion;
   refline p / axis=x lineattrs=GraphData2(thickness=2);
   colaxis grid; 
run;

Summary

This article shows how to simulate event-trials data in SAS. In this example, the data belong to six different groups, and the probability of experiencing the event varies between groups. The groups are also different sizes. In the simulation, the group size and the number of events in each group are random variables. This article also shows three different ways to visualize the results of the simulation: a box plot, a strip plot, and a panel of histograms.

The post Simulate proportions for groups appeared first on The DO Loop.

9月 072021
 

A colleague spent a lot of time creating a panel of graphs to summarize some data. She did not use SAS software to create the graph, but I used SAS to create a simplified version of her graph, which is shown to the right. (The colors are from her graph.) The graph displays a panel of bar charts with uncertainty intervals. The height of the bar is the mean value of the response. I don't know which statistic she used for the "error bars." It could be one of three common statistics that visualize uncertainty: the standard deviation of the data, the standard error of the mean, or a 95% confidence interval for the mean.

Each cell in the panel shows a set of "dynamite plots," which is a common name for those bar charts with error bars. Data visualization experts generally agree that dynamite plots are not the best way to summarize a data distribution. Alternatives that show the distribution of the data include box plots, strip plots, and violin plots.

Even if you choose to present summarized data instead of the raw data, the dynamite plot is not the best choice. If you adhere to Tufte's advice to maximize the data-to-ink ratio, a simple dot plot with error bars conveys the same information with less ink.

This article discusses a makeover of the panel. In particular, here are some best practices for remaking this graph:

  • Replace the dynamite plots with a simple dot plot.
  • Exchange the axes so that the group labels can be written horizontally and the response variable is plotted along the X axis.
  • Do the colors improve or detract from the visualization? Try plotting the data with and without using colored bars.

Moving from bar charts to dot plots

Bar charts are best for representing frequencies and percentages. A dot plot is a better way to display the means and error bars for data. In the SGPLOT procedure in SAS, the DOT statement will summarize the data and visualize the summary statistics. However, I do not have access to the raw data, so I will use a SCATTER statement to create a panel of dot plots.

The following data step creates some data that are similar to the results that my colleague presented. I use PROC FORMAT to create a SAS format for the categories and groups. This enable readers to reuse my code for their own visualizations, and it keeps my colleague's data private:

proc format;
value CatFmt
      1 = "Category 1"   2 = "Category 2" 
      3 = "Category 3"   4 = "Category 4";
value GroupFmt
      1 = "Group 1"  2 = "Group 2"  3 = "Control";
run;
 
data Have;
format Category CatFmt. Group GroupFmt.;
input Category Group Mean W;
Lower = Mean - W;
Upper = Mean + W;
datalines;
1 1 35 2
1 2 55 5
1 3 70 9
2 1 55 10
2 2 30 7
2 3 10 4
3 1 5.5  3
3 2 8.2  4
3 3 7.2  1
4 1 24.5 7
4 2 27   15
4 3 12   11
;

The following makeover uses the following features of SAS graphics:

  • You can use PROC SGPANEL to create the panel of plots. The PANELBY statement specifies the categorical variable to use for the cells. You can use options on this statement to control the layout for the cells and the appearance of the cell headers.
  • The SCATTER statement plots the mean within each category. Use the GROUP= option to offset the means for each group. Use the XERRORLOWER= and XERRORUPPER= options to plot error bars. This plot uses the default colors for distinguishing groups. The colors are determined by the current ODS style. For the HTMLBlue style, the colors are dark shades of blue, red, and green. According to my colleague, she chose the purple-pink-yellow palette of colors "for aesthetic reasons." In a subsequent section, I show how to override the default colors for groups.
  • The COLAXIS and ROWAXIS statements are equivalent to the XAXIS and YAXIS statements in PROC SGPLOT. They enable you to add grid lines, control labels and tick marks, and modify axis-related properties.

The following call to PROC SGPANEL shows one way to remake my colleague's graph as a panel of dot plots:

ds graphics / width=480px height=400px;
title "Paneled Makeover Plot";
title2 "Horizontal Dot Plot with Error Bars";
footnote J=L "Without Colored Bands";
proc sgpanel data=Have noautolegend;
   panelby Category / novarname layout=rowlattice rows=4;
   scatter x=Mean y=Group / xerrorlower=Lower xerrorupper=Upper 
      errorbarattrs=(thickness=2) errorcapscale=0.5
      markerattrs=(symbol=CircleFilled) group=Group;
   colaxis grid display=(nolabel);
   rowaxis grid display=(nolabel) type=discrete REVERSE
           offsetmin=0.2 offsetmax=0.2;
run;

If you turn your head sideways, you can see that this graph displays exactly the same information as the original graph. However, it is easier to compare the means and intervals within and across categories. The group labels are displayed horizontally, which means that this same layout will support longer group labels without needing to rotate the text.

Furthermore, you can use the row-based layout even if there are additional categories or groups. The graph will be longer, but it will be the same width. That means that the graph will fit on one sheet of paper for many groups or categories. In contrast, the original layout gets wider as more categories are added, which might make it hard to fit the panel on a printed page in portrait mode.

Adding colored bands to a dot plot

In the previous graph, colors are used to distinguish the markers and lines for each group. This enables you to easily track the same group across different categories. The author of the original graph chose light colors for the bars, but light colors are not good choices for lines and markers. However, if those colors are important to the story that you want to tell, you can add colored bands to the background of the graph.

You can use the STYLEATTRS statement to tell the procedure what colors to use for groups. You can use the HIGHLOW statement to create colored bands, which is a tip I learned from my colleague Sanjay Matange. To use the HIGHLOW statement, you need to add new variables to the data to indicate the minimum and maximum values for the bands. The following statements create an alternative visualization:

/* set minimum and maximum values for colored bars */
data Want / view=Want;
set Have;
xmin=0; xmax=80;  
run;
 
footnote J=L "With Colored Bands";
proc sgpanel data=Want noautolegend;
   styleattrs datacolors=(Lavender Salmon LightYellow);
   panelby Category / novarname layout=rowlattice rows=4;
   /* plot colored bands in the background */
   highlow y=Group low=xmin high=xmax / group=Group type=bar groupdisplay=cluster
           barwidth=1.0 clusterwidth=0.9 transparency=0.5 fill nooutline;
   scatter x=Mean y=Group / xerrorlower=Lower xerrorupper=Upper 
      markerattrs=(symbol=CircleFilled color=Black) 
      errorbarattrs=(thickness=2 color=Black) errorcapscale=0.5;
   colaxis grid display=(nolabel);
   rowaxis grid display=(nolabel) type=discrete REVERSE
           offsetmin=0.2 offsetmax=0.2;
run;

Since the colored bands identify the groups, I used black to display the means and error bars. The information in this graph is the same as for the previous graph, but this graph emphasizes the colors more. In general, I prefer the graph without the color bands, but there might be situations in which colors enhance the presentation of the data.

Summary

In summary, this article shows how to remake a panel of dynamite plots, which are bar charts with error bars. The article shows that you can display the same information by using a dot plot with error bars. Furthermore, it is often helpful to rotate the plot so the response variable is plotted horizontally instead of vertically. This redesign can support many groups and categories because adding more categories makes the graph grow taller, not wider. Lastly, the article shows how to add colored bands in the background of a dot plot.

Appendix: Create the original graph

For completeness, the following SAS code creates the panel that is shown at the top of this article:

title "Paneled Dynamite Plot";
title2 "Vertical Bar Charts with Error Bars";
footnote;
proc sgpanel data=Have noautolegend;
   styleattrs datacolors=(CX6f6db2 CXe18dac CXe9ef7a); /* colors of original graph */
   panelby Category / novarname layout=columnlattice columns=4;
   vbarbasic  Group / response=Mean transparency=0.5 group=Group;
   scatter x=Group y=Mean / yerrorlower=Lower yerrorupper=Upper
           markerattrs=(size=0) errorbarattrs=(color=Black);
   colaxis grid display=(nolabel) fitpolicy=rotate;
   rowaxis grid display=(nolabel);
run;

The post Remaking a panel of dynamite plots appeared first on The DO Loop.