Statistical Graphics

February 1, 2023
 

SAS supports the ColorBrewer system of color palettes from the ColorBrewer website (Brewer and Harrower, 2002). The ColorBrewer color ramps are available in SAS by using the PALETTE function in SAS IML software. The PALETTE function supports all ColorBrewer palettes, but some palettes are not interpretable by people with color deficiencies (usually called "colorblind" people). Fortunately, the ColorBrewer website includes information about which palettes are colorblind-safe and which are not. You should strive to use only the colorblind-safe palettes when you create graphs that you intend to show to others.

Which palettes are colorblind-safe?

As discussed previously, the most common color vision deficiencies are the inability to distinguish red and green. Therefore, as a general rule, you should avoid using palettes that contain both red and green. The ColorBrewer palettes are split into three different types of palettes:

  • A sequential palette enables you to distinguish low and high values for an interval measurement.
  • A diverging palette enables you to compare high and low values to an average.
  • A qualitative palette enables you to visualize different values of a categorical variable, such as sex, political affiliation, or an arm of a clinical study.

For each type, the following list reveals which palettes are colorblind-safe:

  • Sequential palettes: All sequential palettes from ColorBrewer are colorblind-safe. This includes the one-hue palettes (for example, "Blues"), the two-hue palettes (for example, "OrRd," which is an orange-red palette), and the three-hue palettes (such as "YlOrRd," which is a yellow-orange-red palette).
  • Diverging palettes: Six diverging palettes are colorblind-safe, including the popular "BrBG," which is a brown-blue-green palette. However, three diverging palettes are not colorblind-safe. Avoid using "RdGy" (red-gray), "RdYlGn" (red-yellow-green), and "Spectral" (red-yellow-blue, which passes through green).
  • Qualitative palettes: Very few of the qualitative palettes are colorblind-safe. For three categories, you can use "Dark2", "Paired", and "Set2". For four categories, only "Paired" is a colorblind-safe palette. Many qualitative palettes use green as one of the colors.
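
For example, the following SAS IML statements request one colorblind-safe palette of each type. This is a minimal sketch; the palette names are taken from the lists above, and the PALETTE function call is the same one used by the wrapper function in the next section.

proc iml;
seq  = palette("YlOrRd", 5);   /* sequential: yellow-orange-red */
div  = palette("BrBG", 5);     /* diverging: brown-blue-green */
qual = palette("Dark2", 3);    /* qualitative: safe for <= 3 colors */
print seq, div, qual;
quit;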

A SAS IML module for colorblind-safe palettes

After discovering which palettes are colorblind-safe, it is easy to write a wrapper for the PALETTE function in SAS IML that adds a bit of colorblind-safe logic. The following function returns the requested palette if it is colorblind-safe. Otherwise, it returns a vector of missing values and writes an error to the log by using the PrintToLog subroutine.

/* In SAS 9.4, the macro defines a function that emulates the PrintToLog call.
   See https://blogs.sas.com/content/iml/2023/01/25/printtolog-iml.html */
%macro DefinePrintToLog;
%if %sysevalf(&sysver = 9.4) %then %do;
start PrintToLog(msg,errCode=-1);
   if      errCode=0 then prefix = "NOTE: ";
   else if errCode=1 then prefix = "WARNING: ";
   else if errCode=2 then prefix = "ERROR: ";
   else prefix = "";
   stmt = '%put ' + prefix + msg + ';';
   call execute(stmt);
finish;
%end;
%mend;
 
/* Support colorblind-safe palettes as reported by https://colorbrewer2.org/
   If the requested palette is not colorblind-safe, return missing values and log a warning. */
proc iml;
%DefinePrintToLog
 
start ColorblindSafe(name, numColors );
/* Create a denylist of palettes that are NOT colorblind-safe. 
   The first column is the name of a palette.
   The second column is 'N' if not safe for any number of colors. 
   Otherwise, '3' or '4' indicates the max number of colors for which the palette is safe. */
NotCBS = {
         RDGY     N,
         RDYLGN   N,
         SPECTRAL N,
         ACCENT   N,
         DARK2    3,  /* only numColors <= 3 */
         PAIRED   4,  /* only numColors <= 4 */
         PASTEL1  N,
         PASTEL2  N,
         SET1     N,
         SET2     3,  /* only numColors <= 3 */
         SET3     N 
         };
 
/* if the requested palette is not colorblind-safe, return a vector of missing values */
ErrPalette = j(1, numColors, .);
idx = loc(NotCBS[,1] = upcase(name));
if ncol(idx)=0 then /* name is not on the denylist. Use it! */
   return palette(name, numColors);
 
isCBS = NotCBS[idx,2];  /* get value in second column */
if isCBS='N' then do;   /* 'N' = not colorblind-safe */
   msg = cat("The palette '", strip(name), "' is not colorblind-safe.");
   call PrintToLog(msg, 1); /* WARNING */
   return ErrPalette;
end;
else do;
   n = num(isCBS);     /* '4' = colorblind-safe only for <= 4 colors */
   if numColors <= n then 
      return palette(name, numColors);
   else do;
      msg = cat("The palette '", strip(name), "' is colorblind-safe only for ",
                strip(isCBS) + " or fewer colors.");
      call PrintToLog(msg, 1); /* WARNING */
      return ErrPalette;
   end;
end;
finish;
 
/* test the ColorblindSafe function on a few examples */
RedBlue = ColorblindSafe("RdBu", 7);
Spectral = ColorblindSafe("Spectral", 5);
Pair4 = ColorblindSafe("Paired", 4);
Pair5 = ColorblindSafe("Paired", 5);
print RedBlue, Spectral, Pair4, Pair5;

The output shows that the 'Spectral' palette is not colorblind-safe. The 'Paired' palette is safe for four or fewer colors. If the caller requests a colorblind-safe palette, it is returned. Otherwise, the function returns a vector of missing values. The program also writes the following messages to the log:
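
WARNING: The palette 'Spectral' is not colorblind-safe.
WARNING: The palette 'Paired' is colorblind-safe only for 4 or fewer colors.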

The messages inform the user that some of the requested palettes are not colorblind-safe.

Summary

SAS software supports the ColorBrewer family of color palettes. Not all the palettes are colorblind-safe. Among the palettes, all sequential palettes are safe, and most of the diverging palettes are safe. However, most of the qualitative palettes are not safe. The SAS IML function in this article creates a denylist of the ColorBrewer palettes that are not colorblind-safe. The function returns a colorblind-safe palette when one is requested. However, the function returns a vector of missing values if the palette is on the denylist.

The post Colorblind-safe palettes in SAS appeared first on The DO Loop.

January 30, 2023
 

Did you know that about 8% of the world's men are colorblind? (More correctly, 8% of men are "color vision deficient," since they see colors, but not all colors.) Because of the "birthday paradox," in a room that contains eight men, the probability is about 50% that at least one is colorblind. If you know 100 people, chances are very high (about 98.7%) that at least one friend is color vision deficient.
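
As a quick check, the following DATA step computes these probabilities. It is a sketch that assumes an 8% prevalence among men, an approximate 0.5% prevalence among women, and that half of your 100 acquaintances are men.

data _null_;
   pM = 0.08;                           /* prevalence among men (from the text) */
   pF = 0.005;                          /* assumed prevalence among women */
   p8   = 1 - (1-pM)**8;                /* at least 1 of 8 men: about 0.49 */
   p100 = 1 - (1-pM)**50 * (1-pF)**50;  /* at least 1 of 100 people: about 0.99 */
   put p8= p100=;
run;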

Consequently, if you are presenting graphs to a large audience, it is important to think about how your graphs might appear to those who are colorblind. This point was recently made in a well-written article by Sarah Kate Schuhler, a student at the NC State Institute for Advanced Analytics. Schuhler points out that there are various online tools, such as a Color Blindness Simulator (CoBliS) that you can use to see how one of your graphs will appear to a person who is colorblind.

This article runs some SAS graphs through the CoBliS simulator and gives tips on how to create graphs in SAS that are interpretable by those who have color vision deficiency.

Run SAS graphs through a colorblindness simulator

For brevity, I will only look at the effect of deuteranopia, which is the most common type of color blindness. Deuteranopia is one form of "red-green color blindness" because people who have deuteranopia see green shades as red. (Another form of red-green color blindness is protanomaly, which makes red shades appear green.) There is a simple rule to make your graphs interpretable to people who have deuteranopia: Don't use red and green shades in the same graph if the colors are required to distinguish some elements (lines or markers) from others.

The next sections show several SAS graphs. Some are interpretable to someone who has deuteranopia, whereas others are not.

A colorblind-safe statistical graph

First, let's show an example of a graph that is interpretable to someone who has deuteranopia. A general rule is that the graph should not use a color ramp that includes both red and green. The following graph is from my 2022 article, "Use a heat map to visualize missing values in longitudinal data." I almost always use color palettes from the ColorBrewer web site (Brewer and Harrower, 2002). The ColorBrewer color ramps are supported in SAS by using the PALETTE function in SAS IML software. For the following heat map, the color ramp contains five colors. I used white as the lowest color of a color ramp and appended the colors for ColorBrewer's four-color "YlOrRd" (yellow-orange-red) color ramp.

The image also uses a gray color to visualize missing values. Notice that this color ramp has reds and oranges, but no greens. Consequently, I expect it to be interpretable to someone who has deuteranopia. To see how the image would appear to someone who has deuteranopia, I uploaded the image to the CoBliS website. The result is shown below:

The deuteranopia image is different, even though the original image did not explicitly use any shade of green. Many colors (including gray) have a green component, and these colors look different to someone with deuteranopia. Overall, the reds and oranges in the image are shifted towards brown, and the bright colors are muted. Nevertheless, the graph is useful because the relative light and dark shades in the graph are distinguishable.

A graph that is not colorblind-safe

The simplest example of a SAS graph that is not colorblind-safe is a scatter plot or line plot that shows several groups, where each group is distinguished only by a color. If you are using the HTMLBlue ODS style, then the second group is colored brick red and the third group is colored forest green. Thus, the second and third groups might be indistinguishable to people with deuteranopia.

The following call to PROC SGPLOT in SAS creates a scatter plot of Fisher's Iris data in which each species of Iris is assigned a different color. The result is shown for the HTMLBlue style and for the ATTRPRIORITY=COLOR option, which tells SAS to use only colors to differentiate groups:

/* Using HTMLBlue ODS style */
ods graphics / AttrPriority=COLOR;
title "Indicate Groups by Using Colors";
title2 "Use AttrPriority=COLOR";
proc sgplot data=sashelp.iris;
   scatter x=PetalWidth y=SepalWidth/ group=Species jitter markerattrs=(symbol=CircleFilled size=12);
   xaxis grid;   yaxis grid;
run;

The output is shown for the original graph and for the same graph as seen by someone with deuteranopia. Because the graph uses only colors to distinguish groups and because the colors include both red and green, it is harder to distinguish between the Versicolor and Virginica species.

An easy solution is to use the ATTRPRIORITY=NONE option, which tells SAS to vary several attributes (colors, marker symbols, and line styles) when assigning attributes to graphical elements. The following SAS statements are essentially the same as before, except for the ATTRPRIORITY= option. I also removed the SYMBOL= value from the MARKERATTRS= option so that the marker symbols vary among groups.

ods graphics / PUSH AttrPriority=NONE;
title "Indicate Groups by Using Colors and Symbols";
title2 "Use AttrPriority=NONE";
proc sgplot data=sashelp.iris;
   scatter x=PetalWidth y=SepalWidth/ group=Species jitter markerattrs=(size=12);
   xaxis grid;   yaxis grid;
run;
ods graphics / POP;

Although the colors are still difficult to distinguish if you have deuteranopia, the marker symbols make it clear which observations belong to which species.

The 'Daisy' ODS style

SAS has put a lot of effort into making sure that all output (tables and graphs) can be accessible to a wide range of users. The documentation section, "Creating Accessible Graphs," recommends several best practices for creating accessible graphs. Of these, I want to emphasize that there is an ODS style, called the 'Daisy' style, that is designed to maximize the interpretability of graphs for people with color vision deficiencies. If you run the previous example under the Daisy style, you get the following graph (on the left). I ran the graph through the CoBliS simulator so that you can see how it appears to someone with deuteranopia (on the right).
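
The following statements sketch one way to apply the style, assuming you send output to the HTML destination; specify the STYLE= option on whichever ODS destination you use.

ods html style=Daisy;   /* use the accessible Daisy ODS style */
proc sgplot data=sashelp.iris;
   scatter x=PetalWidth y=SepalWidth / group=Species jitter markerattrs=(size=12);
   xaxis grid;   yaxis grid;
run;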

Although the olive-green color appears orange and the reddish color appears brown, the three colors are distinguishable to someone with deuteranopia.

Tips for making colorblind-safe statistical graphs

The following tips are simple and easy to follow, but can improve how well your graphs are perceived by people with color vision deficiencies:

  • Avoid using colors in the same graph that colorblind people will be unable to distinguish. The most common deficiency is red-green, but some people are unable to distinguish blue-yellow.
  • In SAS, use the ATTRPRIORITY=NONE option on the ODS GRAPHICS statement to ensure that non-color attributes (such as marker symbols and line styles) are used to encode group information.
  • In SAS, use the Daisy ODS style, which cycles through colors that are more easily distinguished by people with color vision deficiencies.
  • Avoid using a green-yellow-red palette for "traffic lighting" in dashboards. Or choose a "warm green," light yellow, and "cool red" so that the shades of the colors are distinguishable even if the colors are not, as in the sketch after this list.
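
The following statements sketch one such "traffic light" palette. The hex values are illustrative choices (a warm green from the Dark2 palette, a light yellow from Pastel2, and a cool red from RdBu), not an official recommendation.

%let Go      = CX66A61E;   /* warm green   */
%let Caution = CXFFF2AE;   /* light yellow */
%let Stop    = CXB2182B;   /* cool red     */
title "Illustrative Colorblind-Friendlier Traffic-Light Colors";
proc sgplot data=sashelp.cars;
   styleattrs datacolors=(&Go &Caution &Stop);   /* assign group colors */
   vbar type / group=origin groupdisplay=cluster;
run;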

This article is a brief introduction to making graphs accessible to everyone. The following references describe additional tips and best practices for creating accessible graphs in SAS:

The post Tips for making colorblind-safe statistical graphs appeared first on The DO Loop.

January 18, 2023
 

A previous article shows that you can use the Intercept parameter to control the ratio of events to nonevents in a simulation of data from a logistic regression model. If you decrease the intercept parameter, the probability of the event decreases; if you increase the intercept parameter, the probability of the event increases.

The probability of Y=1 also depends on the slope parameters (coefficients of the regressor effects) and on the distribution of the explanatory variables. This article shows how to visualize the probability of an event as a function of the regression parameters. I show the visualization for two one-variable logistic models. For the first model, the distribution of the explanatory variable is standard normal. In the second model, the distribution is exponential.

Visualize the probability as the Intercept varies

In a previous article, I simulated data from a one-variable binary logistic model. I simulated the explanatory variable, X, from a standard normal distribution. The linear predictor is given by η = β0 + β1 X. The probability of the event Y=1 is given by μ = logistic(η) = 1 / (1 + exp(-η)).

In the previous article, I looked at various combinations of Intercept (β0) and Slope (β1) parameters. Now, let's look systematically at how the probability of Y=1 depends on the Intercept for a fixed value of Slope > 0. Most of the following program is explained in my previous post. For your convenience, it is repeated here. First, the program simulates data (N=1000) for a variable, X, from a standard normal distribution. Then it simulates a binary variable for a series of binary logistic models as the intercept parameter varies on the interval [-3, 3]. (To get better estimates of the probability of Y=1, the program simulates 100 data sets for each value of Intercept.) Lastly, the program calls PROC FREQ to compute the proportion of simulated events (Y=1), and it plots the proportion versus the Intercept value.

/* simulate X ~ N(0,1) */
data Explanatory;
call streaminit(12345);
do i = 1 to 1000;
   x = rand("Normal",  0, 1);   /* ~ N(0,1) */
   output;
end;
drop i;
run;
 
/* as Intercept changes, how does P(Y=1) change when Slope=2? */
%let Slope = 2;
data SimLogistic;
call streaminit(54321);
set Explanatory;
do Intercept = -3 to 3 by 0.2;
   do nSim = 1 to 100;               /* Optional: param estimate is better for large samples */
      eta = Intercept + &Slope*x;    /* eta = linear predictor */
      mu = logistic(eta);            /* mu = Prob(Y=1) */
      Y = rand("Bernoulli", mu);     /* simulate binary response */
      output;
   end;
end;
run;
 
proc sort data=SimLogistic; by Intercept; run;
ods select none;
proc freq data=SimLogistic;
   by Intercept;
   tables Y / nocum;
   ods output OneWayFreqs=FreqOut;
run;
ods select all;
 
title "Percent of Y=1 in Logistic Model";
title2 "Slope=&Slope";
footnote J=L "X ~ N(0,1)";
proc sgplot data=FreqOut(where=(Y=1));
   series x=Intercept y=percent;
   xaxis grid;
   yaxis grid label="Percent of Y=1";
run;

In the simulation, the slope parameter is set to Slope=2, and the Intercept parameter varies systematically between -3 and 3. The graph visualizes the probability (as a percentage) that Y=1, given various values for the Intercept parameter. This graph depends on the value of the slope parameter and on the distribution of the data. It also has random variation, due to the call to generate Y as a random Bernoulli variate. For these data, the graph shows the estimated probability of the event as a function of the Intercept parameter. When Intercept = –2, Pr(Y=1) = 22%; when Intercept = 0, Pr(Y=1) = 50%; and when Intercept = 2, Pr(Y=1) = 77%. The statement that Pr(Y=1) = 50% when Intercept=0 is approximately true whenever the data are from a symmetric distribution with zero mean: because logistic(-t) = 1 – logistic(t), the values of logistic(β1X) average to 1/2 when X is symmetric about 0.

Vary the slope and intercept together

In the previous section, the slope parameter is fixed at Slope=2. What if we allow the slope to vary over a range of values? We can restrict our attention to positive slopes because logistic(η) = 1 – logistic(-η). The following program varies the Intercept parameter in the range [-3,3] and the Slope parameter in the range [0, 4].

/* now do two-parameter simulation study where slope and intercept are varied */
data SimLogistic2;
call streaminit(54321);
set Explanatory;
do Slope = 0 to 4 by 0.2;
do Intercept = -3 to 3 by 0.2;
   do nSim = 1 to 50;                  /* optional: the parameter estimates are better for larger samples */
   eta = Intercept + Slope*x;          /* eta = linear predictor */
   mu = logistic(eta);                 /* transform by inverse logit */
   Y = rand("Bernoulli", mu);          /* simulate binary response */
   output;
   end;
end;
end;
run;
 
/* Monte Carlo estimate of Pr(Y=1) for each (Int,Slope) pair */
proc sort data=SimLogistic2; by Intercept Slope; run;
ods select none;
proc freq data=SimLogistic2;
   by Intercept Slope;
   tables Y / nocum;
   ods output OneWayFreqs=FreqOut2;
run;
ods select all;

The output data set (FreqOut2) contains Monte Carlo estimates of the probability that Y=1 for each pair of (Intercept, Slope) parameters, given the distribution of the explanatory variable. You can use a contour plot to visualize the probability of the event for each combination of slope and intercept:

/* Create template for a contour plot
   https://blogs.sas.com/content/iml/2012/07/02/create-a-contour-plot-in-sas.html
*/
proc template;
define statgraph ContourPlotParm;
dynamic _X _Y _Z _TITLE _FOOTNOTE;
begingraph;
   entrytitle _TITLE;
   entryfootnote halign=left _FOOTNOTE;
   layout overlay;
      contourplotparm x=_X y=_Y z=_Z /
        contourtype=fill nhint=12  colormodel=twocolorramp name="Contour";
      continuouslegend "Contour" / title=_Z;
   endlayout;
endgraph;
end;
run;
/* render the Monte Carlo estimates as a contour plot */
proc sgrender data=Freqout2 template=ContourPlotParm;
where Y=1;
dynamic _TITLE="Percent of Y=1 in a One-Variable Logistic Model"
        _FOOTNOTE="X ~ N(0, 1)"
        _X="Intercept" _Y="Slope" _Z="Percent";
run;

As mentioned earlier, because X is approximately symmetric, the contours of the graph have reflective symmetry. Notice that the probability that Y=1 is 50% whenever Intercept=0 for these data. Furthermore, if the points (β0, β1) are on the contour for Pr(Y=1)=α, then the contour for Pr(Y=1)=1-α contains points close to (-β0, β1). The previous line plot is equivalent to slicing the contour plot along the horizontal line Slope=2.

You can use a graph like this to simulate data that have a specified probability of Y=1. For example, if you want approximately 70% of the cases to be Y=1, you can choose any pair of (Intercept, Slope) values along that contour, such as (0.8, 0), (1, 1), (2, 3.4), and (2.4, 4). If you want to see all parameter values for which Pr(Y=1) is close to a desired value, you can use the WHERE statement in PROC PRINT. For example, the following call to PROC PRINT displays all parameter values for which Pr(Y=1) is approximately 0.7 (70%):

%let Target = 70;
proc print data=FreqOut2;
   where Y=1 and %sysevalf(&Target-1) <= Percent <= %sysevalf(&Target+1);
   var Intercept Slope Y Percent;
run;

Probabilities for nonsymmetric data

The symmetries in the previous graphs are a consequence of the symmetry in the data for the explanatory variable. To demonstrate how the graphs change for a nonsymmetric data distribution, you can run the same simulation study, but use data that are exponentially distributed. To eliminate possible effects due to a different mean and variance in the data, the following program standardizes the explanatory variable so that it has zero mean and unit variance.

data Expo;
call streaminit(12345);
do i = 1 to 1000;
   x = rand("Expon", 1.5);   /* ~ Exp(1.5) */
   output;
end;
drop i;
run;
 
proc stdize data=Expo method=Std out=Explanatory;
   var x;
run;

If you rerun the simulation study by using the new distribution of the explanatory variable, you obtain the following contour plot of the probability as a function of the Intercept and Slope parameters:
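
Because PROC STDIZE writes the standardized variable back to the Explanatory data set, you can resubmit the earlier simulation steps without modification. The following sketch assumes you have rerun the SimLogistic2 DATA step, the PROC SORT step, and the PROC FREQ step; only the footnote in the plot needs to change.

/* rerun the SimLogistic2, PROC SORT, and PROC FREQ steps, then: */
proc sgrender data=FreqOut2 template=ContourPlotParm;
where Y=1;
dynamic _TITLE="Percent of Y=1 in a One-Variable Logistic Model"
        _FOOTNOTE="X ~ Standardized Exp(1.5)"
        _X="Intercept" _Y="Slope" _Z="Percent";
run;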

If you compare this new contour plot to the previous one, you will see that they are very similar for small values of the slope parameter. However, they are different for larger values such as Slope=2. The contours in the new plot do not show reflective symmetry about the vertical line Intercept=0. Pairs of parameters for which Pr(Y=1) is approximately 0.7 include (0.8, 0) and (1, 1), which are the same as for the previous plot, and (2.2, 3.4) and (2.6, 4), which are different from the previous example.

Summary

This article presents a Monte Carlo simulation study in SAS to compute and visualize how the Intercept and Slope parameters of a one-variable logistic model affect the probability of the event. A previous article notes that decreasing the Intercept decreases the probability. This article shows that the probability depends on both the Intercept and Slope parameters. Furthermore, the probability depends on the distribution of the explanatory variables. You can use the results of the simulation to control the proportion of events to nonevents in the simulated data.

You can download the complete SAS program that generates the results in this article.

The ideas in this article generalize to logistic regression models that contain multiple explanatory variables. For multivariate models, the effect of the Intercept parameter is similar. However, the effect of the slope parameters is more complicated, especially when the variables are correlated with each other.

The post Visualize how parameters in a binary logistic regression model affect the probability of the event appeared first on The DO Loop.

December 7, 2022
 

For Christmas 2021, I wrote an article about palettes of Christmas colors, chiefly shades of red, green, silver, and gold. One of my readers joked that she would like to use my custom palette to design her own Christmas wrapping paper! I remembered her jest when I saw some artwork by the artist Angela Johal, who uses geometric shapes in her compositions. The image to the right is Johal's composition "RHYTHM & COLOUR NO. 3." When I saw this work, I thought, "If she had used colors from a Christmas palette, that would make an excellent design for wrapping paper!"

This article uses SAS to create a geometric image that is inspired by Johal's work but uses colors in a palette of Christmas colors. This whimsical task can actually teach us several programming techniques in SAS:

  • How to use the POLYGON statement to display squares and other polygons.
  • How to use a discrete attribute map to associate a color with a value of a variable.
  • How to use PROC SURVEYSELECT to randomly select a set of foreground and background colors from a data set that defines a palette of colors.

I will break the article into a series of steps that lead us to the final goal. The steps are:

  1. Create the squares on a 6 x 6 grid. We need 36 colored background squares, 36 semi-transparent colored foreground squares, and 36 small black squares that vary in size.
  2. Create a discrete attribute map that contains the palette of Christmas colors.
  3. Randomly assign a color to each of the background and foreground squares. I chose shades of green and brown for the background squares and shades of red, silver, and gold for the foreground squares.

Display a grid of squares

The Johal-inspired image is composed of 36 subimages that are arranged on an evenly spaced 6 x 6 grid. For convenience, place the centers of the subimages at the integer lattice points (1,1), (1,2), ..., (6,6). Each subimage consists of three squares:

  • A background (BG) square, which is offset slightly to the left and up.
  • A foreground (FG) square, which is offset slightly to the right and down. The foreground square is semi-transparent.
  • A black square, which is centered but randomly varies in size.

You can use a DATA step to define the centers of the subimages. You can use a second DATA step to offset the background and foreground squares and to vary the size of the centered black squares. Lastly, you can use three POLYGON statements in PROC SGPLOT to display the image. The POLYGON statement requires an ID variable that identifies which vertices belong to each polygon.

/* Christmas wrapping paper. Inspired by 
   https://johalgeometrics.com/#/rhythm-colour-no-3/
*/
/* 1. Generate centers of squares on 6 x 6 grid */
data Squares;
length Type $10;
do Type = "Background", "Foreground", "Square";
   do row = 1 to 6;
      do col = 1 to 6;
         output;
      end;
   end;
end;
run;
 
/* 2. create the polygons: each observation (the center) becomes four observations (vertices) */
%let offset = 0.075;
%let s = 0.35;
data Pattern;
call streaminit(123);
c = &offset;         /* shift the FG and BG squares from center */
s = &s;              /* 2*s is the side length for FG and BG squares */
set Squares;
PolyID + 1;          /* create an ID variable for each polygon (square) */
if type = "Background" then do;
   /* (bx, by) are vertices of BG square: shift left and up */
   bx = col - s - c; by = row - s + c; output; 
   bx = col + s - c; by = row - s + c; output;
   bx = col + s - c; by = row + s + c; output;
   bx = col - s - c; by = row + s + c; output;
end;
else if type = "Foreground" then do;
   /* (fx, fy) are vertices of FG square: shift right and down */
   fx = col - s + c; fy = row - s - c; output; 
   fx = col + s + c; fy = row - s - c; output;
   fx = col + s + c; fy = row + s - c; output;
   fx = col - s + c; fy = row + s - c; output;
end;
else if type = "Square" then do;
   /* (sx, sy) are vertices of black square, which is centered but is a random size */
   w = rand("integer", 16, 40) /100 * s; 
   sx = col - w; sy = row - w; output; 
   sx = col + w; sy = row - w; output;
   sx = col + w; sy = row + w; output;
   sx = col - w; sy = row + w; output;
end;
drop s c w;
run;
 
/* POLYGON stmt 1: Squares and colors for background.
   POLYGON stmt 2: Use transparency for squares in foreground
   POLYGON stmt 3: Black squares of random sizes   */
%let gray   = CXF8F8F8;
%let black  = CX110F0C;
 
title "Christmas Wrapping Paper in the Style of Johal";
title2 "No Colors";
ods graphics / width=480px height=480px;
proc sgplot data=Pattern noautolegend noborder;
   styleattrs backcolor=&gray wallcolor=&gray;
   polygon x=bx y=by ID=PolyID / fill fillattrs=(color=darkgray);
   polygon x=fx y=fy ID=PolyID / fill fillattrs=(color=lightgray) transparency=0.3;
   polygon x=sx y=sy ID=PolyID / fill fillattrs=(color=&black);
   xaxis display=none offsetmax=0;
   yaxis display=none thresholdmin=0 thresholdmax=0;
run;

We are off to a good start. The graph shows 36 images on a 6 x 6 grid. The graphic already looks similar to Johal's artwork. In the next section, we will add some color.

Discrete attribute maps

There are 36 x 3 = 108 polygons in the graph, each with its own unique ID. We want to assign colors to the 36 background squares and the 36 foreground squares. (The 36 centered squares will remain black.) Eventually, we will assign the colors randomly, but in this section let's just get ANY Christmas colors into the graphic.

A quick way to add colors is to use the GROUP= option on the POLYGON statement and color each BG and FG polygon by the polygon's ID variable. This will use the colors in the current ODS style to assign colors to the squares. Unfortunately, there is no "Christmas style" that contains colors from a holiday-inspired palette. If we use one of the predefined ODS styles, the squares will be colored with an assortment of blues, purples, and other colors not usually associated with Christmas.

If you are using PROC SGPLOT and want to assign specific colors to specific values of a GROUP= variable, you should define a discrete attribute map. Let's use the 41 colors that I studied last year. We can assign the 17 green and brown colors to the background squares. We can assign the 24 red, yellow, and gold colors to the foreground squares. The following DATA steps define the hexadecimal version of the colors and merge them into one data set:

/* 3. Define colors to use for foreground and background rectangles */
data BGColors;
length Color $8;
input Color @@;
datalines;
CX44690D CX698308 CX216A1B CX03744B CX15885A CX1E9965
CX2A5B53 CX5EB69D CX5F1326 CX237249 CX3B885C CX325C39
CX9C9549 CX779645 CX497542 CX274530 CX6E3C3B
;
data FGColors;
length Color $8;
input Color @@;
datalines;
CXF3C543 CXFFEDC7 CXCA2405 CX9E1007 CXCF1931 CXAD132D
CXD9991A CXEAA61E CXF2BC13 CXFBE646 CXFBC34D CXF69F44
CXECEBF1 CXD34250 CXE5D6B5 CXE3CD8E CXDA111E CXC00214
CXDBAA46 CXFFE9D9 CXFF4A4A CXDB2121 CXBF403B CXEDB361
;
 
/* concatenate the BG and FG colors. Add the Type variable ("Background" or "Foreground"). */
data Colors;
length Type $10;
set BGColors(in=BG) FGColors;
if BG then Type = "Background";
else       Type = "Foreground";
ColorID + 1;
run;
 
/* Create a discrete attribute map that maps these colors and ID numbers:
  https://blogs.sas.com/content/iml/2012/10/17/specify-the-colors-of-groups-in-sas-statistical-graphics.html
  https://blogs.sas.com/content/iml/2019/07/15/create-discrete-heat-map-sgplot.html
*/
data ColorAttrMap;                    /* create discrete attribute map */
keep ID Show Value FillColor MarkerColor;
retain ID 'ChristmasColors'           /* name of map */
     Show 'AttrMap';                  /* always show all groups in legend */
set Colors(rename=(ColorID=Value Color=FillColor));
MarkerColor = FillColor;
run;
 
proc print data=ColorAttrMap;
run;

See the documentation for more information about the structure of a discrete attribute map and the names of its variables. This discrete attribute map associates 41 colors with the values 1–41. If you create a graph that has a GROUP= variable, then the groups for values 1–41 will be assigned colors from the Christmas palette. The following call to PROC SGPLOT displays 41 foreground and background squares and all the black squares:

title "Christmas Wrapping Paper in the Style of Johal";
title2 "Color by Polygon ID";
title3 "Only 41 Squares Are Shown";
proc sgplot data=Pattern noautolegend noborder dattrmap=ColorAttrMap; /* specify discrete attr map */
   WHERE PolyID<=41 OR PolyID > 72;                                   /* TEMPORARY TEMPORARY TEMPORARY */
   styleattrs backcolor=&gray wallcolor=&gray;
   polygon x=bx y=by ID=PolyID / group=PolyID fill attrid=ChristmasColors;                  /* assign colors to PolyID */
   polygon x=fx y=fy ID=PolyID / group=PolyID fill transparency=0.3 attrid=ChristmasColors; /* assign colors to PolyID */
   polygon x=sx y=sy ID=PolyID / fill fillattrs=(color=&black);
   xaxis display=none offsetmax=0;
   yaxis display=none thresholdmin=0 thresholdmax=0;
run;

The PolyID variable was used to map colors to squares. Since there are only 41 colors, I've displayed only 41 foreground and background squares. The first 17 background squares (the lower three rows) are shades of green and brown. The remaining squares are shades of red, orange, silver, and gold. Clearly, the discrete attribute map is working.

In the next section, we randomly assign one of the 17 background colors to each of the 36 background squares, and we randomly assign one of the 24 foreground colors to each of the 36 foreground squares.

Random assignment of colors

Any time you need to perform a random assignment, you should consider using the SURVEYSELECT procedure. For this application, you can use the STRATA statement (which is like a BY statement) to apply the selection both to the set of background colors and to the set of foreground colors. You can use the SAMPSIZE=36 option to ensure that 36 colors from each group are selected. Since the number of colors in each group is less than 36, you must use sampling with replacement (METHOD=URS). Lastly, you should use the OUTRANDOM option, which requires SAS 9.4M5 or later, so that the colors are written in a random order. The following statements generate the random colors, merge the colors into the data set that contains the squares, and use the GROUP=ColorID option to create the final image:

/* 4. Randomly choose 36 foreground and 36 background colors.
   Use the STRATA statement in PROC SURVEYSELECT to choose 36 random colors
   from each Type ("Foreground" or "Background").
   Use OUTRANDOM option to get colors in arbitrary order.
   Note that the data set must be sorted by the STRATA variable. */
proc surveyselect data=Colors out=RandColors(keep=Color Type ColorID)
     seed=12345 noprint
     sampsize=36
     method=URS  /* select with equal probability and with replacement */
     outhits     /* one row for each selected unit */
     outrandom;  /* permute the order of the units (SAS 9.4M5) */
strata Type;
run;
 
/* merge these random colors into the polygon data to establish
   a mapping between the PolyID and the ColorID variables */
data RandomColors;
set RandColors;
by Type;
PolyID + 1;
keep Type ColorID PolyID;
run;
 
data WrappingPaper;
merge Pattern RandomColors;
by Type PolyID;
run;
 
title "Christmas Wrapping Paper in the Style of Johal";
proc sgplot data=WrappingPaper noautolegend noborder dattrmap=ColorAttrMap;
   styleattrs backcolor=&gray wallcolor=&gray;
   polygon x=bx y=by ID=PolyID / group=ColorID fill attrid=ChristmasColors;
   polygon x=fx y=fy ID=PolyID / group=ColorID fill transparency=0.3 attrid=ChristmasColors;
   polygon x=sx y=sy ID=PolyID / fill fillattrs=(color=&black);
   xaxis display=none offsetmax=0;
   yaxis display=none thresholdmin=0 thresholdmax=0;
run;

Success! We have used SAS to create a Johal-inspired graphic in which the colors of the foreground and background squares are randomly assigned from a palette of Christmas colors. The background squares are shades of green and brown. The foreground squares are shades of red, gold, and silver. It might not be suitable for an art gallery, but I think it would look very nice on a wrapped present under my Christmas tree!

Summary

This fun article uses SAS to create a design for Christmas "wrapping paper" in the style of artwork by the artist Angela Johal. Although the article is meant to spread Christmas cheer, it also demonstrates a few useful graphical techniques in SAS. It shows an example of how to use the POLYGON statement in PROC SGPLOT, it shows how to create a discrete attribute map to associate user-specified colors with groups, and it shows how to use PROC SURVEYSELECT to randomly select items (with replacement) from a specified set of items.

Think you can do better? Modify my program! You can change the SEED= option on the PROC SURVEYSELECT statement, or you can change the colors used for the foreground or background squares. Have fun!

The post Art in SAS: Christmas wrapping paper appeared first on The DO Loop.

November 16, 2022
 

A profile plot is a way to display multivariate values for many subjects. The optimal linear profile plot was introduced by John Hartigan in his book Clustering Algorithms (1975). In Michael Friendly's book (SAS System for Statistical Graphics, 1991), Friendly shows how to construct an optimal linear profile by using the SAS/IML language. He displays the plot by using traditional SAS/GRAPH methods: the old GPLOT procedure and an annotation data set. If you do not have a copy of his book, you can download Friendly's %LINPRO macro from his GitHub site.

SAS graphics have changed a lot since 1991. It is now much easier to create the optimal linear profile plot by using modern graphics in SAS. This article reproduces Friendly's SAS/IML analysis to create the data for the plot but uses PROC SGPLOT to create the final graph. This approach does not require an annotate data set.

The optimal linear profile plot

Friendly (1991) illustrates the optimal linear profile plot by using the incidence rate for seven types of crime in 16 US cities in 1970. I have previously shown how to visualize the crime data by using line plots and heat maps. A profile plot for the raw data is shown to the right.

The optimal linear profile plot modifies the profile plot in two ways. It computes an optimal ordering of the categories (crimes) and transforms the raw responses (rates) so that the transformed data are as close to linear as possible. Let's focus on the meaning of the words "linear" and "optimal":

  • Linear: The goal of the linear profile plot is to show linear trends rather than exact data values. The method models the data to determine which cities tend to have relatively low rates of some crimes and relatively high rates of other crimes. Thus, the linear profile plot shows a linear model for the crime trends.
  • Optimal: In creating the optimal linear profiles, you can choose any positions for the categories on the X axis. You can also (affinely) scale the data for each variable in many ways. The algorithm chooses positions and scales that make the transformed data as linear as possible.

The optimal linear profile is constructed by using the first two principal components (PCs) for the data (in wide form). The optimal positions are found as the ratio of the elements of the first two eigenvectors. The slopes and intercepts for the linear profiles are obtained by projecting the centered and scaled data onto the eigenspace, so they are related to the PC scores. See Friendly (1991) or the SAS/IML code for details.

The following DATA step creates the crime data in wide form. The SAS/IML program is taken from the %LINPRO macro and is similar to the code in Friendly (1991). It contains a lot of print statements, which you can comment out if you do not want to see the intermediate calculations. I made a few minor changes and three major changes:

  • Friendly's code computed the covariance and standard deviations by using N (the sample size) as the denominator. My code uses N-1 as the denominator so that the computations of the PCs agree with PROC PRINCOMP.
  • Friendly's code writes a data set (LINFIT), which is used to create an annotate data set. This data set is no longer needed, but I have retained the statements that create it. I have added code that creates two macro variables (POSNAMES and POSVALUES) that you can use to specify the positions of the crime categories.
  • The program uses PROC SGPLOT to create the optimal linear profile plot. The labels for the lines are displayed automatically by using the CURVELABEL option on the SERIES statement. The VALUES= and VALUESDISPLAY= options are used to position the categories at their optimal positions.

/* Data from Hartigan (1975), reproduced in Friendly (SAS System for Statistical Graphics, 1991)
   http://friendly.apps01.yorku.ca/psy6140/examples/cluster/linpro2.sas
*/
data crime;
   input city $16. murder rape robbery assault burglary larceny auto_thf;
datalines;
Atlanta         16.5  24.8  106  147  1112   905  494
Boston           4.2  13.3  122   90   982   669  954
Chicago         11.6  24.7  340  242   808   609  645
Dallas          18.1  34.2  184  293  1668   901  602
Denver           6.9  41.5  173  191  1534  1368  780
Detroit         13.0  35.7  477  220  1566  1183  788
Hartford         2.5   8.8   68  103  1017   724  468
Honolulu         3.6  12.7   42   28  1457  1102  637
Houston         16.8  26.6  289  186  1509  787   697
Kansas City     10.8  43.2  255  226  1494  955   765
Los Angeles      9.7  51.8  286  355  1902  1386  862
New Orleans     10.3  39.7  266  283  1056  1036  776
New York         9.4  19.4  522  267  1674  1392  848
Portland         5.0  23.0  157  144  1530  1281  488
Tucson           5.1  22.9   85  148  1206   757  483
Washington      12.5  27.6  524  217  1496  1003  739
;
 
/* Program from Friendly (1991, p. 424-427) and slightly modified 
   by Rick Wicklin (see comments). You can download the LINPRO MACRO
   from https://github.com/friendly/SAS-macros/blob/master/linpro.sas
 
   The program assumes the data are in wide form:
   There are k response variables. Each row is the response variables 
   for one subject. The name of the subject is contained in an ID variable.
 
   The output for plotting is stored in the data set LINPLOT.
   Information about the linear regressions is stored in the data set LINFIT.
*/
 
%let data = crime;    /* name of data set */
%let var = _NUM_;     /* list of variable names or the _NUM_ keyword */
%let id = City;       /* subject names */
proc iml;
   *--- Get input matrix and dimensions;
   use &data;
   if upcase("&var") = "_NUM_" then
      read all var _NUM_ into X[colname=vars];
   else
      read all var {&var} into X[colname=vars];
   n = nrow(X);
   p = ncol(X);
   m = mean(X);
   D = X - m;         *- centered cols = deviations from means;
 
   if "&id" = " " then
      ID = T(1:nrow(X));
   else
      read all var{&id} into ID;
   if type(ID)='N' then ID = compress(char(id));
   close;
 
   *--- Compute covariance matrix of x;
   /* Modification by Wicklin: Friendly's COV and SD are different 
      because he uses a divisor of n for the COV matrix instead of n-1 */
   C = cov(X);   
   sd = std(X);
   /* If you want to exactly reproduce Friendly's macro, use the following: 
   c = d` * d /  n;               *- variance-covariance matrix;
   sd = sqrt( vecdiag( c))`;      *- standard deviations;
   */
 
   *--- Standardize if vars have sig diff scales;
   ratio = max(sd) / min(sd);
   if ratio > 3 then do;
      s = sd;             * s is a scale factor;
      C = cov2corr(C);
      Clbl = "Correlation Matrix";
   end;
   else do;
      s = j(1, p, 1);
      Clbl = "Covariance Matrix";
   end;
   print C[colname=vars rowname=vars f=7.3 L=Clbl];
 
   *--- Eigenvalues & vectors of C;
   call eigen(val, vec, C);
 
   *--- Optional: Display information about the eigenvectors;
   prop = val / val[+];
   cum = cusum(prop);
   T = val || dif(val,-1) || prop || cum;
   cl = {'Eigenvalue' 'Difference' 'Proportion' 'Cumulative'}; 
   lbl = "Eigenvalues of the " + Clbl;
   print T[colname=cl f=8.3 L=lbl];
 
   *--- Scale values;
   e1 = vec[ , 1] # sign( vec[<> , 1]);
   e2 = vec[ , 2];
   pos = e2 / e1;
   Ds = D;                /* the raw difference matrix */
   D = Ds / ( s # e1` );  /* a scaled difference matrix */
 
   *--- For case i, fitted line is  Y = F1(I) + F2(I) * X   ;
   f1 = ( D * e1 ) / e1[##];  /* ssq(e1) = ssq(e2) = 1 */
   f2 = ( D * e2 ) / e2[##];
   f = f1 || f2;
 
   *--- Optionally display the results;
   scfac = sd` # e1;   * NOTE: Friendly rounds this to the nearest 0.1 unit;
   table = m` || sd` || e1 || e2 || scfac || pos;
   ct = { 'Mean' 'Std_Dev' 'Eigvec1' 'Eigvec2' 'Scale' 'Position'};
   print table [ rowname=vars colname=ct f=8.2];
 
   *--- Rearrange columns;
   r = rank( pos);
   zz = table;  table [r, ] = zz;
   zz = vars;   vars [ ,r] = zz;
   zz = pos;    pos [r, ] = zz;
   zz = scfac;  scfac [r, ] = zz;
   zz = D;      D [ ,r] = zz;
   *--- Optionally display the results;
   print table[rowname=vars colname=ct f=8.2 L="Variables reordered by position"];
   lt = { 'Intercept' 'Slope'};
   print f[rowname=&id colname=lt f=7.2 L="Case lines"];
 
   *--- Fitted values, residuals;
   fit = f1 * j( 1 , p) + f2 * pos`;
   resid = D - fit;
   *--- Optionally display the results;
   print fit [ rowname=&id colname=vars format=7.3 L="Fitted values"];
   print resid [ rowname=&id colname=vars format=7.3 L="Residuals"];
 
   *--- Optionally display summary statistics for fit;
   sse = resid[##];   ssf = fit[##];
   sst = D[##];       vaf = ssf / (sse+ssf);
   print sse ssf sst vaf;
 
   sse = (ds-fit)[##];   sst = ds[##];
   vaf = ssf / (sse+ssf);
   print sse ssf sst vaf;
 
   *--- Construct output array for annotate data set -
          residuals bordered by fitted scale values;
   v1 = val[ 1 ] || {0};
   v2 = {0} || val[ 2 ];
   xout = ( resid || f ) //
          ( pos` || v1 ) //
          ( scfac` || v2 );
   rl = colvec(ID) // {'Variable','Scale'};
   cl = colvec(vars) // {'Intercept','Slope'};
   create LINFIT from xout[ rowname=rl colname=cl ];
      append from xout[ rowname= rl ];
   close;
   free rl cl xout;
 
   *--- Output the array to be plotted;
   do col = 1 to p;
       rows = j(n,1,pos[col,]) ||  fit[,col] || resid[,col];
       pout = pout // rows;
       rl = rl // shape(ID,1);
   end;
   cl = { 'Position' 'Fit' 'Residual'};
   create LINPLOT from pout[ rowname=rl colname=cl ];
      append from pout[ rowname=rl ];
   close;
 
   /* Modification by Wicklin: create two macro variables for 
      labeling the plotting positions. For example:
     POSNAME = 'larceny' 'burglary' 'auto_thf' 'rape' 'robbery' 'assault' 'murder'
     POSVALUES= -1.698 -1.014 -0.374 0.1389 0.5016 0.5805 2.1392
   */
   M1 = compress("'" + vars + "'");
   M1 = rowcat(M1 + " ");
   M2 = rowcat(char(pos`,6) + " ");
   call symputx("posNames", M1);
   call symputx("posValues",M2);
quit;
 
%put &=posNames;
%put &=posValues;
 
/* Use the POSNAMES and POSVALUES macro variables to add custom 
   tick labels and values. See
   https://blogs.sas.com/content/iml/2020/03/23/custom-tick-marks-sas-graph.html
*/
ods graphics / height=640px width=640px;
title "Optimal Linear Profiles for City Crime Rates";
footnote J=L "Friendly (1991, p. 430) Output 8.11";
proc sgplot data=LINPLOT(rename=(RL=&ID));
  label fit="Fitted Crime Index" position="Crime (Scaled Position)";
  series x=position y=fit / group=city curvelabel curvelabelloc=outside;
  xaxis values = (&posValues) valuesdisplay=(&posNames) fitpolicy=stagger;  
run;
footnote;title;

The optimal linear profile plot is shown for the crimes data. Unlike Friendly, I do not show the points along these lines. I think adding markers gives the false impression that the method has perfectly linearized the data.

The X axis shows the optimal positions of the crimes. They are roughly ordered so that crimes against property are on the left and violent crimes against persons are on the right. As is often the case, the PC interpretation does not perfectly agree with our intuitions. For these data, the analysis has placed rape to the left of robbery and assault. Notice that I use the VALUES= and VALUESDISPLAY= to specify the positions of the labels, and I use the FITPOLICY=STAGGER option to prevent the labels from colliding.

The cities can be classified according to their linear trends. For brevity, let's call larceny and burglary "property crimes," and let's call robbery, assault, and murder "violent crimes." Then you can observe the following classifications for the cities:

  • The lines for some cities (Dallas, Chicago, Houston) have a large positive slope. This indicates that the crime rates for property crimes are low relative to the rates of violent crimes.
  • The lines for other cities (Los Angeles, New York, Denver) have a large negative slope. This indicates that the crime rates for property crimes are high relative to the rates of violent crimes.
  • The lines for other cities (Kansas City, Tucson, Hartford) are relatively flat. This indicates that the crime rates for property crimes are about the same as the rates of violent crimes.
  • When the line for one city is above the line for another city, it means that the first city tends to have higher crime rates across the board than the second city. For example, the line for Los Angeles is above the line for New York. If you look at the raw data, you will see that the crime rates in LA are mostly higher, although NYC has higher rates for robbery and larceny. Similarly, the rates for Portland are generally higher than the rates for Honolulu, except for auto theft.
  • The ordering of the city labels is based primarily on the violent crime rates.

The following color image is copied from Friendly (1991, Appendix 3, p. 673) so that you can compare the SAS/GRAPH output to the newer output.

An alternative to the optimal linear profile plot

I'll be honest, I am not a big fan of the optimal linear profile plot. I think it can give a false impression about the data. It displays a linear model but does not provide diagnostic tools to assess issues such as outliers or deviations from the model. For example, the line for Boston shows no indication that auto theft in that city was extremely high in 1970. In general, there is no reason to think that one set of positions for the categories (crimes) leads to a linear profile for all subjects (cities).

If you want to show the data instead of a linear model, see my previous article, "Profile plots in SAS," and consider using the heat map near the end of the article. If you want to observe similarities and differences between the cities based on the multivariate data, I think performing a principal component analysis is a better choice. The following call to PROC PRINCOMP does a good job of showing similarities and differences between the categories and subjects. The analysis uses the correlation matrix, which is equivalent to saying that the crime rates have been standardized.

ods graphics/reset;
proc princomp data=crime out=Scores N=2 plots(only)=(PatternProfile Pattern Score);
   var murder rape robbery assault burglary larceny auto_thf;
   id City;
run;

For this analysis, the "Pattern Profile Plot" (not shown) shows that the first PC can be interpreted as the total crime rate. The second PC is a contrast between property crimes and violent crimes. The component pattern plot shows that the position of the variables along the second PC is nearly the same as the positions computed by the optimal linear profile algorithm. You can see property crimes in the lower half of the plot and violent crimes in the upper half. The "Score Plot" enables you to see the following characteristics of cities:

  • Cities to the right have high overall crime rates. Cities to the left have low overall crime rates.
  • Cities near the top have higher rates of violent crime relative to property crimes. Cities near the bottom have higher rates of property crimes. Cities near the middle have roughly balanced rates for the two types of crimes.

In my opinion, the score plot provides the same information as the optimal linear profile plot. However, interpreting the score plot requires knowledge of principal components, which is an advanced skill. The optimal linear profile plot tries to convey this information in a simpler display that is easier to explain to a nontechnical audience.

Summary

This article reproduces Friendly's (1991) implementation of Hartigan's optimal linear profile plot. Friendly distributes the %LINPRO macro, which uses SAS/IML to carry out the computations and uses SAS/GRAPH to construct the linear profile plot. This article reproduces the SAS/IML computations but replaces the SAS/GRAPH calls with a call to PROC SGPLOT.

This article also discusses the advantages and disadvantages of the optimal linear profile plot. In my opinion, there are better ways to explore the similarities between subjects across many variables, including heat maps and score plots in principal component analysis.

The post Optimal linear profile plots in SAS appeared first on The DO Loop.

   f = f1 || f2;
 
   *--- Optionally display the results;
   scfac = sd` # e1;   * NOTE: Friendly rounds this to the nearest 0.1 unit;
   table = m` || sd` || e1 || e2 || scfac || pos;
   ct = { 'Mean' 'Std_Dev' 'Eigvec1' 'Eigvec2' 'Scale' 'Position'};
   print table [ rowname=vars colname=ct f=8.2];
 
   *--- Rearrange columns;
   r = rank( pos);
   zz = table;  table [r, ] = zz;
   zz = vars;   vars [ ,r] = zz;
   zz = pos;    pos [r, ] = zz;
   zz = scfac;  scfac [r, ] = zz;
   zz = D;      D [ ,r] = zz;
   *--- Optionally display the results;
   print table[rowname=vars colname=ct f=8.2 L="Variables reordered by position"];
   lt = { 'Intercept' 'Slope'};
   print f[rowname=&id colname=lt f=7.2 L="Case lines"];
 
   *--- Fitted values, residuals;
   fit = f1 * j( 1 , p) + f2 * pos`;
   resid = D - fit;
   *--- Optionally display the results;
   print fit [ rowname=&id colname=vars format=7.3 L="Fitted values"];
   print resid [ rowname=&id colname=vars format=7.3 L="Residuals"];
 
   *--- Optionally display summary statistics for fit;
   sse = resid[##];   ssf = fit[##];
   sst = D[##];       vaf = ssf / (sse+ssf);
   print sse ssf sst vaf;
 
   sse = (ds-fit)[##];   sst = ds[##];
   vaf = ssf / (sse+ssf);
   print sse ssf sst vaf;
 
   *--- Construct output array for annotate data set -
          residuals bordered by fitted scale values;
   v1 = val[ 1 ] || {0};
   v2 = {0} || val[ 2 ];
   xout = ( resid || f ) //
          ( pos` || v1 ) //
          ( scfac` || v2 );
   rl = colvec(ID) // {'Variable','Scale'};
   cl = colvec(vars) // {'Intercept','Slope'};
   create LINFIT from xout[ rowname=rl colname=cl ];
      append from xout[ rowname= rl ];
   close;
   free rl cl xout;
 
   *--- Output the array to be plotted;
   do col = 1 to p;
       rows = j(n,1,pos[col,]) ||  fit[,col] || resid[,col];
       pout = pout // rows;
       rl = rl // shape(ID,1);
   end;
   cl = { 'Position' 'Fit' 'Residual'};
   create LINPLOT from pout[ rowname=rl colname=cl ];
      append from pout[ rowname=rl ];
   close;
 
   /* Modification by Wicklin: create two macro variables for 
      labeling the plotting positions. For example:
     POSNAME = 'larceny' 'burglary' 'auto_thf' 'rape' 'robbery' 'assault' 'murder'
     POSVALUES= -1.698 -1.014 -0.374 0.1389 0.5016 0.5805 2.1392
   */
   M1 = compress("'" + vars + "'");
   M1 = rowcat(M1 + " ");
   M2 = rowcat(char(pos`,6) + " ");
   call symputx("posNames", M1);
   call symputx("posValues",M2);
quit;
 
%put &=posNames;
%put &=posValues;
 
/* Use the POSNAMES and POSVALUES macro variables to add custom 
   tick labels and values. See
   https://blogs.sas.com/content/iml/2020/03/23/custom-tick-marks-sas-graph.html
*/
ods graphics / height=640px width=640px;
title "Optimal Linear Profiles for City Crime Rates";
footnote J=L "Friendly (1991, p. 430) Output 8.11";
proc sgplot data=LINPLOT(rename=(RL=&ID));
  label fit="Fitted Crime Index" position="Crime (Scaled Position)";
  series x=position y=fit / group=city curvelabel curvelabelloc=outside;
  xaxis values = (&posValues) valuesdisplay=(&posNames) fitpolicy=stagger;  
run;
footnote;title;

The optimal linear profile plot is shown for the crimes data. Unlike Friendly, I do not show the points along these lines. I think adding markers gives the false impression that the method has perfectly linearized the data.

The X axis shows the optimal positions of the crimes. They are roughly ordered so that crimes against property are on the left and violent crimes against persons are on the right. As is often the case, the PC interpretation does not perfectly agree with our intuitions. For these data, the analysis has placed rape to the left of robbery and assault. Notice that I use the VALUES= and VALUESDISPLAY= options to specify the positions of the labels, and I use the FITPOLICY=STAGGER option to prevent the labels from colliding.

The cities can be classified according to their linear trends. For brevity, let's call larceny and burglary "property crimes," and let's call robbery, assault, and murder "violent crimes." Then you can observe the following classifications for the cities:

  • The lines for some cities (Dallas, Chicago, Houston) have a large positive slope. This indicates that the crime rates for property crimes are low relative to the rates of violent crimes.
  • The lines for other cities (Los Angeles, New York, Denver) have a negative slope. This indicates that the crime rates for property crimes are high relative to the rates of violent crimes.
  • The lines for other cities (Kansas City, Tucson, Hartford) are relatively flat. This indicates that the crime rates for property crimes are about the same as the rates of violent crimes.
  • When the line for one city is above the line for another city, it means that the first city tends to have higher crime rates across the board than the second city. For example, the line for Los Angeles is above the line for New York. If you look at the raw data, you will see that the crime rates in LA are mostly higher, although NYC has higher rates for robbery and larceny. Similarly, the rates for Portland are generally higher than the rates for Honolulu, except for auto theft.
  • The ordering of the city labels is based primarily on the violent crime rates.

The following color image is copied from Friendly (1991, Appendix 3, p. 673) so that you can compare the SAS/GRAPH output to the newer output.

An alternative to the optimal linear profile plot

I'll be honest, I am not a big fan of the optimal linear profile plot. I think it can give a false impression about the data. It displays a linear model but does not provide diagnostic tools to assess issues such as outliers or deviations from the model. For example, the line for Boston shows no indication that auto theft in that city was extremely high in 1970. In general, there is no reason to think that one set of positions for the categories (crimes) leads to a linear profile for all subjects (cities).

If you want to show the data instead of a linear model, see my previous article, "Profile plots in SAS," and consider using the heat map near the end of the article. If you want to observe similarities and differences between the cities based on the multivariate data, I think performing a principal component analysis is a better choice. The following call to PROC PRINCOMP does a good job of showing similarities and differences between the categories and subjects. The analysis uses the correlation matrix, which is equivalent to saying that the crime rates have been standardized.

ods graphics/reset;
proc princomp data=crime out=Scores N=2 plot(only)=(PatternProfile Pattern Score);
   var murder rape robbery assault burglary larceny auto_thf;
   id City;
run;

For this analysis, the "Pattern Profile Plot" (not shown) shows that the first PC can be interpreted as the total crime rate. The second PC is a contrast between property crimes and violent crimes. The component pattern plot shows that the position of the variables along the second PC is nearly the same as the positions computed by the optimal linear profile algorithm. You can see property crimes in the lower half of the plot and violent crimes in the upper half. The "Score Plot" enables you to see the following characteristics of cities:

  • Cities to the right have high overall crime rates. Cities to the left have low overall crime rates.
  • Cities near the top have higher rates of violent crime relative to property crimes. Cities near the bottom have higher rates of property crimes. Cities near the middle have rates of violent and property crimes that are roughly balanced.

In my opinion, the score plot provides the same information as the optimal linear profile plot. However, interpreting the score plot requires knowledge of principal components, which is an advanced skill. The optimal linear profile plot tries to convey this information in a simpler display that is easier to explain to a nontechnical audience.

Summary

This article reproduces Friendly's (1991) implementation of Hartigan's optimal linear profile plot. Friendly distributes the %LINPRO macro, which uses SAS/IML to carry out the computations and uses SAS/GRAPH to construct the linear profile plot. This article reproduces the SAS/IML computations but replaces the SAS/GRAPH calls with a call to PROC SGPLOT.

This article also discusses the advantages and disadvantages of the optimal linear profile plot. In my opinion, there are better ways to explore the similarities between subjects across many variables, including heat maps and score plots in principal component analysis.

The post Optimal linear profile plots in SAS appeared first on The DO Loop.

11月 142022
 

A profile plot is a compact way to visualize many variables for a set of subjects. It enables you to investigate which subjects are similar to or different from other subjects. Visually, a profile plot can take many forms. This article shows several profile plots: a line plot of the original data, a line plot of standardized data, and a heat map.

This article uses crime data to illustrate the various kinds of plots. The line plot at the right shows the incidence rate for seven types of crime in 16 US cities in 1970. The profile plot looks somewhat like a spaghetti plot, but a spaghetti plot typically features a continuous variable (often, time) along the vertical axis, whereas the horizontal axis in a profile plot features discrete categories. The vertical axis shows a response variable, which is often a count, a rate, or a proportion. The line segments enable you to trace the rates for each city. You can use the graph to discover which cities have high crime rates, which have low rates, and which are high for some crimes but low for others.

In the example, the subjects are cities; the variables are the seven crime rates. However, a profile plot is applicable to many fields. For example, in a clinical setting, the subjects could be patients and the variables could be the results of certain lab tests, such as blood levels of sugar, cholesterol, iron, and so forth. Be aware, however, that there are several different graphs that are known as "patient profile plots." Some are more complicated than the profile plots in this article.

A simple profile plot

For the city crime data, all variables are measured as rates per 100K people. Because all variables are measured on the same scale, you can plot the raw data. You could use a box plot if you want to show the distribution of rates for each crime. However, the purpose of a profile plot is to be able to trace each subject's relative rates across all variables. To do that, let's use a line plot.

The data are originally from the United States Statistical Abstracts (1970) and were analyzed by John Hartigan (1975) in his book Clustering Algorithms. The data were reanalyzed by Michael Friendly (SAS System for Statistical Graphics, 1991). This article follows Friendly's analysis for the line plots. The heat map in the last section is not considered in Friendly (1991).

The following SAS DATA step defines the data in the wide format (seven variables and 16 rows). However, to plot the data as a line plot in which colors identify the cities, it is useful to convert the data from wide to long format. For this example, I use PROC TRANSPOSE. You can then use the SERIES statement in PROC SGPLOT to visualize the rates for each crime and for each city.

/* Data from Hartigan (1975), reproduced in Friendly (SAS System for Statistical Graphics, 1991)
   http://friendly.apps01.yorku.ca/psy6140/examples/cluster/linpro2.sas
*/
data crime;
   input city &$16. murder rape robbery assault burglary larceny auto_thf;
datalines;
Atlanta         16.5  24.8  106  147  1112   905  494
Boston           4.2  13.3  122   90   982   669  954
Chicago         11.6  24.7  340  242   808   609  645
Dallas          18.1  34.2  184  293  1668   901  602
Denver           6.9  41.5  173  191  1534  1368  780
Detroit         13.0  35.7  477  220  1566  1183  788
Hartford         2.5   8.8   68  103  1017   724  468
Honolulu         3.6  12.7   42   28  1457  1102  637
Houston         16.8  26.6  289  186  1509  787   697
Kansas City     10.8  43.2  255  226  1494  955   765
Los Angeles      9.7  51.8  286  355  1902  1386  862
New Orleans     10.3  39.7  266  283  1056  1036  776
New York         9.4  19.4  522  267  1674  1392  848
Portland         5.0  23.0  157  144  1530  1281  488
Tucson           5.1  22.9   85  148  1206   757  483
Washington      12.5  27.6  524  217  1496  1003  739
;
 
/* convert to long format and use alphabetical ordering along X axis */
proc transpose data=crime out=crime2(rename=(col1=rate)) name=crime;
   by city;
run;
proc sort data=crime2;  by crime;  run;
 
title "Profile Plot of Crime Rates in US Cities in 1970";
footnote J=L "Friendly (1991, p. 421) Output 8.9";
proc sgplot data=crime2;
  label rate="Crime Rate (per 100K population)" crime="Crime";
  series x=crime y=rate / group=city markers;
run;

The graph is shown at the top of this article. It is not very useful. The graph does not enable you to trace the crime rate patterns for each city. To quote Friendly (1991, p. 422): "In plotting the raw data, the high-frequency crimes such as burglary completely dominate the display, and the low-frequency crimes (murder, rape) are compressed to the bottom of the plot. Because of this, the possibility that the cities may differ in their patterns is effectively hidden."

Whenever you are trying to graph data that are on different scales, it is useful to transform the scale of the data. You could do this globally (for example, a logarithmic transformation) or you can focus on relative aspects. Following Friendly (1991), the next section standardizes the data, thereby focusing the attention on which cities have relatively high or low crime rates for various crimes.
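
For example, a logarithmic axis is one quick global rescaling. The following sketch (my addition, not part of Friendly's analysis) plots the raw rates from the first profile plot on a base-10 log axis:

title "Profile Plot of Crime Rates (Log Scale)";
proc sgplot data=crime2;
  label rate="Crime Rate (per 100K population)" crime="Crime";
  series x=crime y=rate / group=city markers;
  yaxis type=log logbase=10;   /* all rates are positive, so a log axis is valid */
run;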

A profile plot for standardized data

The most common way to standardize data is to use an affine transformation for each variable so that the standardized variables have mean 0 and unit variance. You can use PROC STANDARD in SAS to standardize the variables. You can specify the name of variables by using the VAR statement, or you can use the keyword _NUMERIC_ to standardize all numeric variables in the data. The following statements standardize the original wide data and then convert the standardized data to long format for graphing.

proc standard data=crime mean=0 std=1 out=stdcrime;
   var _numeric_;
run;
/* convert to long format and ensure a consistent ordering along X axis */
proc transpose data=stdcrime out=stdcrime2(rename=(col1=rate)) name=crime;
   by city;
run;
proc sort data=stdcrime2;  by crime;  run;
 
title "Profile Plot of Standardized Crime Rates in 1970";
footnote J=L "Friendly (1991, p. 423) Output 8.10";
proc sgplot data=stdcrime2;
  label rate="Crime Rate (standard score)" crime="Crime";
  series x=crime y=rate / group=city markers;
run;

Unfortunately, for these data, this graph is not much better than the first profile plot. Friendly (1991, p. 423) calls the plot "dismal" because it is "hard to see any similarities among the cities in their patterns of crime." With a close inspection, you can (barely) make out the following facts:

  • For most crimes, Los Angeles, Detroit, and New York have high rates.
  • For most crimes, Tucson and Hartford have low rates.
  • The crime rate in Boston is low except for auto theft.

This plot is similar to a parallel coordinates plot. There are three ways to improve the standardized profile plot:

  • Rearrange the order of the categories along the X axis by putting similar categories next to each other. I created one possible re-ordering, in which property crimes are placed on the left and crimes against persons are placed on the right (see the sketch after this list). The profile plot is slightly improved, but it is still hard to read.
  • Change the affine transformation for the variables. Instead of always using mean 0 and unit variance, perhaps there is a better way to transform the response values? And perhaps using nonuniform spacing for the categories could make the profiles easier to read? These two ideas are implemented in the "optimal linear profile plot" (Hartigan, 1975; Friendly, 1991). I will show how to create the optimal linear profile plot in a subsequent article.
  • Abandon the spaghetti-like line plot in favor of a lasagna plot, which uses a heat map instead of a line plot. This is shown in the next section.
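
The following statements sketch the first idea. The ordering of the crimes is my own choice (property crimes on the left, violent crimes on the right):

title "Profile Plot of Standardized Crime Rates (Reordered)";
proc sgplot data=stdcrime2;
  label rate="Crime Rate (standard score)" crime="Crime";
  series x=crime y=rate / group=city markers;
  /* property crimes on the left; violent crimes on the right */
  xaxis values=('auto_thf' 'larceny' 'burglary' 'robbery' 'rape' 'assault' 'murder');
run;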

A lasagna plot of profiles

The main advantage of a lasagna plot is that data are neatly displayed in rows and columns. For each subject, you can examine the distribution of the response variable across categories. Furthermore, a rectangular heat map is symmetric: for each category, you can examine the distribution of the response variable across subjects! For these data, the subjects are cities, the categories are crimes, and the response variable is either the raw crime rate or the standardized rate. The disadvantage of a heat map is that the response values are represented by colors, and it can be difficult to perceive changes in colors for some color ramps. On the other hand, you can always print the data in tabular form if you need to know the exact values of the response variables.

The following statements create a lasagna plot for the standardized data in this example. Like the line plot, the heat map uses the data in long form:

%let colormap = (CX0571B0 CX92C5DE CXF7F7F7 CXF4A582 CXCA0020);
/* lasagna plot: https://blogs.sas.com/content/iml/2016/06/08/lasagna-plot-in-sas.html 
   and https://blogs.sas.com/content/iml/2022/03/28/viz-missing-values-longitudinal.html */
title "Lasagna Plot of City Profiles";
title2 "Standardized Crime Rates in 1970";
proc sgplot data=stdcrime2;
   label city="City" crime="Crime" rate="Standardized Rate";
   heatmapparm x=crime y=city colorresponse=rate / outline outlineattrs=(color=gray) colormodel=&colormap; 
run;

This is a better visualization of the crime profiles for each city. You can easily see the standardized rates and detect patterns among the cities. You can see that Detroit, Kansas City, Los Angeles, New York, and Washington (DC) have high relative rates. You can see that Atlanta, Boston, Hartford, Portland, and Tucson tend to have relatively low rates. You can see the crimes for which Atlanta, Boston, and Portland do not have low rates.

This display uses an alphabetical ordering of both the categories (crimes) and the subject (cities). However, you can sort the rows and columns of a lasagna plot to improve the visualization. For example, the following heat map sorts the cities by the average standardized crime rate across all categories. It also arranges the horizontal axis so that nonviolent crimes against property are on the left and violent crimes against people on the right:
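
The code for the sorted heat map is not shown above. The following statements are one way to produce it; the data set and variable names (cityMeans, stdcrime3, meanRate) are mine, and the crime ordering is my own choice:

/* sort cities by their mean standardized rate; DISCRETEORDER=DATA honors the data order */
proc means data=stdcrime2 noprint nway;
   class city;
   var rate;
   output out=cityMeans mean=meanRate;
run;
proc sort data=stdcrime2 out=stdLong;  by city;  run;
data stdcrime3;
   merge cityMeans(keep=city meanRate) stdLong;   /* both are sorted by city */
   by city;
run;
proc sort data=stdcrime3;  by meanRate city;  run;
 
title "Lasagna Plot of City Profiles (Sorted)";
title2 "Standardized Crime Rates in 1970";
proc sgplot data=stdcrime3;
   label city="City" crime="Crime" rate="Standardized Rate";
   heatmapparm x=crime y=city colorresponse=rate / outline outlineattrs=(color=gray) colormodel=&colormap;
   yaxis discreteorder=data;
   xaxis values=('auto_thf' 'larceny' 'burglary' 'robbery' 'rape' 'assault' 'murder');
run;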

This arrangement makes it easier to see outliers such as the orange and red cells in the lower left of the graph. Both Portland and Honolulu have rates of property crime that are higher than you might suspect, given their overall low crime rates. Similarly, Boston has relatively high rates of auto theft compared to its otherwise low crime rates. In the same way, the relatively low rate of rapes in New York City is apparent.

Summary

This article shows three ways to create a profile plot in SAS. A profile plot enables you to visualize the values of many variables for a set of subjects. The simplest profile plot is a line plot of the raw values, where each line corresponds to a subject and the discrete horizontal axis displays the names of variables. However, if the variables are measured in different units or on different scales, it is useful to rescale the values of the variables. One way is to create a line plot of the standardized variables, which displays the relative values (high or low) of each variable. A third display is a heat map (sometimes called a lasagna plot), which uses color to display the high or low values of each variable.

In this article, the categories are uniformly spaced along the horizontal axis. In a subsequent article, I will show an alternative plot that uses nonuniform spacing.

The post Profile plots in SAS appeared first on The DO Loop.

10月 262022
 

A SAS programmer asked how to create a graph that shows whether missing values in one variable are associated with certain values of another variable. For example, a patient who is supposed to monitor his blood glucose daily might have more missing measurements near holidays and in the summer months due to vacations. In this example, the "missingness" in the glucose variable is related to the date variable. In general, we say that missing values depend on another variable when the probability of obtaining a missing value in one variable depends on the value of another variable.

It can be useful to visualize the dependency if you are trying to impute values to replace the missing values. In missing-value imputation, the analyst must decide how to model the missing values. Three popular assumptions are that the values are missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). You can read more about these missing-value assumptions. The visualization in this article can help you decide whether to treat the data as MCAR or MAR.

Generate fake data and missing values

Let's generate some fake data. The following DATA step creates four variables:

  • The DATE variable does not contain any missing values. It contains daily values between 01JAN2018 and 31DEC2020.
  • The X1 variable is missing only in 2020. In 2020, there is an 8% probability that X1 is missing.
  • The X2 variable is missing only on Sundays. On Sundays, there is a 10% probability that X2 is missing.
  • The X3 variable is nonmissing from March through October. During the winter months, there is a 5% probability that X3 is missing.
/* generate FAKE data to use as an example */
data Have;
call streaminit(1234);
format Date DATE9.;
do Date = '01JAN2018'd to '31DEC2020'd;
   x1 = rand("Normal");
   x2 = rand("Normal");
   x3 = rand("Normal");
   /* assign missing values:
      x1 is missing only in 2020
      x2 is missing only on Sundays
      x3 is missing only in Nov, Dec, Jan, and Feb
   */
   if year(Date)=2020            & rand("Bernoulli", 0.08) then x1 = .;
   if weekday(date)=1            & rand("Bernoulli", 0.10) then x2 = .;
   if month(date) in (11,12,1,2) & rand("Bernoulli", 0.05) then x3 = .;
   output;
end;
run;

The goal of this article is to create a graph that indicates how the missing values in the X1-X3 variables depend on time. In a previous article, I showed how to use heat maps in SAS/IML to visualize missing values. In the next section, I use the SAS DATA step to transform the data into a form that you can visualize by using the HEATMAPPARM statement in PROC SGPLOT.

Visualize patterns of missing values

To visualize the patterns of missing values, I use the following ideas, which appear as numbered comments in the program:

/* Plot missing values of X1-X3 versus the values of Date. */
%let IndepVar = Date;           /* variable to show on Y axis */
%let DepVars  = x1 x2 x3;       /* visualize missing values for these variables */
 
/* 1. Filter the data to keep only observations that have missing values.
   2. Convert from wide to long. New variables VARNAME and ISMISSING contain the 
      information to plot.   */
data Want;
set Have;
length varName $32;
array xVars[3] &DepVars;        /* response variables (inputs) */
ObsNum = _N_;                   /* if you want the original row number */
if nmiss(of xVars[*]) > 0;      /* keep only obs that have missing values */
do i = 1 to dim(xVars);
   varName = vname(xVars[i]);   /* assign the name of the original variable */
   isMissing = nmiss(xVars[i]); /* binary indicator variable */
   output;
end;
keep ObsNum &IndepVar varName isMissing;
run;
 
/* 3. Define discrete attribute map: missing values are black and nonmissing values are white. */
data AttrMap;
  retain id 'HeatMiss';
  length fillcolor $20;
  value=0; fillcolor='White'; output;  /* nonmissing = white */
  value=1; fillcolor='Black'; output;  /* missing = black */
run;
 
/* 4. Create a heat map that shows the missing values for each variable for each date. */
/*    Thanks to Dan Heath for suggesting that I use INTERVAL=QUARTER
      for major ticks and MINORINTERVAL=MONTH for minor ticks */
ods graphics / width=400px height=600px;
title "Visualize Missing Values";
proc sgplot data=Want dattrmap=AttrMap;
   heatmapparm x=varName y=&IndepVar colorgroup=isMissing / attrid=HeatMiss;
   xaxis display=(nolabel);
   yaxis display=(nolabel) type=time interval=quarter valuesformat=monyy7. 
                           minor minorinterval=month;
run;

The heat map is shown to the right. It displays dates along the vertical axis. The horizontal axis displays the three numerical variables that have missing values. You can use the graph to determine whether the pattern of missing values depends on the Date variable. In the program, I used macro variables to store the name of the X and Y variables so that the program is easier to reuse.
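
For example, to reuse the program for hypothetical variables Y1-Y4 and a hypothetical date variable VisitDate, redefine the macro variables. One caveat: the array dimension in the DATA step is hard-coded to 3, so either change it to match the number of variables or use an asterisk to infer it:

%let IndepVar = VisitDate;     /* hypothetical date variable */
%let DepVars  = y1 y2 y3 y4;   /* hypothetical response variables */
/* In the DATA step, declare the array as
      array xVars[*] &DepVars;
   so that the dimension is inferred from the variable list. */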

What does the heat map show?

  • For the X1 variable, the heat map clearly shows that the variable does not have any missing values prior to 2020, but then has quite a few in 2020. The graph is consistent with the fact that the missing values in 2020 are distributed uniformly. You can see gaps and clusters, which are common features of uniform randomness.
  • For the X2 variable, the heat map is less clear. It is possible that the missingness occurs uniformly at random for any date. You would need to run an additional analysis to discover that all the missing values occur on Sundays.
  • For the X3 variable, the heat map suggests that the missing values occur in the winter months. However, you should probably run an additional analysis to discover that all the missing values occur between November and February.

You can use PROC FREQ to perform additional analyses on the X2 and X3 variables. The following calls to PROC FREQ use the MONTHw. and WEEKDAYw. formats to format the Date variable. PROC FREQ can perform a two-way analysis to investigate whether missingness in X2 and X3 is associated with days or months, respectively. You can use a mosaic plot to show the results, as follows:

ods graphics / reset;
proc freq data=Want order=formatted;
   where varName="x2";
   format Date WEEKDAY.;
   label IsMissing = "Missing Value for X2" Date = "Day of Week";
   tables IsMissing * Date / plots=MosaicPlot;
run;
 
proc freq data=Want order=formatted;
   where varName="x3";
   format Date MONTH.;
   label IsMissing = "Missing Value for X3" Date = "Month of Year";
   tables IsMissing * Date / plots=MosaicPlot;
run;

For brevity, only the mosaic plot for the month of the year is shown. The mosaic plot shows that all the missing values for X3 occur in November through February. A similar plot shows that the missing values for X2 occur on Sunday.

Summary

This article shows how to use a heat map to plot patterns of missing values for several variables versus a date. By plotting the missing values of X1, X2, X3,..., you can assess whether the missing values depend on time. This can help you to decide whether the missing values occur completely at random (MCAR). The missing values in this article are not MCAR. However, that fact is not always evident from the missing value pattern plot, and you might need to run additional analyses on the missing values.

Although this analysis used a date as the independent variable, that is not required. You can perform similar visualizations for other choices of an independent variable.

The post Visualize dependencies of missing values appeared first on The DO Loop.

9月 142022
 

I recently showed how to represent positive integers in any base and gave examples of base 2 (binary), base 8 (octal), and base 16 (hexadecimal). One fun application is that you can use base 26 to associate a positive integer to every string of English characters. This article shows how to use base 26 to map integers to strings and vice versa. Here's a sneak peek: the string DOG (base 26) is associated with the integer 2398 (base 10).

What is base 26?

For any base b, you can express an integer as a sum of powers of b, such as
\(\sum\nolimits_{i=0}^p {c_i} b^i\)
where the \(c_i\) are integers \(0 \leq c_i < b\). From the coefficients, you can build a string that represents the number in the given base. Traditionally, we use the symbols 0, 1, ..., 9 to represent the first 10 coefficients and use letters of the English alphabet for higher coefficients. However, in base 26, it makes sense to break with tradition and use English characters for all coefficients. This is done in a natural way by associating the symbols {A, B, C, ..., Z} with the coefficient values {0, 1, 2, ..., 25}. Notice that the symbols for the coefficients in base 26 are zero-based (A=0) and not one-based (A≠1), which is different than you might have seen in other applications.
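
In SAS, a quick way to compute these zero-based values is to use the RANK function, which returns the position of a character in the ASCII collating sequence. The following sketch (my own, not part of the programs in this article) maps A→0, B→1, ..., Z→25:

data _null_;
   do c = 'C', 'A', 'T', 'D', 'O', 'G';
      value = rank(c) - rank('A');   /* A->0, B->1, ..., Z->25 */
      put c= value=;
   end;
run;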

For example, the number 2398 (base 10) can be written as the sum 3*26^2 + 14*26^1 + 6*26^0. If you use English letters to represent the coefficients, then this number equals DOG (base 26) because 3→D, 14→O, and 6→G. In a similar way, the number 1371 (base 10) can be written as 2*26^2 + 0*26^1 + 19*26^0, which equals CAT (base 26) because 2→C, 0→A, and 19→T.

Recall that for base 10 numbers, we typically do not write the numbers with leading zeros. For example, when considering three-digit numbers in base 10, we do not write the numbers 0-99. But if we use leading zeros, we can write these integers as three-digit numbers: 000, 001, 002, ..., 099. In a similar way, you can represent all three-character strings in base 26 (such as AAA, ABC, and ANT) if you include one or more leading zeros. In base 26, a "leading zero" means that the string starts with A. Unfortunately, if you include leading zeros, you lose a unique representation of the integers because A = AA = AAA, and similarly Z = AZ = AAZ. However, it is a small price to pay. To represent character strings that start with the letter A, you must allow leading zeros.

A SAS program to represent integers in base 26

It is straightforward to adapt the SAS DATA step program in my previous article to base 26. (See the previous article for an explanation of the algorithm.) In this version, I represent each integer in the range [0, 17575] in base 26 by using a three-digit string. The number 17575 (base 10) is the largest integer that can be represented by using a three-digit string because 17575 (base 10) = 25*26^2 + 25*26^1 + 25*26^0 = ZZZ (base 26).

The following statements put a few integers in the range [0, 17575] into a SAS data set:

/* Example data: Base 10 integers in the range [0, 17575] */
data Base10;
input x @@;      /* x >= 0 */
datalines;
0 25 28 17575 16197 13030 1371 341 11511 903 13030 2398
;

The following DATA step converts these integers to three-character strings in base 26.

/* For simplicity, only consider three-digit strings. The strings
   will contain 'leading zero', which means strings like
   ABC (base 26) = 28 (base 10)
   Three-digit strings correspond to integers 0 - 17575. */
%let maxCoef = 3;      /* number of characters in string that represents the number */
%let base = 26;        /* base for the representation */
%let valueList = ('A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J' 'K' 'L' 'M' 
                  'N' 'O' 'P' 'Q' 'R' 'S' 'T' 'U' 'V' 'W' 'X' 'Y' 'Z');
 
data Base26;
array values[&base] $ _temporary_ &valueList;  /* characters to use to encode values */
array c[0:%eval(&maxCoef-1)]  _temporary_;     /* integer coefficients c[0], c[1], ... */
length ID $&maxCoef;                           /* string for base b */
b = &base;                                     /* base for representation */
 
set Base10;                                    /* x is a positive integer; represent in base b */
/* compute the coefficients that represent x in Base b */
y = x;
do k = 0 to &maxCoef-1;
   c[k] = mod(y, b);                           /* remainder when y is divided by b */
   y = floor(y / b);                           /* whole part of division */
   substr(ID,&maxCoef-k,1) = values[c[k]+1];    /* represent coefficients as string */
end;
drop b y k;
run;
 
proc print data=Base26 noobs label;
   label x="Base 10" ID="Base 26";
   var x ID;
run;

After converting a few arbitrary base 10 numbers to base 26, I show a few special numbers that correspond to three-character English words. The last seven numbers in the data set are integers that, when represented in base 26, correspond to the sentence THE CAT AND RAT BIT THE DOG!

Convert from base 26 to base 10

In the previous section, I used base 10 numbers that converted to a complete English sentence: THE CAT AND RAT BIT THE DOG. Obviously, I started with the sentence that I wanted to "find," and then figured out which decimal digits would produce the sentence. In other words, I started with the base 26 representation and computed the base 10 (decimal) representation.

For compactness, I will use SAS/IML software to compute the integer value for each base 26 representation. You could use the DATA step, if you prefer. The SAS/IML language supports a little-known feature (the MATTRIB statement) that enables you to index a matrix by using a string instead of an integer. This enables you to put the decimal numbers 0-25 into an array and index them according to the corresponding base 26 characters. This feature is demonstrated in the following example:

proc iml;
n = 0:25;              /* decimal values */
L = "A":"Z";           /* base 26 symbols */
mattrib n[colname=L];  /* enable indexing n by the letters A-Z */
C = n["C"];
A = n["A"];
T = n["T"];
print C A T;
D = n["D"];
O = n["O"];
G = n["G"];
print D O G;

The output shows that an expression like n['D'] results in the numerical value of the symbol 'D'. For any base 26 string, you can use the SUBSTR function to extract each character in the string. You can use the MATTRIB trick to find the corresponding base 10 value. You can use the position of each character and the definition of base 26 to find the integer represented by each string:

/* convert base 26 strings to integers */
str = {"AAA" "AAZ" "ABC" "ZZZ" "XYZ" "THE" "CAT" "AND" "RAT" "BIT" "THE" "DOG"}`;
Decimal = j(nrow(str),1);
do j = 1 to nrow(str);      
   s = str[j];               /* get the j_th string */
   k = length(s);            /* how many characters? */
   x = 0;
   do i = 0 to k-1;          /* for each character in the string */
      c = substr(s, k-i, 1); /* extract the character at position k-i */
      x = x + n[c]*26**i;    /* use n[c] as the coefficient in the base 26 representation */
   end;
   Decimal[j] = x;           /* save the integer value for the j_th string */
end;
 
print str[L="Base 26"] Decimal;

The output shows the base 10 representation for a set of three-digit base 26 strings. These are the values that I used in the first part of this article. (NOTE: You can vectorize this program and eliminate the inner loop! I leave that as an exercise.)

Summary

This article shows a fun fact: You can use base 26 to associate an integer to every string of English characters. In base 26, the string 'CAT' is associated with the number 1371 (base 10) and the string 'DOG' is associated with the number 2398 (base 10). This article uses three-digit character strings to demonstrate the method, but the algorithms apply to character strings that contain an arbitrary number of characters.

The post Base 26: A mapping from integers to strings appeared first on The DO Loop.

8月 292022
 

A SAS programmer was trying to understand how PROC SGPLOT orders categories and segments in a stacked bar chart. As with all problems, it is often useful to start with a simpler version of the problem. After you understand the simpler situation, you can apply that understanding to the more complicated situation.

This article shows how PROC SGPLOT in SAS orders categories in a bar chart in three scenarios:

  • Order the categories alphabetically. This is the default order.
  • Order the categories by the height of the bars, which is called the frequency order. This is accomplished by using the CATEGORYORDER= option on the VBAR (or HBAR) statement.
  • Order the categories in a user-specified manner. This is accomplished by using the VALUES= option on the XAXIS statement.

After you understand these three ways to order categories for a simple bar chart, you can investigate how these orderings work for a stacked bar chart.

Three ways to order bar charts

For data, let's use vehicles in the Sashelp.Cars data set and create bar charts that visualize the number of SUVs, sports cars, wagons, and trucks. The data also contain the Origin variable, which specifies the region (Asia, USA, or Europe) in which each vehicle was manufactured:

/* create example data */
data Have;
   set sashelp.cars;
   where Type in ('SUV' 'Sports' 'Truck' 'Wagon');
   keep Type Origin;
run;

Let's visualize the number of SUVs, sports cars, wagons, and trucks. You can use PROC SGPLOT to order the categories of a bar chart in three ways: alphabetical order, ascending (or descending) order by frequency, and a user-specified order. Each bar chart shows the same data, but the order of the bars is different.

ods graphics / width=300px height=240px;
/* three ways to order categories in a bar chart */
title "Categories in Alphabetical Order";
proc sgplot data=Have;
   vbar Type;
   xaxis display=(nolabel);
run;
 
title "Categories in Frequency Order";
proc sgplot data=Have;
   vbar Type / categoryorder=respdesc;
   xaxis display=(nolabel);
run;
 
title "Categories in Arbitrary Order";
proc sgplot data=Have;
   vbar Type;
   xaxis display=(nolabel) values=('Wagon' 'Sports' 'SUV' 'Truck' );
run;

For these data, you can divide each category into subgroups by using the Origin variable. The next section shows how these plots change if you add a GROUP= variable to the VBAR statement.

Order of clustered and stacked bar charts

When you use the GROUP= option on the VBAR statement, you have an option to display the subgroups as a cluster of bars (GROUPDISPLAY=CLUSTER) or as a stacked bar chart (GROUPDISPLAY=STACK). The clustered option is easier to understand, so let's start with that. You can use the GROUP=Origin option to create clustered bar charts that display the number of cars for each category in each of three subsets: vehicles that were manufactured in 'Asia', 'USA', or 'Europe'.

When you introduce groups and different orderings of the data, you need to ensure that the colors of the groups are consistent across graphs. One way to do this is to create and use a discrete attribute map that associates each subgroup with a color: the bars for Origin='Asia' are red, the bars for Origin='USA' are blue, and the bars for Origin='Europe' are gold.

/* create a discrete attribute map to associate a color with values of the Origin variable */
data BarMap;
length ID $10 value $6 linecolor $ 9 fillcolor $ 9;
input ID $ value $ linecolor $ fillcolor $;
datalines;
BarAttr Asia   DarkGray FireBrick
BarAttr USA    DarkGray DarkBlue
BarAttr Europe DarkGray Gold
;

You can now create bar charts that consistently use these colors regardless of the order of the bars. To make the output easy to see, the following program uses the ODS LAYOUT GRIDDED statement to arrange the output in one row that contains three graphs:

/* for GROUP=Origin, examine the three ways to order categories in a bar chart */
%let method = CLUSTER;   /* use CLUSTER or STACK */
ODS LAYOUT GRIDDED columns=3 advance=table column_gutter=8px;
 
title "Categories in Alphabetical Order";
title2 "GROUPDISPLAY = &method";
proc sgplot data=Have dattrmap=BarMap;
   vbar Type / attrid=BarAttr group=Origin groupdisplay=&method;
   xaxis display=(nolabel);
run;
 
title "Categories in Frequency Order";
title2 "GROUPDISPLAY = &method";
proc sgplot data=Have dattrmap=BarMap;
   vbar Type / attrid=BarAttr categoryorder=respdesc group=Origin groupdisplay=&method;
   xaxis display=(nolabel);
run;
 
title "Categories in Arbitrary Order";
title2 "GROUPDISPLAY = &method";
proc sgplot data=Have dattrmap=BarMap;
   vbar Type / attrid=BarAttr group=Origin groupdisplay=&method;
   xaxis display=(nolabel) values=('Wagon' 'Sports' 'SUV' 'Truck' );
run;
ODS LAYOUT END;

The graphs show that the ordering method (alphabetical or frequency) also extends to the subgroups. For example:

  • In the first graph, the categories are in alphabetical order and so are the subgroups.
  • In the second graph, the categories are in frequency order and so are the subgroups. The order of the subgroups changes depending on the relative frequencies within each category. For example, for the SUV category, the subgroups are ordered as 'Asia', 'USA', and 'Europe'. However, for the Sports category, the subgroups are ordered as 'Europe', 'Asia', and 'USA' because most of the sports cars are European.
  • The third graph manually sets the order for the categories, but the subgroups are plotted in alphabetical order, which is the default. You can also choose reverse-alphabetical order by using the GROUPORDER=DESCENDING option, or you can use the ordering of the groups in the data set (GROUPORDER=DATA), as shown in the sketch after this list.
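
For example, the following sketch (a variation of the third graph) uses GROUPORDER=DATA so that the subgroups appear in the order in which their values first occur in the data set:

title "Categories in Arbitrary Order; Groups in Data Order";
title2 "GROUPDISPLAY = &method";
proc sgplot data=Have dattrmap=BarMap;
   vbar Type / attrid=BarAttr group=Origin groupdisplay=&method grouporder=data;
   xaxis display=(nolabel) values=('Wagon' 'Sports' 'SUV' 'Truck' );
run;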

Ordering categories and groups in stacked bar charts

There is a simple way to understand the order of bars and bar segments in a stacked bar chart in PROC SGPLOT. First, create a clustered bar chart, as shown in the previous section. Then, change GROUPDISPLAY=CLUSTER to GROUPDISPLAY=STACK and rerun the program. In the previous section, I wrote the code so that I only need to change one line:
%let method = STACK; /* use CLUSTER or STACK */
With that change, you can re-run the program to obtain the following graphs:

The graphs show that the order of segments in a stacked bar chart is the same as the order of bars in a clustered bar chart (from left to right). Specifically:

  • In the first stacked bar chart, the subgroups are in alphabetical order from bottom to top.
  • In the second stacked bar chart, the subgroups are in frequency order. For example, for the SUV category, the subgroups are ordered as 'Asia' (the bottom segment), 'USA' (the middle segment), and 'Europe' (the top segment). However, for the Sports category, the subgroups are ordered as 'Europe', 'Asia', and 'USA'. Be careful if you use the CATEGORYORDER=RESPDESC option: it can be confusing to view a graph in which the order of the segments differs between bars.
  • The third stacked bar chart modifies the order for the categories, but the subgroups are plotted from bottom to top in alphabetical order. This provides a viewer with consistency across the bars.

Summary

This article shows how to understand how PROC SGPLOT in SAS orders bars in a bar chart. There are essentially three ways to order bars: alphabetically, by frequency, or by specifying the order manually. When you use the GROUP= option, you get either a clustered bar chart or a stacked bar chart. The order of the subgroups is best understood by looking at the clustered bar chart. The order of bars in the clusters (from left to right) is the same as the order of the segments in a stacked bar chart (from bottom to top). Be aware that the CATEGORYORDER= option also orders the subgroups. This can be confusing to the viewer because segments in the stacked bars might "move around" depending on their relative frequencies.

The post Order the bars in a bar chart with PROC SGPLOT appeared first on The DO Loop.