
July 23, 2014

In a previous blog post, I showed how to overlay a prediction ellipse on a scatter plot in SAS by using the ELLIPSE statement in PROC SGPLOT. The ELLIPSE statement draws the ellipse by using a standard technique that assumes the sample is bivariate normal. Today's article describes the technique and shows how to use SAS/IML to compute a prediction ellipse from a 2 x 2 covariance matrix and a mean vector. This can be useful for plotting ellipses for subgroups, ellipses that correspond to robust covariance estimates, or an ellipse for a population (rather than for a sample).

SAS/IML routines for computing prediction ellipses have been around for a long time. A module called CONTOUR appeared in the version 6 (1989) documentation for SAS/IML. In version 6.12, the module was used to compare prediction ellipses for robust and classical covariance matrices. A similar module appears in Michael Friendly's 1991 book The SAS System for Statistical Graphics, and in several of his papers, including a 1990 SUGI article. Friendly continues to distribute the %ELLIPSES macro for displaying ellipses on scatter plots. The SAS/IML function in this article is similar to these earlier modules.

Prediction ellipses: The main ideas

You can compute a prediction ellipse for sample data if you provide the following information:

  • m: A vector for the center of the ellipse.
  • S: A covariance matrix. This can be a classical covariance matrix or a robust covariance matrix.
  • n: The number of nonmissing observations in the sample.
  • p: The confidence level for the prediction ellipse. Equivalently, you could specify a significance level, α, which corresponds to a 1 – α confidence level.

The implicit formula for the prediction ellipse is given in the documentation for the CORR procedure as the set of points that satisfy a quadratic equation. However, to draw the ellipse, you should parameterize the ellipse explicitly. For example, when the axes of the ellipse are aligned with the coordinate axes, the ellipse with center (c,d) and radii a and b is defined implicitly as the set of points (x,y) that satisfy the equation
      (x-c)² / a² + (y-d)² / b² = 1.
However, if you want to draw the ellipse, the parametric form is more useful:
      x(t) = c + a cos(t)
      y(t) = d + b sin(t)
for t in the interval [0, 2π].
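
The equivalence of the two forms is easy to check numerically. The following Python sketch (the center and radii are arbitrary illustrative values, not from the SAS example) samples the parametric form and verifies that every point satisfies the implicit equation:

```python
import math

# illustrative center (c, d) and radii a, b
c, d, a, b = 1.0, 2.0, 3.0, 0.5

# sample the parametric form x(t) = c + a cos(t), y(t) = d + b sin(t)
pts = [(c + a * math.cos(t), d + b * math.sin(t))
       for t in (2 * math.pi * k / 128 for k in range(128))]

# each sampled point satisfies (x-c)^2/a^2 + (y-d)^2/b^2 = 1
for x, y in pts:
    assert math.isclose((x - c)**2 / a**2 + (y - d)**2 / b**2, 1.0,
                        abs_tol=1e-12)
```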

An algorithm to draw a prediction ellipse

I can think of two ways to draw prediction ellipses. One way is to use the geometry of Mahalanobis distance. The second way, which is used by the classical SAS/IML functions, is to use ideas from principal components analysis to plot the ellipse based on the eigendecomposition of the covariance matrix:

  1. Find the eigenvalues (λ1 and λ2) and eigenvectors (e1 and e2) of the covariance matrix, S. The eigenvectors form an orthogonal basis for the coordinate system for plotting the ellipse. The first eigenvector (e1) points in the direction of the greatest variance and defines the major axis for the prediction ellipse. The second eigenvector (e2) points in the direction of the minor axis.
  2. As mentioned previously, sines and cosines parameterize an ellipse whose axes are aligned with the standard coordinate directions. It is just as simple to parameterize an ellipse in the coordinates defined by the eigenvectors:
    • The eigenvectors have unit length, so a circle is formed by the linear combination cos(t)*e1 + sin(t)*e2 for t in the interval [0, 2π].
    • The ellipse sqrt(λ1)cos(t)*e1 + sqrt(λ2)sin(t)*e2 is a "standardized ellipse" in the sense that the standard deviations of the variables are used to scale the semi-major axes.
  3. To get a prediction ellipse, scale the standardized ellipse by a factor that depends on quantiles of the F(2, n-2) distribution, the confidence level, and an adjustment factor that depends on the sample size n. The SAS/IML module contains the details.
  4. Translate the prediction ellipse by adding the vector m.
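
The steps above can be sketched outside SAS as well. The following pure-Python sketch (the function name is mine) implements steps 1-4 for a 2 x 2 covariance matrix. It relies on the closed-form quantile of the F(2, ν) distribution, F⁻¹(p) = (ν/2)((1-p)^(-2/ν) - 1), a standard identity that the post does not state:

```python
import math

def pred_ellipse_from_cov(m, S, n, conf=0.95, n_pts=128):
    """Sketch of steps 1-4 for a symmetric 2x2 covariance matrix S = [[a,b],[b,c]]."""
    a, b = S[0][0], S[0][1]
    c = S[1][1]
    # Step 1: closed-form eigendecomposition of the symmetric 2x2 matrix
    half_trace = (a + c) / 2
    r = math.hypot((a - c) / 2, b)
    lam1, lam2 = half_trace + r, half_trace - r      # lam1 >= lam2
    theta = 0.5 * math.atan2(2 * b, a - c)           # direction of e1 (major axis)
    e1 = (math.cos(theta), math.sin(theta))
    e2 = (-math.sin(theta), math.cos(theta))
    # Step 3: finite-sample adjustment times a quantile of F(2, n-2)
    nu = n - 2
    c_adj = 2 * (n - 1) / n * (n + 1) / nu
    f_quantile = nu / 2 * ((1 - conf) ** (-2 / nu) - 1)
    scale = math.sqrt(c_adj * f_quantile)
    # Steps 2 and 4: scaled standardized ellipse in eigenvector
    # coordinates, translated to the center m
    pts = []
    for k in range(n_pts):
        t = 2 * math.pi * k / n_pts
        u = scale * math.sqrt(lam1) * math.cos(t)
        v = scale * math.sqrt(lam2) * math.sin(t)
        pts.append((m[0] + u * e1[0] + v * e2[0],
                    m[1] + u * e1[1] + v * e2[1]))
    return pts
```

By construction, every returned point x satisfies the quadratic equation (x-m)' S⁻¹ (x-m) = c_adj * F⁻¹(p), so this sketch should agree with the SAS/IML module below up to floating-point error.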

A SAS/IML module for computing a prediction ellipse

The following module accepts a vector of k confidence levels. The module returns a matrix with three columns. The meaning of each column is described in the comments.

proc iml;
start PredEllipseFromCov(m, S, n, ConfLevel=0.95, nPts=128);
/* Compute prediction ellipses centered at m from the 2x2 covariance matrix S.
   The return matrix is a matrix with three columns.
   Col 1: The X coordinate of the ellipse for the confidence level.
   Col 2: The Y coordinate of the ellipse for the confidence level.
   Col 3: The confidence level. Use this column to extract a single ellipse.
   Input parameters:
   m           a 1x2 vector that specifies the center of the ellipse. 
               This can be the sample mean or median.
   S           a 2x2 symmetric positive definite matrix. This can be the 
               sample covariance matrix or a robust estimate of the covariance.
   n           The number of nonmissing observations in the data. 
   ConfLevel   a 1 x k vector of (1-alpha) confidence levels that determine the
               ellipses. Each entry is 0 < ConfLevel[i] < 1.  For example,
               0.95 produces the 95% confidence ellipse for alpha=0.05.
   nPts       the number of points used to draw the ellipse. The default is 128.
*/
   /* parameterize standard ellipse in coords of eigenvectors */
   call eigen(lambda, evec, S);   /* eigenvectors are columns of evec */
   t = 2*constant("Pi") * (0:nPts-1) / nPts;
   xy  = sqrt(lambda[1])*cos(t) // sqrt(lambda[2])*sin(t);
   stdEllipse = T( evec * xy );   /* transpose for convenience */
   /* scale the ellipse; see documentation for PROC CORR */
   c = 2 * (n-1)/n * (n+1)/(n-2);          /* adjust for finite sample size */
   p = rowvec(ConfLevel);                  /* row vector of confidence levels */
   F = sqrt(c * quantile("F", p, 2, n-2)); /* p = 1-alpha */
   ellipse = j(ncol(p)*nPts, 3);  /* 3 cols: x y p */
   startIdx = 1;                  /* starting index for next ellipse */
   do i = 1 to ncol(p);           /* scale and translate */
      idx = startIdx:startIdx+nPts-1;
      ellipse[idx, 1:2] = F[i] # stdEllipse + m; 
      ellipse[idx, 3] = p[i];     /* use confidence level as ID */
   startIdx = startIdx + nPts;
end;
return( ellipse );
finish;

Compare a classical and robust prediction ellipse

The SAS/IML documentation includes an example in which a classical prediction ellipse is compared with a robust prediction ellipse for three-dimensional data that contain outliers. The following SAS/IML statements define the classical and robust estimates of location and scatter for two of the variables. The PredEllipseFromCov function is called twice: once for the classical estimates, which are based on all 21 observations, and once for the robust estimates, which are based on 17 observations:

/* Stackloss data example from SAS/IML doc */
/* classical mean and covariance for stackloss data */
/*      Rate      AcidConcentration */
N = 21;
mean = {60.428571 86.285714}; 
cov  = {84.057143 24.571429,
        24.571429 28.714286};
classicalEllipse = PredEllipseFromCov(mean, cov, N, 0.9);
/* robust estimates of location and scatter for stackloss data */
/*         Rate      AcidConcentration */
N = 17;
robMean = {56.7      85.5}; 
robCov  = {23.5     16.1,
           16.1      32.4};
robEllipse = PredEllipseFromCov(robMean, robCov, N, 0.9);
/* write the ellipse data to a SAS data set */
E = classicalEllipse || robEllipse[,1:2];
create Ellipse from E[c={"classX" "classY" "ConfLevel" "robX" "robY"}];
append from E;
close Ellipse;

The following SAS statements merge the data and the coordinates for the prediction ellipses. The POLYGON statement in the SGPLOT procedure is used to overlay the ellipses on a scatter plot of the data. The POLYGON statement is available in SAS 9.4M1. Notice that the PredEllipseFromCov function returns a matrix with three columns. The third column (the confidence level) is used as the ID= variable for the POLYGON statement:

data Stackloss;
input Rate Temperature AcidCon @@;
datalines;
80  27  89  80  27  88  75  25  90  62  24  87  62  22  87
62  23  87  62  24  93  62  24  93  58  23  87  58  18  80
58  18  89  58  17  88  58  18  82  58  19  93  50  18  89
50  18  86  50  19  72  50  19  79  50  20  80  56  20  82
70  20  91
;
run;
data All;
set Stackloss Ellipse;
run;
title "Classical and Robust Prediction Ellipses";
proc sgplot data=All;
scatter x=Rate y=AcidCon / transparency=0.5 markerattrs=(symbol=CircleFilled size=12);
polygon x=classX y=classY id=ConfLevel / 
        lineattrs=GraphData1 name="classical" legendlabel="Classical 90% Ellipse";
polygon x=robX y=robY id=ConfLevel / 
        lineattrs=GraphData2 name="robust" legendlabel="Robust 90% Ellipse";
xaxis grid; yaxis grid; 
keylegend "classical" "robust";
run;

The classical prediction ellipse is based on all 21 observations. The robust estimation method classified four observations as outliers, so the robust ellipse is based on 17 observations. The four outliers are the markers that are outside of the robust ellipse.

Plotting prediction ellipses for subgroups

You can also use this module to overlay prediction ellipses for subgroups of the data. For example, you can compute the mean and covariance matrix for each of the three species of flower in the sashelp.iris data. You can download the complete program that computes the prediction ellipses and overlays them on a scatter plot of the data. The following graph shows the result:


In summary, by using the SAS/IML language, you can write a short function that computes prediction ellipses from four quantities: a center, a covariance matrix, the sample size, and the confidence level. You can use the function to compute prediction ellipses for classical estimates, robust estimates, and subgroups of the data. You can use the POLYGON statement in PROC SGPLOT to overlay these ellipses on a scatter plot of the data.

tags: Data Analysis, Statistical Graphics
July 21, 2014

It is common in statistical graphics to overlay a prediction ellipse on a scatter plot. This article describes two easy ways to overlay prediction ellipses on a scatter plot by using SAS software. It also describes how to overlay multiple prediction ellipses for subpopulations.

What is a prediction ellipse?

A prediction ellipse is a region for predicting the location of a new observation under the assumption that the population is bivariate normal. For example, an 80% prediction ellipse indicates a region that would contain about 80% of a new sample that is drawn from a bivariate normal population with mean vector and covariance matrix equal to the sample estimates.
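
A quick simulation makes the percentage concrete. The following Python sketch (illustrative parameters of my choosing) uses the population version of the ellipse, whose cutoff is the 2-df chi-square quantile -2 ln(1-p), rather than the finite-sample version; it draws bivariate normal data and counts the fraction that falls inside the 80% ellipse:

```python
import math, random

random.seed(1)
mu = (0.0, 0.0)                       # illustrative population parameters
Sigma = [[2.0, 0.8], [0.8, 1.0]]
p = 0.80
q = -2 * math.log(1 - p)              # chi-square(2 df) quantile at level p

# Cholesky factor of the 2x2 matrix Sigma, for sampling N(mu, Sigma)
l11 = math.sqrt(Sigma[0][0])
l21 = Sigma[1][0] / l11
l22 = math.sqrt(Sigma[1][1] - l21 ** 2)
det = Sigma[0][0] * Sigma[1][1] - Sigma[0][1] ** 2

n = 20000
inside = 0
for _ in range(n):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    dx = l11 * z1                     # x - mu[0]
    dy = l21 * z1 + l22 * z2          # y - mu[1]
    # squared Mahalanobis distance of the point from mu
    md2 = (Sigma[1][1]*dx*dx - 2*Sigma[0][1]*dx*dy + Sigma[0][0]*dy*dy) / det
    inside += md2 <= q

frac = inside / n                     # close to 0.80
```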

A prediction ellipse is helpful for detecting deviation from normality. If you overlay a prediction ellipse on your sample and the sample does not "fill" the ellipse in the way that bivariate normal data would, then you have a graphical indication that the data are not bivariate normal.

Because the center of the ellipse is the sample mean, a prediction ellipse can give a visual indication of skewness and outliers in the data. A prediction ellipse also displays linear correlation: If you standardize both variables, a skinny ellipse indicates highly correlated variables, whereas an ellipse that is nearly circular indicates little correlation.

How to draw a prediction ellipse in SAS

SAS provides two easy ways to overlay a prediction ellipse on a scatter plot. The SGPLOT procedure supports the SCATTER statement, which creates a scatter plot, and the ELLIPSE statement, which overlays a bivariate normal prediction ellipse, computed by using the sample mean and covariance. The following statements overlay a 95% prediction ellipse on 50 measurements of the width and length of petals for a particular species of flower:

title "95% Prediction Ellipse";
title2 "iris Versicolor";
proc sgplot data=sashelp.iris noautolegend;
   where Species='Versicolor';
   scatter x=PetalLength y=PetalWidth / jitter;  /* JITTER is optional    */
   ellipse x=PetalLength y=PetalWidth;           /* default is ALPHA=0.05 */
run;

The JITTER option (which requires SAS 9.4) is used to slightly displace some of the observations. Without the option, some markers overlap because the data are rounded to the nearest millimeter. By default, the ELLIPSE statement computes and displays a 95% prediction ellipse, which tends to surround most of the data. However, you can use the ALPHA= option to display a 100(1-α)% prediction ellipse. Some researchers recommend overlaying a 68% prediction ellipse (ALPHA=0.32) because 68% is the probability of observing univariate normal data that is within one standard deviation of the mean. See the article by Michael Friendly for examples and a discussion.
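
The 68% figure comes from the univariate normal distribution; you can verify it with the error function (a quick check in Python, not part of the SAS example):

```python
import math

# probability that a univariate normal value lies within one standard
# deviation of the mean: P(|Z| <= 1) = erf(1/sqrt(2))
p = math.erf(1 / math.sqrt(2))
assert round(p, 4) == 0.6827          # the "68%" behind ALPHA=0.32
```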

Because a prediction ellipse gives a graphical indication of the linear correlation between variables, you can create a prediction ellipse as part of a correlation analysis in SAS. The following call to the CORR procedure uses the PLOTS=SCATTER option to overlay a 95% prediction ellipse. The output (not shown) is similar to the previous graph.

proc corr data=sashelp.iris plots=scatter(ellipse=prediction);
   where Species='Versicolor';
   var PetalLength PetalWidth;
run;

Draw prediction ellipses for each group

Occasionally, you might want to overlay prediction ellipses for subsets of the data that correspond to subpopulations. For example, if your data contain both male and female subjects, it might be that the means and covariance for the variables are different between males and females. In that case, overlaying a prediction ellipse for each subpopulation helps you to visualize how characteristics vary between the groups.

There are several ways to draw prediction ellipses for groups within the data. Michael Friendly provides a macro that uses SAS/GRAPH to overlay ellipses. It supports a grouping variable, as shown in his 2006 paper in J. Stat. Soft. (This macro has been used in some examples by Robert Allison.)

If you want to use ODS statistical graphics to display the multiple ellipses, you need to use a little trick. Because the ELLIPSE statement in PROC SGPLOT does not support a GROUP= option as of SAS 9.4M2, you have to reshape the data so that each group becomes a new variable. This is equivalent to transposing the data from a "long form" to a "wide form." From my previous blog post, here is one way to create six variables that represent the petal length and width variables for each of the three species of iris in the sashelp.iris data set:

data Wide;
/*   |-- PetalLength --| |--- PetalWidth ---|  */
keep L_Set L_Vers L_Virg W_Set W_Vers W_Virg;  /* names of new variables */
merge sashelp.iris(where=(Species="Setosa") 
              rename=(PetalLength=L_Set PetalWidth=W_Set))
      sashelp.iris(where=(Species="Versicolor")
              rename=(PetalLength=L_Vers PetalWidth=W_Vers))
      sashelp.iris(where=(Species="Virginica")
              rename=(PetalLength=L_Virg PetalWidth=W_Virg));
run;

Now that the data are in separate variables, call PROC SGPLOT to overlay three scatter plots and three 95% prediction ellipses:

title "95% Prediction Ellipses for Each Group";
proc sgplot data=Wide;
  scatter x=L_Set  y=W_Set  / jitter name="Set"  legendlabel="Setosa";
  scatter x=L_Virg y=W_Virg / jitter name="Virg" legendlabel="Virginica";
  scatter x=L_Vers y=W_Vers / jitter name="Vers" legendlabel="Versicolor";
  ellipse x=L_Set  y=W_Set  / lineattrs=GraphData1;
  ellipse x=L_Virg y=W_Virg / lineattrs=GraphData2;
  ellipse x=L_Vers y=W_Vers / lineattrs=GraphData3;
  keylegend "Set" "Virg" "Vers" / title="Species:";
  xaxis label="Petal Length (mm)";
  yaxis label="Petal Width (mm)";
run;

Notice that I used a trick to make the color of each prediction ellipse match the color of the data to which it is associated. The colors for markers in the scatter plots are assigned automatically according to the current ODS style. The style elements used for the markers are named GraphData1, GraphData2, and GraphData3. When I create the ellipses, I use the LINEATTRS= statement to set the style attributes of the ellipse to match the corresponding attributes of the data.

The graph visualizes the relationship between the petal length and the petal width variables in these three species. At a glance, three facts are evident:

  • The means of the variables (the centers of the ellipses) are different across the groups. This is a "graphical MANOVA test."
  • The Versicolor ellipse is "thinner" than the others, which indicates that the correlation between petal length and width is greater in that species.
  • The Virginica ellipse is larger than the others, which indicates greater variance within that species.

In summary, it is easy to use the ELLIPSE statement in PROC SGPLOT to add a prediction ellipse to a scatter plot. If you want to add an ellipse for subgroups of the data, use the trick from my previous article to reshape the data. Plotting ellipses for subgroups enables you to visually inspect covariance and correlation within groups and to compare those quantities across groups.

tags: Data Analysis, Statistical Graphics
July 18, 2014

An empty matrix is a matrix that has zero rows and zero columns. At first "empty matrix" sounds like an oxymoron, but when programming in a matrix language such as SAS/IML, empty matrices arise surprisingly often.

Sometimes empty matrices occur because of a typographical error in your program. If you try to print or compute with a matrix that has not been defined, the SAS/IML program will usually display an error message:

proc iml;
print UndefinedMatrix;
            ERROR: (execution) Matrix has not been set to a value.

However, sometimes an empty matrix is the correct result of a valid computation. For example:

  • If you use the LOC function to find elements in a vector that satisfy a criterion, the empty matrix results when no elements satisfy the criterion:
    x = {2 4 6};
    y = loc(x < 0);         /* y is empty matrix */
  • If you use the XSECT function to find the intersection of two sets, the empty matrix results when the intersection is empty:
    y = xsect({1}, {2});    /* empty intersection of sets */
  • If you use the REMOVE function to remove elements of a matrix, the empty matrix results when you remove all elements:
    y = remove({2}, 1);     /* remove 1st element from 1x1 matrix */
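
For readers who know other languages, the same "empty result" situations arise everywhere; here is a loose Python analogy (Python lists and sets, not SAS/IML matrices):

```python
# analogous "empty result" situations in Python (an analogy, not SAS/IML)
x = [2, 4, 6]

y = [v for v in x if v < 0]      # like LOC: no element satisfies the criterion
assert y == []

y = set([1]) & set([2])          # like XSECT: empty intersection of sets
assert y == set()

y = [2][1:]                      # like REMOVE: drop the only element
assert y == []

# and the analogue of an ISEMPTY test
assert len(y) == 0
```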

In my book Statistical Programming with SAS/IML Software, I mention that you can determine whether a matrix is empty by using one of three functions: TYPE, NROW, or NCOL, as follows:

print (type(y))[L="type"] (nrow(y))[L="nrow"] (ncol(y))[L="ncol"];
if ncol(y)=0 then print "y is empty";
else              print "y is not empty";

The output shows that the "type" of an empty matrix is 'U' (for "undefined"). In my book I use the NCOL function to test for empty matrices, but the other functions work just as well.

New function for detecting empty matrices

Recent releases of SAS/IML software support new ways to create and detect empty matrices:

  • In SAS/IML 12.1, you can use curly braces to define an empty matrix:
    y = {};                 /* y is an empty matrix */
  • In SAS/IML 12.3, you can use the ISEMPTY function to test whether a matrix is empty:
    if IsEmpty(y) then print "y is empty";  
    else               print "y is not empty";

I think that the new syntax produces SAS/IML programs that are easier to read and understand. Have you encountered empty matrices in your SAS/IML programs? Where? How did you handle them? Leave a comment.

tags: 12.1, 12.3, Data Analysis
July 14, 2014

In my four years of blogging, the post that has generated the most comments is "How to handle negative values in log transformations." Many people have written to describe data that contain negative values and to ask for advice about how to log-transform the data.

Today I describe a transformation that is useful when a variable ranges over several orders of magnitude in both the positive and negative directions. This situation comes up when you measure the net change in some quantity over a period of time. Examples include the net profit at companies, net change in the price of stocks, net change in jobs, and net change in population.

A state-by-state look at net population change in California


Here's a typical example. The US Census Bureau tracks state-to-state migration flows. (For a great interactive visualization of internal migration, see the Governing Data Web site.) The adjacent scatter plot shows the net number of people who moved to California from some other US state in 2011, plotted against the population of the state. (Click to enlarge.) For example, 35,650 people moved to California from Arizona, whereas 49,635 moved to Arizona from California, so Arizona was responsible for a net change of –13,985 to the California population. The population of Arizona is just under 5 million, so the marker for Arizona appears in the lower left portion of the graph.

Most states account for a net change in the range [–5000, 5000] and most states have populations less than 5 million. When plotted on the scale of the data, these markers are jammed into a tiny portion of the graph. Most of the graph is empty because of states such as Texas and Illinois that have large populations and are responsible for large changes to California's population.

As discussed in my blog post about using log-scale axes to visualize variables, when a variable ranges over several orders of magnitudes, it is often effective to use a log transformation to see large and small values on the same graph. The population variable is a classic example of a variable that can be log-transformed. However, 80% of the values for the net change in population are negative, which rules out the standard log transformation for that variable.

The log-modulus transformation


A modification of the log transformation can help spread out the magnitude of the data while preserving its sign. It is called the log-modulus transformation (John and Draper, 1980). The transformation takes the logarithm of the absolute value of the variable plus 1. If the original value was negative, "put back" the sign of the data by multiplying by –1. In symbols,
L(x) = sign(x) * log(|x| + 1)

The graph of the log-modulus transformation is shown to the left. The transformation preserves zero: a value that is 0 in the original scale is also 0 in the transformed scale. The function acts like the log (base 10) function when x > 0. Notice that L(10) ≈ 1, L(100) ≈ 2, and L(1000) ≈ 3. This property makes it easy to interpret values of the transformed data in terms of the scale of the original data. Negative values are transformed similarly.
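
The transformation is a one-liner in most languages. Here is a Python sketch (the function name is mine; the Arizona value is taken from the example above):

```python
import math

def log_modulus(x):
    """L(x) = sign(x) * log10(|x| + 1)  (John and Draper, 1980)"""
    return math.copysign(math.log10(abs(x) + 1), x)

assert log_modulus(0) == 0                     # preserves zero
assert log_modulus(-5) == -log_modulus(5)      # preserves the sign of the data
assert math.isclose(log_modulus(999), 3.0)     # acts like log10 for large x

arizona = log_modulus(-13985)                  # about -4.15 on the L scale
```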

Applying the log-modulus transformation


Let's see how the log-modulus transformation helps to visualize the state-by-state net change in California's population in 2011. You can download the data for this example and the SAS program that creates the graphs. A previous article showed how to use PROC SGPLOT to display a log axis on a scatter plot, and I have also discussed how to create custom tick marks for log axes.

The scatter plot to the left shows the data after using the log-modulus transformation on the net values. The state populations have been transformed by using a standard log (base 10) transformation. The log-modulus transformation divides the data into two groups: those states that contributed to a net influx of people to California, and those states that reduced the California population. It is now easy to determine which states are in which group: The states that fed California's population were states in New England, the Rust Belt, and Alaska. It is also evident that size matters: among states that lure Californians away, the bigger states tend to attract more.

The main effect of the log-modulus transformation is to spread apart markers that are near the origin and to pull in markers that are relatively far from the origin. By using the transformation, you can visualize variables that span several orders of magnitudes in both the positive and negative directions. The resulting graph is easy to interpret if you are familiar with logarithms and powers of 10.

tags: Data Analysis, Statistical Programming
July 11, 2014

In my previous blog post, I showed how to use log axes on a scatter plot in SAS to better visualize data that range over several orders of magnitude. Because the data contained counts (some of which were zero), I used a custom transformation x → log10(x+1) to visualize the data. You can download the data and the SAS program.

I left one problem unresolved at the end of my last post: The tick marks on the axes were labeled on the log scale so that, for example, a marker positioned at the tick mark labeled '2' actually represents a value of 10² = 100. If the graph is intended for people who do not use logarithms on a regular basis, this log-scale axis might hinder them from properly interpreting the graph.

Fortunately, the SGPLOT procedure in SAS supports custom tick marks. In the XAXIS and YAXIS statements, you can use the VALUES= option to specify the location of tick marks and you can use the VALUESDISPLAY= option to specify the label for each tick mark.

Determine the tick locations

The goal is to place tick marks on the plot for the transformed data, but label those ticks by using the original untransformed counts. For example, suppose that you decide to display tick marks that correspond to the following counts: 0, 5, 10, 25, 50, 100, 250, and 500. The following DATA step computes the log-x-plus-one transformation for those values:

data TickMarks;
input Count @@;
LogCountP1 = log10(Count+1);
datalines;
0 5 10 25 50 100 250 500
;
run;

proc print noobs; run;

The numbers in the second column are the locations of the tick marks in the scatter plot of the transformed data. Put those numbers on the VALUES= option. The numbers in the first column are the corresponding labels that we want to display with those tick marks. Put those numbers (as text strings) on the VALUESDISPLAY= option, as follows:
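
As a cross-check of the arithmetic, the same tick locations can be computed in a couple of lines of Python:

```python
import math

counts = [0, 5, 10, 25, 50, 100, 250, 500]
ticks = [round(math.log10(c + 1), 2) for c in counts]
print(ticks)   # [0.0, 0.78, 1.04, 1.41, 1.71, 2.0, 2.4, 2.7]
```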

ods graphics / width=750 height=1000;
title "Custom Axes on Log Scale";
proc sgplot data=LogComments noautolegend;
   label logCommentP1="Number of Original Comments" logResponseP1="Number of Responses";
   scatter x=logCommentP1 y=logResponseP1 / datalabel=NickName;
   lineparm x=0 y=0 slope=1; 
   xaxis grid offsetmin=0.01 offsetmax=0.1
         values=(0 0.78 1.04 1.41 1.71 2.00)              /* tick locations   */
         valuesdisplay = ("0" "5" "10" "25" "50" "100");  /* labels displayed */
   yaxis grid offsetmin=0.05 offsetmax=0.1
         values=(0 0.78 1.04 1.41 1.71 2.00 2.40 2.70)
         valuesdisplay = ("0" "5" "10" "25" "50" "100" "250" "500");
run;

This plot shows the custom tick marks for the axes. The data are plotted on a log scale, but the labels on the tick marks show the original scale of the data. It is easy to estimate the number of comments and responses for each individual. For example, Robert has 25 original comments and less than 250 responses. John has less than 10 original comments and 50 responses.

Of course, you still have to be careful when reading graphs that have nonlinear axes. For one thing, you can't compare distances between points. On the plot, it looks like Tricia and Michelle are about the same distance apart as Rick and Chris, but that is not true. Tricia and Michelle differ by 25 comments, whereas Rick and Chris differ by more than 150.
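
The underlying reason is that on a log axis, equal on-screen gaps correspond to equal *ratios* of counts, not equal differences. A small Python illustration (illustrative counts, not the actual blog data):

```python
import math

# gap between two counts on the log10(x+1) axis
def gap(lo, hi):
    return math.log10(hi + 1) - math.log10(lo + 1)

# 10 -> 20 comments and 100 -> 200 comments span almost the same
# on-screen distance, because both pairs have roughly a 2:1 ratio...
assert math.isclose(gap(10, 20), gap(100, 200), abs_tol=0.03)
# ...even though the second pair differs by ten times as many comments.
```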

Automating the tick locations and labels

I have one final remark. When creating the plot, I used the DATA step to compute the locations for a selected set of tick marks, but then I entered those values by hand on the VALUES= and VALUESDISPLAY= options in PROC SGPLOT. The fancier approach is to pack information about the tick marks into SAS macro variables and use the macro variables in PROC SGPLOT. You can use the SYMPUTX routine and string concatenation routines to carry out this task. The following SAS/IML program shows how to assign macro variables in PROC IML. I will leave the analogous DATA step program as an exercise for an ambitious SAS programmer:

proc iml;
Count = {0 5 10 25 50 100 250 500};
d = ' "' + char(Count,3) + '"';
v = " " + putn(log10(Count+1), "Best6.");
call symputx("D2", rowcat(d));
call symputx("V2", rowcat(v));
%put _user_;

The values of the macro variables are displayed in the SAS Log. You can now use these macro variables in the PROC SGPLOT statements as follows:

proc sgplot data=LogComments noautolegend;
   yaxis grid offsetmin=0.05 offsetmax=0.1 values=(&V2) valuesdisplay=(&D2);
run;
tags: Data Analysis, Statistical Graphics
July 9, 2014

If you are trying to visualize numerical data that range over several magnitudes, conventional wisdom says that a log transformation of the data can often result in a better visualization. This article shows several ways to create a scatter plot with logarithmic axes in SAS and discusses some of the advantages and disadvantages of using logarithmic transformations. It also discusses a common problem: How to transform data that range over several orders of magnitudes but that also contain zeros? (Recall that the logarithm of zero is undefined!)

Let's look at an example. My colleague, Chris Hemedinger, wrote a SAS program that collects data about comments on the Web site. For each comment, he recorded the name of the commenter and whether the comment was an original comment or a response to a previous comment. For example, "This is a great article. Thanks!" is classified as a comment, whereas "You're welcome. Glad you liked it!" is classified as a response. You can download the program that creates the data and the graphs in this article. I consider only commenters who have posted more than ten comments.

The following call to the SGPLOT procedure create a scatter plot of these data:

title "Comments and Responses on";
proc sgplot data=Comments noautolegend;
   scatter x=Comment y=Response / datalabel=TruncName;
   lineparm x=0 y=0 slope=1; 
   yaxis grid offsetmin=0.05;
   xaxis grid;
run;

The scatter plot shows the number of comments and responses for 50 people. Those who have commented more than 30 times are labeled, and a line is drawn with unit slope. The line, which was drawn by using the LINEPARM statement, enables you to see who has initiated many comments and who has posted many responses. For example, Michelle and Tricia (lower right) often comment on blogs, but few of their comments are in response to others. In contrast, Sanjay and Robert are regular SAS bloggers and most of their remarks are responses to other people's comments. The big outlier is Chris, who has initiated almost 100 original comments while also posting more than 500 responses to comments on his popular blog.

The scatter plot is an excellent visualization of the data... provided that your goal is to highlight the people who post the most comments and classify them into "commenter" or "responder." The plot enables you to identify about a dozen of the 50 people in the data set. But what about the other nameless markers near the origin? You can see that there are many people who have posted between 10 and 30 comments, but the current plot makes it difficult to find out who they are. To visualize those observations (while not losing information about Chris and Michelle) requires some sort of transformation that distributes the data more uniformly within the plot.

Logarithmic transformations

The standard visualization technique to use in this situation is the logarithmic transformation of data. When you have data whose range spans several orders of magnitude, you should consider whether a log transform will enhance the visualization. A log transformation preserves the order of the observations while making outliers less extreme. (It also spreads out values that are in [0,1], but that doesn't apply for these data.) For these data, both the X and the Y variables span two orders of magnitude, so let's try a log transform on both variables.

The XAXIS and YAXIS statements in the SGPLOT procedure support the TYPE=LOG option, which specifies that an axis should use a logarithmic scale. However, if you use that option on these data, the following message is printed to the SAS Log:

NOTE: Log axis cannot support zero or negative values in the data range.
      The axis type will be changed to LINEAR.

The note informs you that some people (for example, Mike) have never responded to a comment. Other people (Bradley, label not shown) have only posted responses, but have never initiated a comment. Because the logarithm of 0 is undefined, the plot refuses to use a logarithmic scale.

If you are willing to leave out these individuals, you can use a WHERE clause to subset the data:

title "Automatic Log Transformation";
title2 "Comment>0 and Response>0";
proc sgplot data=Comments;
   where Comment > 0  &  Response > 0;
   scatter x=Comment y=Response / datalabel=NickName;
   xaxis grid type=log minor offsetmin=0.01;
   yaxis grid type=log minor offsetmin=0.05;
run;

The graph shows all individuals who have initiated and responded at least once. The log transformation has spread out the data so that it is possible to label all markers by using first names. The tick marks on the axes show counts in the original scale of the data. It is easy to see who has about 10 or about 100 responses. With a little more effort, the minor tick marks enable you to discover who has 3 or 50 responses. I used the OFFSETMIN= option to add a little extra space for the data labels.

The SGPLOT procedure does not support using the LINEPARM statement with logarithmic axes, so there is no diagonal line. However, the grid lines enable you to see at a glance that Michael, Jim, and Shelly initiate comments as often as they respond. Individuals that appear in the lower right grid square are those who initiate more than they respond. (If you really want a diagonal line, use the VECTOR statement.)

Although the log transformation has successfully spread out the data, this graph does not show the seven people who were dropped by the WHERE clause. It is undesirable to omit certain observations just because the log scale is restricted to positive counts.

The log-of-x-plus-one transformation

There is an alternative: Rather than using the automatic log scale that PROC SGPLOT provides, you can write your own data transformation. Within the DATA step, you have complete control over the transformation and you can handle zero counts in any mathematically consistent way. I have previously written about how to use a log transformation on data that contain zero or negative values. The idea is simple: instead of the standard log transformation, use the modified transformation x → log(x+1). In this transformation, the value 0 is transformed into 0. The transformed data will be spread out but will show all observations.
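As a quick numerical check, the transform and its inverse can be sketched in a few lines of Python (the function names are illustrative, not part of the SAS program):

```python
import math

def log1p10(x):
    """The log-of-x-plus-one transform, base 10: x -> log10(1 + x)."""
    return math.log10(1 + x)

def inv_log1p10(y):
    """Inverse transform: y -> 10**y - 1, which recovers the original count."""
    return 10**y - 1

counts = [0, 9, 30, 99, 500]
transformed = [log1p10(c) for c in counts]
# the value 0 is transformed into 0, so zero counts stay on the plot,
# and order is preserved while large counts are compressed
print(transformed)

# the transform is invertible, so the original counts can be recovered
recovered = [round(inv_log1p10(t)) for t in transformed]
print(recovered)
```

Note that a transformed value of 1 corresponds to a count of 9 (not 10), which is exactly the interpretation issue discussed below.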

The drawback of the "log-of-x-plus-one" transformation is that it is harder to read the values of the observations from the tick marks on the axes. For example, under the standard log transformation, a transformed value of 1 represents an individual that has 10 comments, since log(10) = 1. Under the transformation x → log(x+1), a transformed value of 1 represents an individual that has 9 comments. You can use the IMAGEMAP option on the ODS GRAPHICS statement to add tooltips to the graph, but of course that won't help someone who is trying to read a printed (static) version of the graph. Nevertheless, let's carry out this nonstandard log transformation:

data LogComments;
set Comments;
label logCommentP1  = "log10(1 + Number of Original Comments)"
      logResponseP1 = "log10(1 + Number of Responses)";
logCommentP1 = log10(1 + Comment);
logResponseP1 = log10(1 + Response);
run;
ods graphics / imagemap=ON;
title "Custom Log Transformation";
proc sgplot data=LogComments noautolegend;
   scatter x=logCommentP1 y=logResponseP1 / datalabel=NickName 
                                            tip=(Comment Response Total);
   lineparm x=0 y=0 slope=1; 
   xaxis grid offsetmin=0.01;
   yaxis grid offsetmin=0.05 offsetmax=0.05;
run;

This graph is pretty good: the observations are spread out and all the data are displayed. You can easily show the diagonal reference line by using the LINEPARM statement. A disadvantage of this plot is that it is harder to determine the original counts for the individuals, in part because the tick marks on the axes are displayed on the log scale. Although many analytical professionals will have no problem recalling that the value 2.0 corresponds to a count of 10² = 100, the labels on the axes might confuse those who are less facile with logarithms. In my next blog post I will show how to customize the tick marks to show counts on the original scale of the data.

The point of this article is that the log transformation can help you to visualize data that span several orders of magnitude. However, the log function is defined only for positive data, which means that it is more complicated to create and interpret a log transformation on non-positive data.

Do you have suggestions for how to visualize these data? Download the data and let me know what you come up with.

tags: Data Analysis, Statistical Graphics
July 7, 2014

A few years ago I blogged about how to expand a data set by using a frequency variable. The DATA step in the article was simple, but the SAS/IML function was somewhat complicated and used a DO loop to expand the data. (Although a reader later showed how to avoid the DO loop.) Consequently, I am happy that the REPEAT function in SAS/IML 12.3 (which shipped with SAS 9.4) supports expanding data by frequency values. Goodbye, complicated function!

To make the situation clear, here is an example from my earlier blog post:

proc iml;
values={A B C D E};         /* categories */
freq = {2 1 3 0 4};         /* nonnegative frequencies */

The vector values contains five unique categories. The vector freq is the same size and contains the frequency of each category. The goal is to expand the categories by the frequencies to create a new vector that has 10 (sum(freq)) elements. The new vector should contain two copies of 'A', one copy of 'B', three copies of 'C', no copies of 'D', and four copies of 'E'. The new syntax for the REPEAT function makes this easy:

y = repeat(values, freq);    /* SAS/IML 12.3 */
print y;

The REPEAT function does not support negative or missing values, but you can use the LOC function to remove those invalid values:

values={A B C D E  F G};
freq = {2 1 3 0 4 -2 .}; 
posIdx = loc(freq>0);         /* select positive frequency values */
y = repeat(values[posIdx], freq[posIdx]);
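The same expand-by-frequency logic, including the removal of invalid frequencies, can be sketched with NumPy's repeat function; this is an analogy for readers of other matrix languages, not the SAS/IML call:

```python
import numpy as np

values = np.array(["A", "B", "C", "D", "E", "F", "G"])
freq   = np.array([2, 1, 3, 0, 4, -2, np.nan])

# keep only strictly positive, nonmissing frequencies,
# analogous to LOC(freq>0); a NaN comparison is False, so NaN is dropped too
pos = freq > 0
y = np.repeat(values[pos], freq[pos].astype(int))
print(y)   # ['A' 'A' 'B' 'C' 'C' 'C' 'E' 'E' 'E' 'E']
```

As in the SAS/IML version, the category with frequency 0 ('D') and the invalid frequencies ('F', 'G') do not appear in the expanded vector.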
tags: 12.3, Data Analysis
June 30, 2014

Last week I showed how to use the SUBMIT and ENDSUBMIT statements in the SAS/IML language to call the SGPLOT procedure to create ODS graphs of data that are in SAS/IML vectors and matrices. I also showed how to create a SAS/IML module that hides the details and enables you to create a plot by using a single statement.

Wouldn't it be great if someone used these ideas to write SAS/IML modules that enable you to create frequently used statistical graphics like bar charts, histograms, and scatter plots?

Someone did! SAS/IML 12.3 (which shipped with SAS 9.4) includes five modules that enable you to create basic statistical graphs. Each module is named for the corresponding SGPLOT statement, as follows:

  • The BAR subroutine creates a bar chart of a categorical variable. If you specify a grouping variable, you can create stacked or clustered bar charts. You can control the ordering of the categories in the bar chart.
  • The BOX subroutine creates a box plot of a continuous variable or multiple box plots when you also specify a categorical variable. You can control the ordering of the categories in the box plot.
  • The HISTOGRAM subroutine creates a histogram of a continuous variable. You can overlay a normal density or kernel density estimate. You can control the arrangement of bins in the histogram.
  • The SCATTER subroutine creates a scatter plot of two continuous variables. You can color markers according to a third grouping variable, and you can specify labels for markers.
  • The SERIES subroutine creates a series plot, which is sometimes called a line plot. You can specify a grouping variable to overlay multiple line plots.

You can watch a seven-minute video that shows examples of statistical graphs that you can create by using these subroutines. The SAS/IML documentation also contains a new chapter on these statistical graphs.

Examples of creating statistical graphs

Because the video does not enable you to copy and paste, the rest of this article shows SAS/IML statements that create some of the graphs in the video. You can modify these statements to display new data, add or delete options, and so forth.

The video examples use data from the Sashelp.Cars data set, which contains information about 428 cars, trucks, and SUVs. The following statements read the data and create five SAS/IML vectors: three categorical variables (Model, Type, and Origin) and two continuous variables (MPG_City and Weight):

proc iml;
use Sashelp.Cars where(type ? {"SUV" "Truck" "Sedan"});
read all var {Model Type Origin MPG_City Weight};
close Sashelp.Cars;

The following statement creates a clustered bar chart of the Origin variable, grouped by the Type variable:

title "Clustered Bar Chart";
call Bar(Origin) group=Type groupopt="Cluster" grid="Y";

The graph shows the number of vehicles in the data that are manufactured in Asia, Europe, and the US. The graph shows that there are no European trucks in the data and that each region produces more than 75 sedans.

You can create a box plot that shows the relative fuel economy for vehicles that are manufactured in each of the three regions:

title "Box Plot with Data Labels";
call Box(MPG_City) category=Origin option="spread" datalabel=putc(Model,"$10.");

The box plots show that the average and median fuel economy of Asian vehicles is greater than for European and US-built vehicles. Outliers are labeled by using the first 10 characters of the Model variable. Outliers that would otherwise overlap are spread horizontally.

You can use a histogram to visualize the distribution of the MPG_City variable. The following statement creates a histogram and overlays a normal and kernel density estimate:

title "Histogram with Density Curves";
call Histogram(MPG_City) density={"Normal" "Kernel"} rebin={0 5};

The normal curve does not fit the data well. The kernel density estimate shows three peaks because many vehicles get about 20 mpg, about 25 mpg, and about 30 mpg.

You can create a scatter plot that shows the relationship between fuel economy and the weight of a vehicle. The following statement creates a scatter plot and colors each marker according to whether it represents a sedan, truck, or SUV.

title "Scatter Plot with Groups";
run Scatter(Weight, MPG_City) group=Type;

The scatter plot shows that lighter vehicles tend to have better fuel economy. Sedans, which tend to be relatively light, often have better fuel economy than trucks and SUVs.

The Sashelp.Cars data set is not appropriate for demonstrating a series plot. The following statements evaluate the standard normal probability density function on the range [-5, 5] and call the SERIES subroutine to visualize the density function. Two vertical reference lines are overlaid on the plot.

x = do(-5, 5, 0.1);
Density = pdf("Normal", x, 0, 1);
title "Series Plot with Reference Lines";
call Series(x, Density) other="refline -2 2 / axis=x" grid={X Y};

This article has briefly described five SAS/IML subroutines in SAS/IML 12.3 that enable you to quickly create ODS statistical graphs from data in vectors or matrices. The routines are not intended to be a complete interface to the SGPLOT procedure. Rather, the routines expose common options for frequently used statistical graphs. If you want to create more complicated graphs that use more esoteric options, you can use the techniques from my previous post to write your own SAS/IML modules that create the exact graph that you need.

tags: 12.3, Data Analysis, Statistical Graphics
June 23, 2014

As you develop a program in the SAS/IML language, it is often useful to create graphs to visualize intermediate results. I do this all the time in my preferred development environment, which is SAS/IML Studio. In SAS/IML Studio, you can write a single statement to create a scatter plot, bar chart, histogram, and other standard statistical graphics. (For an overview of IMLPlus graphics, see the documentation chapter "Creating Dynamically Linked Graphs." For a longer discussion of IMLPlus programming, see Chapter 7 of Statistical Programming with SAS/IML Software.)

If you are writing a SAS/IML program by using a programming environment such as the SAS Windowing environment or SAS Enterprise Guide, then the IMLPlus graphics are not available. However, recall that you can call SAS procedures from the SAS/IML language. In particular, you can generate ODS graphics without leaving PROC IML by using the SUBMIT and ENDSUBMIT statements to call the SGPLOT procedure.

To give an example, suppose that you recently read my blog post on the new RANDFUN function and you decide to try it out. You call the RANDFUN function to simulate 1,000 observations from a normal distribution with mean 10 and standard deviation 2. Because you have never used that function before, you might want to create a histogram of the simulated data to make sure that it appears to be correct. No problem: just use the CREATE statement to create a SAS data set from a SAS/IML matrix, then use the SUBMIT/ENDSUBMIT statements to call PROC SGPLOT:

proc iml;
call randseed(1);
x = randfun(1000, "Normal", 10, 2);          /* X ~ N(10,2) */
/* graph the simulated data */
create Normal var {x}; append; close Normal; /* write data */
submit;
  proc sgplot data=Normal;
    histogram x;                             /* histogram */
    density x / type=normal(mu=10 sigma=2);  /* overlay normal */
  run;
endsubmit;

The graph overlays a normal density curve on a histogram of the data. The graph indicates that the data are normally distributed with the specified parameters.

This technique is very powerful: from within your SAS/IML program you can create any graph by calling PROC SGPLOT or some other SG procedure!

In fact, you can do even more. You can write a module that encapsulates the creation of the data set and the call to PROC SGPLOT. The module can support optional parameters that invoke various options for PROC SGPLOT. For example, when called with one argument the following module displays a histogram. If you specify the optional second parameter, you can overlay a kernel density estimate on the histogram, as shown below:

/* call PROC SGPLOT to create a histogram; optionally overlay a KDE */
start MakeHistogram(x, addKernel=0);
   create _Temp var {x}; append; close _Temp; /* write data */
   densityStmt = choose(addKernel=0,  " ",  "density x / type=kernel");
   submit densityStmt;
     proc sgplot data=_Temp;
       histogram x;      /* create histogram */ 
       &densityStmt;     /* optionally add kernel density */
     run;
   endsubmit;
   call delete("_Temp"); /* delete temp data set */
finish;

run MakeHistogram(x);              /* just create a histogram         */
run MakeHistogram(x) addKernel=1;  /* overlay kernel density estimate */

The histogram with the kernel density estimate is shown above. The module uses an optional parameter with a default value to determine whether to overlay the kernel density estimate. Notice that the variable densityStmt is a character matrix that is either blank or contains a valid DENSITY statement for the SGPLOT procedure. The densityStmt variable is listed on the SUBMIT statement, which is the way to pass a string value to a SAS procedure. Inside the SUBMIT block, the token &densityStmt resolves to either a blank string or to a valid DENSITY statement. Notice that the temporary data set (_Temp) is created and deleted within the module.

This article shows how you can write a module to create any graph of data in SAS/IML vectors by calling the SGPLOT procedure. The possibilities are endless. Do you have ideas about how you might use this technique in your own work? Leave a comment.

tags: 12.1, 12.3, Data Analysis, Statistical Graphics
June 18, 2014

A colleague asked me an interesting question:

I have a journal article that includes sample quantiles for a variable. Given a new data value, I want to approximate its quantile. I also want to simulate data from the distribution of the published data. Is that possible?

This situation is common. You want to run an algorithm on published data, but you don't have the data. Even if you attempt to contact the original investigators, the data are not always available. The investigators might be dead or retired, or the data are confidential or proprietary. If you cannot obtain the original data, one alternative is to simulate data based on the published descriptive statistics.

Simulating data based on descriptive statistics often requires making some assumptions about the distribution of the published data. My colleague had access to quantile data, such as might be produced by the UNIVARIATE procedure in SAS. To focus the discussion, let's look at an example. The table to the left shows some sample quantiles of the total cholesterol in 35 million men aged 40-59 in the years 1999–2002. The data were published as part of the NHANES study.

With 35 million observations, the empirical cumulative distribution function (ECDF) is a good approximation to the distribution of cholesterol in the entire male middle-aged population in the US. The published quantiles only give the ECDF at 11 points, but if you graph these quantiles and connect them with straight lines, you obtain an approximation to the empirical cumulative distribution function, as shown below:


My colleague's first question is answered by looking at the plot. If a man's total cholesterol is 200, you can approximate the quantile by using linear interpolation between the known values. For 200, the quantile is about 0.41. If you want a more exact computation, you can use a linear interpolation program.
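As a concrete illustration, here is a minimal Python sketch of that linear interpolation. The quantile table below is hypothetical, except that a cholesterol value of 200 is constrained to map near 0.41, as stated above:

```python
import numpy as np

# hypothetical (value, cumulative probability) pairs from a published
# quantile table; only the neighborhood of x=200 matches the article
chol  = np.array([155,  180,  200,  220,  245 ])
probs = np.array([0.05, 0.25, 0.41, 0.60, 0.85])

# linear interpolation between known quantiles approximates the ECDF,
# so the quantile of a new value is read off the interpolated curve
q = np.interp(210, chol, probs)
print(q)   # halfway between 0.41 and 0.60, so about 0.505
```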

The same graph also suggests a way to simulate data based on published quantiles of the data. The graph shows an approximate ECDF by using interpolation to "connect the dots" of the quantiles. You can use the inverse CDF method for simulation to simulate data from this approximate ECDF. Here are the main steps for the simulation:

  • Use interpolation (linear, cubic, or spline) to model the ECDF. For a large sample of a continuous variable, the approximate ECDF will be a monotonic increasing function from the possible range of the variable X onto the interval [0,1].
  • Generate N random uniform values in [0,1]. Call these values u1, u2, ..., uN.
  • Because the approximate ECDF is monotonic, for each value ui in [0,1] there is a unique value of x (call it xi) that is mapped onto ui. For cubic interpolation or spline interpolation, you need to use a root-finding algorithm to obtain each xi. For linear interpolation, the inverse mapping is also piecewise linear, so you can use a linear interpolation program to solve directly for xi.
  • The N values x1, x2, ..., xN are distributed according to the approximate ECDF.
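Assuming linear interpolation, the steps above can be sketched in Python; the quantile table is a hypothetical placeholder for a published table, not the NHANES values:

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical published quantiles: (cumulative probability, data value) pairs;
# the endpoints 0 and 1 supply the minimum and maximum of the sample
probs = np.array([0.0, 0.05, 0.25, 0.50, 0.75, 0.95, 1.0])
xvals = np.array([120, 155,  180,  205,  230,  265,  320])

# Step 2: generate N random uniform values in [0,1]
N = 10000
u = rng.uniform(size=N)

# Step 3: for linear interpolation the inverse mapping is itself piecewise
# linear, so interpolating from probs to xvals solves for x directly,
# with no root-finding needed
x = np.interp(u, probs, xvals)

# Step 4: x is distributed according to the approximate ECDF; its sample
# quantiles should be close to the published ones (except in the tails)
print(np.quantile(x, [0.25, 0.50, 0.75]))
```

Note that the simulated values never exceed the range [120, 320] of the (hypothetical) sample, which is the range limitation discussed below.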

The following graph shows the distribution of 10,000 simulated values from the piecewise linear approximation of the ECDF. The following table compares the quantiles of the simulated data (the SimEst column) to the quantiles of the NHANES data. The quantiles are similar except in the tails. This is typical behavior because the extreme quantiles are more variable than the inner quantiles. You can download the SAS/IML program that simulates the data and that creates all the images in this article.



A few comments on this method:

  • The more quantiles there are, the better the ECDF is approximated by the function that interpolates the quantiles.
  • Extreme quantiles are important for capturing the tail behavior.
  • The simulated data will never exceed the range of the original data. In particular, the minimum and maximum value of the sample determine the range of the simulated data.
  • Sometimes the minimum and maximum values of the data are not published. For example, the table might begin with the 0.05 quantile and end with the 0.95 quantile. In this case, you have three choices:
    • Assign missing values to the 10% of the simulated values that fall outside the published range. The simulated data will be from a truncated distribution of the data. The simulated quantiles will be biased.
    • Use interpolation to estimate the minimum and maximum values of the data. For example, you could use the 0.90 and 0.95 quantiles to extrapolate a value for the maximum data value.
    • Use domain-specific knowledge to supply reasonable values for the minimum and maximum data values. For example, if the data are positive, you could use 0 as a lower bound for the minimum data value.

Have you ever had to simulate data from a published table of summary statistics? What technique did you use? Leave a comment.

tags: Data Analysis, Sampling and Simulation, Statistical Programming