
Aug 27, 2014
 

Last Monday I discussed how to choose the bin width and location for a histogram in SAS. The height of each histogram bar shows the number of observations in each bin. Although my recent article didn't mention it, you can also use the IML procedure to count the number of observations in each bin. The BIN function assigns each observation to a bin, and the TABULATE subroutine computes the count for each bin. I have previously blogged about how to use this technique to count the observations in bins of equal or unequal width.
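For example, here is a minimal sketch of the one-dimensional technique; the data vector and the bin endpoints are hypothetical:

proc iml;
x = {0.5, 1.2, 2.2, 3.1, 3.5, 4.8};   /* hypothetical data vector */
cutpts = do(0, 5, 1);                 /* bin endpoints 0, 1, ..., 5 */
b = bin(x, cutpts);                   /* bin number for each observation */
call tabulate(binID, counts, b);      /* counts for each nonempty bin */
labels = char(binID);                 /* bin numbers as column headers */
print counts[colname=labels];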


You can also count the number of observations in two-dimensional bins. Specifically, you can divide the X direction into kx bins, divide the Y direction into ky bins, and count the number of observations in each of the kx × ky rectangles. In a previous article, I described how to bin 2-D data by using the SAS/IML language and produced the scatter plot at the left. The graph displays the birth weight of 50,000 babies versus the relative weight gain of their mothers for data in the Sashelp.bweight data set.

I recently remembered that you can also count observations in 2-D bins by using the KDE procedure. The BIVAR statement in the KDE procedure supports an OUT= option, which writes a SAS data set that contains the counts in each bin. You can specify the number of bins in each direction by using the NGRID= option on the BIVAR statement, as shown by the following statements:

ods select none;
proc kde data=sashelp.bweight;
   bivar MomWtGain(ngrid=11) Weight(ngrid=7) / out=KDEOut;
run;
ods select all;
 
proc print data=KDEOut(obs=10 drop=density);
run;

As shown in the output, the VALUE1 and VALUE2 variables contain the coordinates of the center of each bin. The COUNT variable contains the bin count. You can read the counts into SAS/IML and print them as a matrix. In the data set, the Y variable changes the fastest as you go down the rows. Consequently, you need to use the SHAPECOL function rather than the SHAPE function to form the 7 x 11 matrix of counts:

proc iml;
kX = 11; kY = 7;
use KDEOut; read all var {Value1 Value2 Count}; close;
M = shapecol(Count, kY, kX);       /* reshape into kY x kX matrix */
/* create labels for cells in X and Y directions */
idx = do(1,nrow(Value1),kY);        /* location of X labels */
labelX = putn(Value1[idx], "5.1");
labelY = putn(Value2[1:kY], "5.0"); 
print M[c=labelX r=labelY];
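As an aside, a small example clarifies the difference between the two functions: the SHAPE function fills a matrix in row-major order, whereas the SHAPECOL function fills down the columns. Continuing the PROC IML session:

v = 1:6;
A = shape(v, 2, 3);      /* fill along rows:   {1 2 3, 4 5 6} */
B = shapecol(v, 2, 3);   /* fill down columns: {1 3 5, 2 4 6} */
print A, B;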

As I explained in Monday's blog post, there are two standard ways to choose bins for a histogram: the "midpoints" method and the "endpoints" method. If you compare the bin counts from PROC KDE to the bin counts from my SAS/IML program, you will see small differences. That is because the KDE procedure uses the "midpoints" algorithm for subdividing the data range, whereas I used the "endpoints" algorithm for my SAS/IML program.

Last week I showed how to use heat maps to visualize matrices in SAS/IML. This matrix of counts is begging to be visualized with a heat map. Because the counts vary across several orders of magnitude (from 10⁰ to more than 10⁴), a linear color ramp will not be an effective way to visualize the raw counts. Instead, transform the counts to a log scale and create a heat map of the log-counts:

call HeatmapCont(log10(1+M)) 
                 xvalues=labelX yvalues=labelY
                 colorramp="ThreeColor" legendtitle="Log(Count)"
                 title="Counts in Each Bin";

The heat map shows the log-count of each bin. If you prefer a light-to-dark heatmap for the density, use the "TwoColor" option for the COLORRAMP= option. The heat map is a great way to see the distribution of counts at a glance. It enables you to see the approximate values of most cells, and you can easily determine cells where there are many or few observations. Of course, if you want to know the exact value of the count in each rectangular cell, look at the tabular output.

tags: Data Analysis
Aug 25, 2014
 

When you create a histogram with statistical software, the software uses the data (including the sample size) to automatically choose the width and location of the histogram bins. The resulting histogram is an attempt to balance statistical considerations, such as estimating the underlying density, and "human considerations," such as choosing "round numbers" for the location and width of bins. Common "round" bin widths include 1, 2, 2.5, and 5, as well as these numbers multiplied by a power of 10.

The default bin width and locations tend to work well for 95% of the data that I plot, but sometimes I decide to override the default choices. This article describes how to set the width and location of bins in histograms that are created by the UNIVARIATE and SGPLOT procedures in SAS.

Why override the default bin locations?

The most common reason to override the default bin locations is because the data have special properties. For example, sometimes the data are measured in units for which the common "round numbers" are not optimal:

  • For a histogram of time measured in minutes, a bin width of 60 is a better choice than a width of 50. Bin widths of 15 and 30 are also useful.
  • For a histogram of time measured in hours, 6, 12, and 24 are good bin widths.
  • For days, a bin width of 7 is a good choice.
  • For a histogram of age (or other values that are rounded to integers), the bins should align with integers.

You might also want to override the default bin locations when the data are bounded. If you are plotting a positive quantity, you might want to force the histogram to use 0 as the leftmost endpoint. If you are plotting percentages, you might want to force the histogram to choose 100 as the rightmost endpoint.

To illustrate these situations, let's manufacture some data with special properties. The following DATA step creates two variables. The T variable represents time measured in minutes. The program generates times that are normally distributed with a mean of 120 minutes, then rounds these times to the nearest five-minute mark. The U variable represents a proportion between 0 and 1; it is uniformly distributed and rounded to two decimal places.

data Hist(drop=i);
label T = "Time (minutes)" U = "Uniform";
call streaminit(1);
do i = 1 to 100;
   T = rand("Normal", 120, 30); /* normal with mean 120  */
   T = round(T, 5);             /* round to nearest five-minute mark */
   U = rand("Uniform");         /* uniform on (0,1) */
   U = floor(100*U) / 100;      /* round down to nearest 0.01 */
   output;
end;
run;

How do we control the location of histogram bins in SAS? Read on!

Custom bins with PROC UNIVARIATE: An example of a time variable

I create histograms with PROC UNIVARIATE when I am interested in also computing descriptive statistics such as means and quantiles, or when I want to fit a parametric distribution to the data. The following statements create the default histogram for the time variable, T:

title "Time Data (N=100)";
ods select histogram(PERSIST);  /* show ONLY the histogram until further notice */
proc univariate data=Hist;
histogram T / odstitle=title odstitle2="Default Bins";
run;

The default bin width is 20 minutes, which is not horrible, but not as convenient as 15 or 30 minutes. The first bin is centered at 70 minutes; a better choice would be 60 minutes.

The HISTOGRAM statement in PROC UNIVARIATE supports two options for specifying the locations of bins. The ENDPOINTS= option specifies the endpoints of the bins; the MIDPOINTS= option specifies the midpoints of the bins. The following statements use these options to create two customized histograms for which the bin widths are 30 minutes:

proc univariate data=Hist;
histogram T / midpoints=(60 to 210 by 30)  odstitle=title odstitle2="Midpoints";
run;
 
proc univariate data=Hist;
histogram T / endpoints=(60 to 210 by 30)  odstitle=title odstitle2="Endpoints";
run;

The histogram on the left has bins that are centered at 30-minute intervals. This histogram makes it easy to estimate that about 40 observations have times of approximately 120 minutes. The counts for other half-hour increments are similarly easy to estimate. In contrast, the histogram on the right has bins whose endpoints are 60, 90, 120,... minutes. With this histogram, it is easy to see that about 35 observations have times that are between 90 and 120 minutes. Similarly, you can estimate the number of observations that are greater than three hours or less than 90 minutes.

Both histograms are equally correct. The one you choose should depend on the questions that you want to ask about the data. Use midpoints if you want to know how many observations have values near a particular value; use endpoints if you want to know how many observations lie between two values.

If you run the SAS statements that create the histogram on the right, you will see the warning message
WARNING: The ENDPOINTS= list was extended to accommodate the data.
This message informs you that you specified the last endpoint as 210, but that additional bins were created to display all of the data.

Custom bins for a bounded variable

As mentioned earlier, if you know that values are constrained within some interval, you might want to choose histogram bins that incorporate that knowledge. The U variable has values that are in the interval [0,1), but of course PROC UNIVARIATE does not know that. The following statements create a histogram of the U variable with the default bin locations:

title "Bounded Data (N=100)";
proc univariate data=Hist;
histogram U / odstitle=title odstitle2="Default Bins";
run;

The default histogram shows seven bins with a bin width of 0.15. From a statistical point of view, this is an adequate histogram. The histogram indicates that the data are uniformly distributed and, although it is not obvious, the left endpoint of the first bin is at 0. However, from a "human readable" perspective, this histogram can be improved. The following statements use the MIDPOINTS= and ENDPOINTS= options to create histograms that have bin widths of 0.2 units:

proc univariate data=Hist;
histogram U / midpoints=(0 to 1 by 0.2)  odstitle=title odstitle2="Midpoints";
run;
 
proc univariate data=Hist;
histogram U / endpoints=(0 to 1 by 0.2)  odstitle=title odstitle2="Endpoints";
run;
ods select all;  /* turn off PERSIST; restore normal output */

The histogram on the left is not optimal for these data. Because we created uniformly distributed data in [0,1], we know that the expected count in the leftmost bin (which is centered at 0) is half the expected count of an inner bin. Similarly, the expected count in the rightmost bin (which is centered at 1) is half the count of an inner bin because no value can exceed 1. Consequently, this choice of midpoints is not very good. For these data, the histogram on the right is better at revealing that the data are uniformly distributed and are within the interval [0,1).

Custom bins with PROC SGPLOT

If you do not need the statistical power of the UNIVARIATE procedure, you might choose to create histograms with PROC SGPLOT. The SGPLOT procedure supports the BINWIDTH= and BINSTART= options on the HISTOGRAM statement. The BINWIDTH= option specifies the width for the bins. The BINSTART= option specifies the center of the first bin.

I recommend that you specify both the BINWIDTH= and BINSTART= options, and that you choose the bin width first. Be aware that not all specifications result in a valid histogram. If you make a mistake when specifying the bins, you might get the following warning:
WARNING: The specified BINWIDTH= value will be ignored in order to accommodate the data.
That message usually means that the minimum value of the data was not contained in a bin. For a bin width of h, the BINSTART= value must be less than xmin + h/2, where xmin is the minimum value of the data.

By default, the axis does not show a tick mark for every bin, but you can force that behavior by using the SHOWBINS option. The following statements call the SGPLOT procedure to create histograms for the time-like variable, T. The results are again similar to the custom histograms that are shown in the previous section:

title "Time Data (N=100)";
title2 "Midpoints";
proc sgplot data=Hist;
histogram T / binwidth=15 binstart=60 showbins; /* center first bin at 60 */
run;
 
title2 "Endpoints";
proc sgplot data=Hist;
histogram T / binwidth=15 binstart=52.5;  /* 52.5 = 45 + h/2 */
xaxis values=(45 to 180 by 15);           /* alternative way to place ticks */
run;

The following statements call the SGPLOT procedure to create histograms for the bounded variable, U. The results are similar to those created by the UNIVARIATE procedure:

title "Bounded Data (N=100)";
title2 "Midpoints";
proc sgplot data=Hist;
histogram U / binwidth=0.2 binstart=0 showbins;  /* center first bin at 0 */
run;
 
title2 "Endpoints";
proc sgplot data=Hist;
histogram U / binwidth=0.2 binstart=0.1;      /* center first bin at h/2 */
xaxis values=(0 to 1 by 0.2);       
run;

In summary, for most data the default bin width and location result in a histogram that is both statistically useful and easy to read. However, the default choices can lead to a less-than-optimal visualization if the data have special properties, such as being time intervals or being bounded. In those cases, it makes sense to choose a bin width and a location for the first bin that reveal your data's special properties. For the UNIVARIATE procedure, use the MIDPOINTS= or ENDPOINTS= options on the HISTOGRAM statement. For the SGPLOT procedure, use the BINWIDTH= and BINSTART= options to create a histogram with custom bins.

tags: Data Analysis, Getting Started
Aug 22, 2014
 

My wife got one of those electronic activity trackers a few months ago and has been diligently walking every day since then. At the end of the day she sometimes reads off how many steps she walked, as measured by her activity tracker. I am always impressed at how many steps she racks up during the day through a combination of regular activity and scheduled walks.

There has been a lot written about the accuracy of electronic activity trackers. Personally, I don't worry about accuracy as long as the errors are in the 10% range. The purpose of these tools is to give people an idea of how much they are moving and to encourage them to get off the couch. Whether someone walked 4.2 miles or 3.8 isn't as important as the fact that she walked about 4 miles.

Because my wife's daily numbers seem so impressive, I decided to download some data from other people who use the same device. The device can be linked to a person's Twitter account and programmed to tweet a summary of the person's activity level each day. The Tweets all have a common hashtag and format, so it is easy to download a few hundred tweets and prepare them for data analysis. In an act of statistical voyeurism, I present a descriptive analysis of the activity level of 231 random people who use a particular brand of activity tracker, as reported by their tracker for Sunday, August 17, 2014. You can download the SAS program that analyzes these data.

Distribution of steps

The tracker records steps. The histogram at the left shows the distribution of the number of steps taken by the 231 subjects. (Click to enlarge.) A kernel density estimate is overlaid, and various quantiles are displayed in the inset. The histogram shows that about 25% of the people in the sample walked fewer than 4,000 steps, and about half of the people walked fewer than 7,100 steps. About a quarter of the people walked more than 11,600 steps, and the I-am-very-active award goes to those people who walked more than 18,000 steps; they are above the 95th percentile of this sample.

The tail of the distribution falls off rapidly, which means that there is a low probability of finding someone who walks more than 30,000 steps per day. In other words, this is not a fat-tailed distribution. On the contrary, a person in the upper percentiles is working her tail off!

How many steps does it take to walk a mile?


The tracker also reports distance in miles. The basic device uses an algorithm to convert steps into an approximate distance, so, as they say, your distance may vary (nyuck-nyuck!). The device does not know your exact stride length.

The scatter plot to the left shows the relationship between distance and steps. For people who take many steps, there is substantial variation in the reported distance. Probably some of those people were running or jogging, which changes the length of the stride. The line indicates predicted distance for a given number of steps, as calculated by a robust regression algorithm.
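The downloadable program has the details of the fit. If you want to experiment, a minimal sketch of a robust fit might look like the following, where the data set name (Tracker) and the variable names (Steps and Distance) are hypothetical placeholders:

/* sketch of a robust linear fit; the names are hypothetical */
proc robustreg data=Tracker method=LTS;
   model Distance = Steps;
run;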

These data can help to answer the question, "How many steps does it take to walk a mile?" Based on these data, it takes an average person 2,272 steps to walk a mile. Of course, shorter people will require more steps and taller people fewer, and there is the whole debate about how accurate these trackers are at estimating distance. Nevertheless, 2,272 steps is a good estimate for a mile. For a simpler number, you can estimate that 10,000 steps is about four miles.

These data also enable you to estimate the length of the average stride: 5,280 feet per mile divided by 2,272 steps gives 2.32 feet, or about 71 centimeters per step.

What do you think of these data? If you use a fitness tracking device, how many steps do you take each day? Leave a comment.

tags: Data Analysis
Aug 20, 2014
 

In a previous blog post, I showed how to use the graph template language (GTL) in SAS to create heat maps with a continuous color ramp. SAS/IML 13.1 includes the HEATMAPCONT subroutine, which makes it easy to create heat maps with continuous color ramps from SAS/IML matrices. Typical usage includes creating heat maps of data matrices, correlation matrices, or covariance matrices. A more advanced usage is using matrices to visualize the results of computational algorithms or computer experiments.

Data matrices

Heat maps provide a convenient way to visualize many observations and/or many variables in a single glance. For example, suppose that you want to compare and contrast the 24 trucks in the Sashelp.Cars data by using the 10 numerical variables in the data set. The following SAS/IML statements read the data into a matrix and sort the matrix according to the value of the Invoice variable (that is, by price). The Model variable, which identifies each truck, is sorted by using the same criterion:

proc iml;
use Sashelp.Cars where(type="Truck");
read all var _NUM_ into Trucks[c=varNames r=Model];
close Sashelp.Cars;
 
call sortndx(idx, Trucks[,"Invoice"]);     /* find indices that sort the data */
Trucks = Trucks[idx,]; Model = Model[idx]; /* sort the data and the models    */

The original variables are all measured on different scales. For example, the Invoice variable has a mean value of $23,000, whereas the EngineSize variable has a mean value of 4 liters. Most of the values in the matrix are less than 300. Consequently, a heat map of the raw values would not be illuminating: The price variables would be displayed as dark cells (high values) and the rest of the matrix would be almost uniformly light (small values, relative to the prices). Fortunately, the HEATMAPCONT subroutine supports the SCALE="Col" option to standardize each variable to have mean 0 and unit variance. With that option, a heat map reveals the high, middle, and low values of each variable:

call HeatmapCont(Trucks) scale="Col"; /* standardize cols */

The HEATMAPCONT output shows the following default values:

  • The default color ramp is the 'TwoColor' color ramp. With my ODS style, white is used to display small values and dark blue is used to display large values. The color ramp indicates that the standardized values are approximately in the range [-2, 3], which probably indicates that several variables have positive skewness.
  • The axes are labeled by using the row numbers of the 24 x 10 matrix.
  • The Y axis "points down." In other words, the heat map displays the matrix as it would be printed: the top of the heat map displays the first few rows and the bottom of the heat map displays the last few rows.

A problem with this initial heat map is that we don't know which trucks correspond to the rows, nor do we know which variables correspond to the columns. You can use the XVALUES= and YVALUES= options to add that information to the heat map. Also, let's use a three-color color ramp and center the range of the color ramp so that white represents the mean value for each variable and red (blue) represents positive (negative) deviations from the mean. The resulting image is shown below (click to enlarge):

call HeatmapCont(Trucks) scale="Col"           /* standardize cols    */
     xvalues=varNames yvalues=Model            /* label rows and cols */
     colorramp="ThreeColor" range={-3 3}       /* color range [-3,3]  */
     legendtitle = "Standardized values" title="Truck Data";

This visualization enables us to draw several conclusions about the data. In general, expensive trucks are powerful (large values of EngineSize, Cylinders, and Horsepower), large in size (large values of Weight, Wheelbase, and Length), and get poor fuel economy (small values of MPG_City and MPG_Highway). However, two vehicles seem unusual. Compared with similarly priced trucks, the Baja is smaller and more fuel efficient. Compared with similarly priced trucks, the Tundra is more powerful and larger in size.

Correlation matrices

Because the data matrix is sorted by price, it looks like price variables are positively correlated with all variables except for the fuel economy variables. Let's see if that is the case by computing the correlation matrix and displaying it as a heat map:

corr = corr(Trucks);
call HeatmapCont(corr)
     xvalues=varNames yvalues=varNames
     colorramp="ThreeColor" range={-1 1}
     title="Correlations in Truck Data";

The correlation matrix confirms our initial impression. The price variables are strongly correlated with the variables that indicate the power and size of the trucks. The price is negatively correlated with the fuel efficiency variables.

One of the advantages of computing correlation analysis in the SAS/IML language is that you can easily change the order of the variables. For example, the following statements move the fuel efficiency variables to the end, so that all of the positively correlated variables are grouped together:

newOrder = (1:5) || (8:10) || (6:7);
corr = corr[newOrder, newOrder];
varNames = varNames[newOrder];
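You can then display the reordered correlation matrix by calling the HEATMAPCONT subroutine again:

call HeatmapCont(corr)
     xvalues=varNames yvalues=varNames
     colorramp="ThreeColor" range={-1 1}
     title="Correlations in Truck Data (Reordered)";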

Can you think of other applications of visualizing matrices by using the HEATMAPCONT subroutine? Leave a comment.

tags: Data Analysis, Statistical Graphics
Jul 23, 2014
 

In a previous blog post, I showed how to overlay a prediction ellipse on a scatter plot in SAS by using the ELLIPSE statement in PROC SGPLOT. The ELLIPSE statement draws the ellipse by using a standard technique that assumes the sample is bivariate normal. Today's article describes the technique and shows how to use SAS/IML to compute a prediction ellipse from a 2 x 2 covariance matrix and a mean vector. This can be useful for plotting ellipses for subgroups, ellipses that correspond to robust covariance estimates, or an ellipse for a population (rather than for a sample).

SAS/IML routines for computing prediction ellipses have been around for a long time. A module named CONTOUR appeared in the version 6 (1989) documentation for SAS/IML. In version 6.12, the module was used to compare prediction ellipses for robust and classical covariance matrices. A module appears in Michael Friendly's 1991 book The SAS System for Statistical Graphics, and in several of his papers, including a 1990 SUGI article. Friendly continues to distribute the %ELLIPSES macro for displaying ellipses on scatter plots. The SAS/IML function in this article is similar to these earlier modules.

Prediction ellipses: The main ideas

You can compute a prediction ellipse for sample data if you provide the following information:

  • m: A vector for the center of the ellipse.
  • S: A covariance matrix. This can be a classical covariance matrix or a robust covariance matrix.
  • n: The number of nonmissing observations in the sample.
  • p: The confidence level for the prediction ellipse. Equivalently, you could specify a significance level, α, which corresponds to a 1 – α confidence level.

The implicit formula for the prediction ellipse is given in the documentation for the CORR procedure as the set of points that satisfy a quadratic equation. However, to draw the ellipse, you should parameterize the ellipse explicitly. For example, when the axes of the ellipse are aligned with the coordinate axes, the equation of an ellipse with center (c,d) and with radii a and b is defined implicitly as the set of points (x,y) that satisfies the equation
      (x-c)² / a² + (y-d)² / b² = 1.
However, if you want to draw the ellipse, the parametric form is more useful:
      x(t) = c + a cos(t)
      y(t) = d + b sin(t)
for t in the interval [0, 2π].
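For example, the following SAS/IML statements trace an axis-aligned ellipse by evaluating the parametric form at many values of t; the center and radii are arbitrary choices:

proc iml;
c = 0; d = 0;                        /* center (c,d)             */
a = 2; b = 1;                        /* radii a and b            */
t = do(0, 2*constant("Pi"), 0.05);   /* parameter values         */
x = c + a*cos(t);                    /* parametric x coordinates */
y = d + b*sin(t);                    /* parametric y coordinates */
/* write (x,y) to a data set to plot the curve with a SERIES statement */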

An algorithm to draw a prediction ellipse

I can think of two ways to draw prediction ellipses. One way is to use the geometry of Mahalanobis distance. The second way, which is used by the classical SAS/IML functions, is to use ideas from principal components analysis to plot the ellipse based on the eigendecomposition of the covariance matrix:

  1. Find the eigenvalues (λ1 and λ2) and eigenvectors (e1 and e2) of the covariance matrix, S. The eigenvectors form an orthogonal basis for the coordinate system for plotting the ellipse. The first eigenvector (e1) points in the direction of the greatest variance and defines the major axis for the prediction ellipse. The second eigenvector (e2) points in the direction of the minor axis.
  2. As mentioned previously, sines and cosines parameterize an ellipse whose axes are aligned with the standard coordinate directions. It is just as simple to parameterize an ellipse in the coordinates defined by the eigenvectors:
    • The eigenvectors have unit length, so a circle is formed by the linear combination cos(t)*e1 + sin(t)*e2 for t in the interval [0, 2π].
    • The ellipse sqrt(λ1)cos(t)*e1 + sqrt(λ2)sin(t)*e2 is a "standardized ellipse" in the sense that the standard deviations of the variables are used to scale the semi-major axes.
  3. To get a prediction ellipse, scale the standardized ellipse by a factor that depends on quantiles of the F distribution with (2, n-2) degrees of freedom, the confidence level, and an adjustment factor that depends on the sample size n. The SAS/IML module contains the details.
  4. Translate the prediction ellipse by adding the vector m.

A SAS/IML module for computing a prediction ellipse

The following module accepts a vector of k confidence levels. The module returns a matrix with three columns. The meaning of each column is described in the comments.

proc iml;
start PredEllipseFromCov(m, S, n, ConfLevel=0.95, nPts=128);
/* Compute prediction ellipses centered at m from the 2x2 covariance matrix S.
   The return matrix is a matrix with three columns.
   Col 1: The X coordinate of the ellipse for the confidence level.
   Col 2: The Y coordinate of the ellipse for the confidence level.
   Col 3: The confidence level. Use this column to extract a single ellipse.
 
   Input parameters:
   m           a 1x2 vector that specifies the center of the ellipse. 
               This can be the sample mean or median.
   S           a 2x2 symmetric positive definite matrix. This can be the 
               sample covariance matrix or a robust estimate of the covariance.
   n           The number of nonmissing observations in the data. 
   ConfLevel   a 1 x k vector of (1-alpha) confidence levels that determine the
               ellipses. Each entry is 0 < ConfLevel[i] < 1.  For example,
               0.95 produces the 95% confidence ellipse for alpha=0.05.
   nPts        the number of points used to draw the ellipse. The default is 128.
*/
 
   /* parameterize standard ellipse in coords of eigenvectors */
   call eigen(lambda, evec, S);   /* eigenvectors are columns of evec */
   t = 2*constant("Pi") * (0:nPts-1) / nPts;
   xy  = sqrt(lambda[1])*cos(t) // sqrt(lambda[2])*sin(t);
   stdEllipse = T( evec * xy );   /* transpose for convenience */
 
   /* scale the ellipse; see documentation for PROC CORR */
   c = 2 * (n-1)/n * (n+1)/(n-2);          /* adjust for finite sample size */
   p = rowvec(ConfLevel);                  /* row vector of confidence levels */
   F = sqrt(c * quantile("F", p, 2, n-2)); /* p = 1-alpha */
 
   ellipse = j(ncol(p)*nPts, 3);  /* 3 cols: x y p */
   startIdx = 1;                  /* starting index for next ellipse */
   do i = 1 to ncol(p);           /* scale and translate */
      idx = startIdx:startIdx+nPts-1;
      ellipse[idx, 1:2] = F[i] # stdEllipse + m; 
      ellipse[idx, 3] = p[i];     /* use confidence level as ID */
      startIdx = startIdx + nPts;
   end;           
   return( ellipse );
finish;

Compare a classical and robust prediction ellipse

The SAS/IML documentation includes an example in which a classical prediction ellipse is compared with a robust prediction ellipse for three-dimensional data that contain outliers. The following SAS/IML statements define the classical and robust estimates of location and scatter for two of the variables. The PredEllipseFromCov function is called twice: once for the classical estimates, which are based on all 21 observations, and once for the robust estimates, which are based on 17 observations:

/* Stackloss data example from SAS/IML doc */
/* classical mean and covariance for stackloss data */
/*      Rate      AcidConcentration */
N = 21;
mean = {60.428571 86.285714}; 
cov  = {84.057143 24.571429,
        24.571429 28.714286};
classicalEllipse = PredEllipseFromCov(mean, cov, N, 0.9);
 
/* robust estimates of location and scatter for stackloss data */
/*         Rate      AcidConcentration */
N = 17;
robMean = {56.7      85.5}; 
robCov  = {23.5     16.1,
           16.1      32.4};
robEllipse = PredEllipseFromCov(robMean, robCov, N, 0.9);
 
/* write the ellipse data to a SAS data set */
E = classicalEllipse || robEllipse[,1:2];
create Ellipse from E[c={"classX" "classY" "ConfLevel" "robX" "robY"}];
append from E;
close Ellipse;
quit;

The following SAS statements merge the data and the coordinates for the prediction ellipses. The POLYGON statement in the SGPLOT procedure is used to overlay the ellipses on a scatter plot of the data. The POLYGON statement is available in SAS 9.4M1. Notice that the PredEllipseFromCov function returns a matrix with three columns. The third column (the confidence level) is used as the ID= variable for the POLYGON statement:

data Stackloss;
input Rate Temperature AcidCon @@;
datalines;
80  27  89  80  27  88  75  25  90  62  24  87  62  22  87
62  23  87  62  24  93  62  24  93  58  23  87  58  18  80
58  18  89  58  17  88  58  18  82  58  19  93  50  18  89
50  18  86  50  19  72  50  19  79  50  20  80  56  20  82
70  20  91
;
 
data All;
set Stackloss Ellipse;
run;
 
title "Classical and Robust Prediction Ellipses";
proc sgplot data=All;
scatter x=Rate y=AcidCon / transparency=0.5 markerattrs=(symbol=CircleFilled size=12);
polygon x=classX y=classY id=ConfLevel / 
        lineattrs=GraphData1 name="classical" legendlabel="Classical 90% Ellipse";
polygon x=robX y=robY id=ConfLevel / 
        lineattrs=GraphData2 name="robust" legendlabel="Robust 90% Ellipse";
xaxis grid; yaxis grid; 
keylegend "classical" "robust";
run;

The classical prediction ellipse is based on all 21 observations. The robust estimation method classified four observations as outliers, so the robust ellipse is based on 17 observations. The four outliers are the markers that are outside of the robust ellipse.

Plotting prediction ellipses for subgroups

You can also use this module to overlay prediction ellipses for subgroups of the data. For example, you can compute the mean and covariance matrix for each of the three species of flower in the sashelp.iris data. You can download the complete program that computes the prediction ellipses and overlays them on a scatter plot of the data. The following graph shows the result:


In summary, by using the SAS/IML language, you can write a short function that computes prediction ellipses from four quantities: a center, a covariance matrix, the sample size, and the confidence level. You can use the function to compute prediction ellipses for classical estimates, robust estimates, and subgroups of the data. You can use the POLYGON statement in PROC SGPLOT to overlay these ellipses on a scatter plot of the data.

tags: Data Analysis, Statistical Graphics
Jul 21, 2014
 

It is common in statistical graphics to overlay a prediction ellipse on a scatter plot. This article describes two easy ways to overlay prediction ellipses on a scatter plot by using SAS software. It also describes how to overlay multiple prediction ellipses for subpopulations.

What is a prediction ellipse?

A prediction ellipse is a region for predicting the location of a new observation under the assumption that the population is bivariate normal. For example, an 80% prediction ellipse indicates a region that would contain about 80% of a new sample that is drawn from a bivariate normal population with mean and covariance matrices that are equal to the sample estimates.

A prediction ellipse is helpful for detecting deviation from normality. If you overlay a prediction ellipse on your sample and the sample does not "fill" the ellipse in the way that bivariate normal data would, then you have a graphical indication that the data are not bivariate normal.

Because the center of the ellipse is the sample mean, a prediction ellipse can give a visual indication of skewness and outliers in the data. A prediction ellipse also displays linear correlation: If you standardize both variables, a skinny ellipse indicates highly correlated variables, whereas an ellipse that is nearly circular indicates little correlation.
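For example, the following sketch uses PROC STDIZE to standardize two variables before overlaying the ellipse; it uses the iris data that appear in the next section:

proc stdize data=sashelp.iris out=StdIris method=std;  /* mean 0, unit variance */
   where Species='Versicolor';
   var PetalLength PetalWidth;
run;
 
proc sgplot data=StdIris noautolegend;
   scatter x=PetalLength y=PetalWidth;
   ellipse x=PetalLength y=PetalWidth;
run;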

How to draw a prediction ellipse in SAS

SAS provides two easy ways to overlay a prediction ellipse on a scatter plot. The SGPLOT procedure supports the SCATTER statement, which creates a scatter plot, and the ELLIPSE statement, which overlays a bivariate normal prediction ellipse, computed by using the sample mean and covariance. The following statements overlay a 95% prediction ellipse on 50 measurements of the width and length of petals for a particular species of flower:

title "95% Prediction Ellipse";
title2 "iris Versicolor";
proc sgplot data=sashelp.iris noautolegend;
   where Species='Versicolor';
   scatter x=PetalLength y=PetalWidth / jitter;  /* JITTER is optional    */
   ellipse x=PetalLength y=PetalWidth;           /* default is ALPHA=0.05 */
run;

The JITTER option (which requires SAS 9.4) is used to slightly displace some of the observations. Without the option, some markers overlap because the data are rounded to the nearest millimeter. By default, the ELLIPSE statement computes and displays a 95% prediction ellipse, which tends to surround most of the data. However, you can use the ALPHA= option to display a 100(1-α)% prediction ellipse. Some researchers recommend overlaying a 68% prediction ellipse (ALPHA=0.32) because 68% is the probability of observing univariate normal data that is within one standard deviation of the mean. See the article by Michael Friendly for examples and a discussion.
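For example, the following statements display a 68% prediction ellipse for the same data:

title "68% Prediction Ellipse";
title2 "iris Versicolor";
proc sgplot data=sashelp.iris noautolegend;
   where Species='Versicolor';
   scatter x=PetalLength y=PetalWidth / jitter;
   ellipse x=PetalLength y=PetalWidth / alpha=0.32;  /* 68% prediction ellipse */
run;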

Because a prediction ellipse gives a graphical indication of the linear correlation between variables, you can create a prediction ellipse as part of a correlation analysis in SAS. The following call to the CORR procedure uses the PLOTS=SCATTER option to overlay a 95% prediction ellipse. The output (not shown) is similar to the previous graph.

proc corr data=sashelp.iris plots=scatter(ellipse=prediction);
   where Species='Versicolor';
   var PetalLength PetalWidth;
run;

Draw prediction ellipses for each group

Occasionally, you might want to overlay prediction ellipses for subsets of the data that correspond to subpopulations. For example, if your data contain both male and female subjects, it might be that the means and covariance for the variables are different between males and females. In that case, overlaying a prediction ellipse for each subpopulation helps you to visualize how characteristics vary between the groups.

There are several ways to draw prediction ellipses for groups within the data. Michael Friendly provides a macro that uses SAS/GRAPH to overlay ellipses. It supports a grouping variable, as shown in his 2006 paper in J. Stat. Soft. (This macro has been used in some examples by Robert Allison.)

If you want to use ODS statistical graphics to display the multiple ellipses, you need to use a little trick. Because the ELLIPSE statement in PROC SGPLOT does not support a GROUP= option as of SAS 9.4m2, you have to reshape the data so that each group becomes a new variable. This is equivalent to transposing the data from a "long form" to a "wide form." From my previous blog post, here is one way to create six variables that represent the petal length and width variables for each of the three species of iris in the sashelp.iris data set:

data Wide;
/*   |-- PetalLength --| |--- PetalWidth ---|  */
keep L_Set L_Vers L_Virg W_Set W_Vers W_Virg;  /* names of new variables */
merge sashelp.iris(where=(Species="Setosa") 
              rename=(PetalLength=L_Set PetalWidth=W_Set))
      sashelp.iris(where=(Species="Versicolor") 
              rename=(PetalLength=L_Vers PetalWidth=W_Vers))
      sashelp.iris(where=(Species="Virginica") 
              rename=(PetalLength=L_Virg PetalWidth=W_Virg));
run;

Now that the data are in separate variables, call PROC SGPLOT to overlay three scatter plots and three 95% prediction ellipses:

title "95% Prediction Ellipses for Each Group";
proc sgplot data=Wide;
  scatter x=L_Set  y=W_Set  / jitter name="Set"  legendlabel="Setosa";
  scatter x=L_Virg y=W_Virg / jitter name="Virg" legendlabel="Virginica";
  scatter x=L_Vers y=W_Vers / jitter name="Vers" legendlabel="Versicolor";
  ellipse x=L_Set  y=W_Set  / lineattrs=GraphData1;
  ellipse x=L_Virg y=W_Virg / lineattrs=GraphData2;
  ellipse x=L_Vers y=W_Vers / lineattrs=GraphData3;
  keylegend "Set" "Virg" "Vers" / title="Species:";
  xaxis label="Petal Length (mm)";
  yaxis label="Petal Width (mm)";
run;

Notice that I used a trick to make the color of each prediction ellipse match the color of the data to which it is associated. The colors for markers in the scatter plots are assigned automatically according to the current ODS style. The style elements used for the markers are named GraphData1, GraphData2, and GraphData3. When I create the ellipses, I use the LINEATTRS= statement to set the style attributes of the ellipse to match the corresponding attributes of the data.

The graph visualizes the relationship between the petal length and the petal width variables in these three species. At a glance, three facts are evident:

  • The means of the variables (the centers of the ellipses) are different across the groups. This is a "graphical MANOVA test."
  • The Versicolor ellipse is "thinner" than the others, which indicates that the correlation between petal length and width is greater in that species.
  • The Virginica ellipse is larger than the others, which indicates greater variance within that species.

In summary, it is easy to use the ELLIPSE statement in PROC SGPLOT to add a prediction ellipse to a scatter plot. If you want to add an ellipse for subgroups of the data, use the trick from my previous article to reshape the data. Plotting ellipses for subgroups enables you to visually inspect covariance and correlation within groups and to compare those quantities across groups.

tags: Data Analysis, Statistical Graphics
Jul 18, 2014
 

An empty matrix is a matrix that has zero rows and zero columns. At first "empty matrix" sounds like an oxymoron, but when programming in a matrix language such as SAS/IML, empty matrices arise surprisingly often.

Sometimes empty matrices occur because of a typographical error in your program. If you try to print or compute with a matrix that has not been defined, the SAS/IML program will usually display an error message:

proc iml;
print UndefinedMatrix;
 
            ERROR: (execution) Matrix has not been set to a value.

However, sometimes an empty matrix is the correct result of a valid computation. For example:

  • If you use the LOC function to find elements in a vector that satisfy a criterion, the empty matrix results when no elements satisfy the criterion:
    x = {2 4 6};
    y = loc(x < 0);         /* y is empty matrix */
  • If you use the XSECT function to find the intersection of two sets, the empty matrix results when the intersection is empty:
    y = xsect({1}, {2});    /* empty intersection of sets */
  • If you use the REMOVE function to remove elements of a matrix, the empty matrix results when you remove all elements:
    y = remove({2}, 1);     /* remove 1st element from 1x1 matrix */

In my book Statistical Programming with SAS/IML Software, I mention that you can determine whether a matrix is empty by using one of three functions: TYPE, NROW, or NCOL, as follows:

print (type(y))[L="type"] (nrow(y))[L="nrow"] (ncol(y))[L="ncol"];
 
if ncol(y)=0 then print "y is empty";
else              print "y is not empty";

The output shows that the "type" of an empty matrix is 'U' (for "undefined"). In my book I use the NCOL function to test for empty matrices, but the other functions work just as well.

New ways to create and detect empty matrices

Recent releases of SAS/IML software support new ways to create and detect empty matrices:

  • In SAS/IML 12.1, you can use curly braces to define an empty matrix:
    y = {};                 /* y is an empty matrix */
  • In SAS/IML 12.3, you can use the ISEMPTY function to test whether a matrix is empty:
    if IsEmpty(y) then print "y is empty";  
    else               print "y is not empty";

I think that the new syntax produces SAS/IML programs that are easier to read and understand. Have you encountered empty matrices in your SAS/IML programs? Where? How did you handle them? Leave a comment.

tags: 12.1, 12.3, Data Analysis
Jul 14, 2014
 

In my four years of blogging, the post that has generated the most comments is "How to handle negative values in log transformations." Many people have written to describe data that contain negative values and to ask for advice about how to log-transform the data.

Today I describe a transformation that is useful when a variable ranges over several orders of magnitude in both the positive and negative directions. This situation comes up when you measure the net change in some quantity over a period of time. Examples include the net profit at companies, net change in the price of stocks, net change in jobs, and net change in population.

A state-by-state look at net population change in California


Here's a typical example. The US Census Bureau tracks state-to-state migration flows. (For a great interactive visualization of internal migration, see the Governing Data Web site.) The adjacent scatter plot shows the net number of people who moved to California from some other US state in 2011, plotted against the population of the state. (Click to enlarge.) For example, 35,650 people moved to California from Arizona, whereas 49,635 moved to Arizona from California, so Arizona was responsible for a net change of –13,985 to the California population. The population of Arizona is just under 5 million, so the marker for Arizona appears in the lower left portion of the graph.

Most states account for a net change in the range [–5000, 5000] and most states have populations less than 5 million. When plotted on the scale of the data, these markers are jammed into a tiny portion of the graph. Most of the graph is empty because of states such as Texas and Illinois that have large populations and are responsible for large changes to California's population.

As discussed in my blog post about using log-scale axes to visualize variables, when a variable ranges over several orders of magnitudes, it is often effective to use a log transformation to see large and small values on the same graph. The population variable is a classic example of a variable that can be log-transformed. However, 80% of the values for the net change in population are negative, which rules out the standard log transformation for that variable.

The log-modulus transformation


A modification of the log transformation can help spread out the magnitude of the data while preserving the sign of the data. It is called the log-modulus transformation (John and Draper, 1980). The transformation takes the logarithm of the absolute value of the variable plus 1. If the original value was negative, "put back" the sign of the data by multiplying by –1. In symbols,
L(x) = sign(x) * log(|x| + 1)

The graph of the log-modulus transformation is shown to the left. The transformation preserves zero: a value that is 0 in the original scale is also 0 in the transformed scale. The function acts like the log (base 10) function when x > 0. Notice that L(10) ≈ 1, L(100) ≈ 2, and L(1000) ≈ 3. This property makes it easy to interpret values of the transformed data in terms of the scale of the original data. Negative values are transformed similarly.
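In SAS, the transformation is a one-liner in the DATA step. The following sketch assumes a data set named Migration that contains a variable named NetChange; both names are hypothetical:

/* log-modulus transformation; data set and variable names are hypothetical */
data LogMod;
set Migration;
L_Net = sign(NetChange) * log10(abs(NetChange) + 1);
run;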

Applying the log-modulus transformation


Let's see how the log-modulus transformation helps to visualize the state-by-state net change in California's population in 2011. You can download the data for this example and the SAS program that creates the graphs. A previous article showed how to use PROC SGPLOT to display a log axis on a scatter plot, and I have also discussed how to create custom tick marks for log axes.

The scatter plot to the left shows the data after using the log-modulus transformation on the net values. The state populations have been transformed by using a standard log (base 10) transformation. The log-modulus transformation divides the data into two groups: those states that contributed to a net influx of people to California, and those states that reduced the California population. It is now easy to determine which states are in which group: The states that fed California's population were states in New England, the Rust Belt, and Alaska. It is also evident that size matters: among states that lure Californians away, the bigger states tend to attract more.

The main effect of the log-modulus transformation is to spread apart markers that are near the origin and to pull in markers that are relatively far from the origin. By using the transformation, you can visualize variables that span several orders of magnitudes in both the positive and negative directions. The resulting graph is easy to interpret if you are familiar with logarithms and powers of 10.

tags: Data Analysis, Statistical Programming
Jul 11, 2014
 

In my previous blog post, I showed how to use log axes on a scatter plot in SAS to better visualize data that range over several orders of magnitude. Because the data contained counts (some of which were zero), I used a custom transformation x → log10(x+1) to visualize the data. You can download the data and the SAS program.

I left one problem unresolved at the end of my last post: The tick marks on the axes were labeled on the log scale so that, for example, a marker positioned at the tick mark labeled '2' actually represents a value of 10² = 100. If the graph is intended for people who do not use logarithms on a regular basis, this log-scale axis might hinder them from properly interpreting the graph.

Fortunately, the SGPLOT procedure in SAS supports custom tick marks. In the XAXIS and YAXIS statements, you can use the VALUES= option to specify the location of tick marks and you can use the VALUESDISPLAY= option to specify the label for each tick mark.

Determine the tick locations

The goal is to place tick marks on the plot for the transformed data, but label those ticks by using the original untransformed counts. For example, suppose that you decide to display tick marks that correspond to the following counts: 0, 5, 10, 25, 50, 100, 250, and 500. The following DATA step computes the log-x-plus-one transformation for those values:

data TickMarks;
input Count @@;
LogCountP1 = log10(Count+1);
datalines;
0 5 10 25 50 100 250 500
;
run;
 
proc print noobs; run;

The numbers in the second column are the locations of the tick marks in the scatter plot of the transformed data. Put those numbers on the VALUES= option. The numbers in the first column are the corresponding labels that we want to display with those tick marks. Put those numbers (as text strings) on the VALUESDISPLAY= option, as follows:

ods graphics / width=750 height=1000;
title "Custom Axes on Log Scale";
proc sgplot data=LogComments noautolegend;
   label logCommentP1="Number of Original Comments" logResponseP1="Number of Responses";
   scatter x=logCommentP1 y=logResponseP1 / datalabel=NickName;
   lineparm x=0 y=0 slope=1; 
   xaxis grid offsetmin=0.01 offsetmax=0.1
         values=(0 0.78 1.04 1.41 1.71 2.00)              /* tick locations   */
         valuesdisplay = ("0" "5" "10" "25" "50" "100");  /* labels displayed */
   yaxis grid offsetmin=0.05 offsetmax=0.1
         values=(0 0.78 1.04 1.41 1.71 2.00 2.40 2.70)
         valuesdisplay = ("0" "5" "10" "25" "50" "100" "250" "500");
run;

This plot shows the custom tick marks for the axes. The data are plotted on a log scale, but the labels on the tick marks show the original scale of the data. It is easy to estimate the number of comments and responses for each individual. For example, Robert has 25 original comments and less than 250 responses. John has less than 10 original comments and 50 responses.

Of course, you still have to be careful when reading graphs that have nonlinear axes. For one thing, you can't compare distances between points. On the plot, it looks like Tricia and Michelle are about the same distance apart as Rick and Chris, but that is not true. Tricia and Michelle differ by 25 comments, whereas Rick and Chris differ by more than 150.

Automating the tick locations and labels

I have one final remark. When creating the plot, I used the DATA step to compute the locations for a selected set of tick marks, but then I entered those values by hand on the VALUES= and VALUESDISPLAY= options in PROC SGPLOT. The fancier approach is to pack information about the tick marks into SAS macro variables and use the macro variables in PROC SGPLOT. You can use the SYMPUTX routine and string concatenation routines to carry out this task. The following SAS/IML program shows how to assign macro variables in PROC IML. I will leave the analogous DATA step program as an exercise for an ambitious SAS programmer:

proc iml;
Count = {0 5 10 25 50 100 250 500};
d = ' "' + char(Count,3) + '"';
v = " " + putn(log10(Count+1), "Best6.");
call symputx("D2", rowcat(d));
call symputx("V2", rowcat(v));
quit;
 
%put _user_;

The values of the macro variables are displayed in the SAS Log. You can now use these macro variables in the PROC SGPLOT statements as follows:

proc sgplot data=LogComments noautolegend;
   ...
   yaxis grid offsetmin=0.05 offsetmax=0.1 values=(&V2) valuesdisplay=(&D2);
run;
tags: Data Analysis, Statistical Graphics
Jul 9, 2014
 

If you are trying to visualize numerical data that range over several magnitudes, conventional wisdom says that a log transformation of the data can often result in a better visualization. This article shows several ways to create a scatter plot with logarithmic axes in SAS and discusses some of the advantages and disadvantages of using logarithmic transformations. It also discusses a common problem: How to transform data that range over several orders of magnitudes but that also contain zeros? (Recall that the logarithm of zero is undefined!)

Let's look at an example. My colleague, Chris Hemedinger, wrote a SAS program that collects data about comments on the blogs.sas.com Web site. For each comment, he recorded the name of the commenter and whether the comment was an original comment or a response to a previous comment. For example, "This is a great article. Thanks!" is classified as a comment, whereas "You're welcome. Glad you liked it!" is classified as a response. You can download the program that creates the data and the graphs in this article. I consider only commenters who have posted more than ten comments.

The following call to the SGPLOT procedure creates a scatter plot of these data:

title "Comments and Responses on blogs.sas.com";
proc sgplot data=Comments noautolegend;
   scatter x=Comment y=Response / datalabel=TruncName;
   lineparm x=0 y=0 slope=1; 
   yaxis grid offsetmin=0.05;
   xaxis grid;
run;

The scatter plot shows the number of comments and responses for 50 people. Those who have commented more than 30 times are labeled, and a line is drawn with unit slope. The line, which was drawn by using the LINEPARM statement, enables you to see who has initiated many comments and who has posted many responses. For example, Michelle and Tricia (lower right) often comment on blogs, but few of their comments are in response to others. In contrast, Sanjay and Robert are regular SAS bloggers and most of their remarks are responses to other people's comments. The big outlier is Chris, who has initiated almost 100 original comments while also posting more than 500 responses to comments on his popular blog.

The scatter plot is an excellent visualization of the data... provided that your goal is to highlight the people who post the most comments and classify them into "commenter" or "responder." The plot enables you to identify about a dozen of the 50 people in the data set. But what about the other nameless markers near the origin? You can see that there are many people who have posted between 10 and 30 comments, but the current plot makes it difficult to find out who they are. To visualize those observations (while not losing information about Chris and Michelle) requires some sort of transformation that distributes the data more uniformly within the plot.

Logarithmic transformations

The standard visualization technique to use in this situation is the logarithmic transformation of data. When you have data whose range spans several orders of magnitude, you should consider whether a log transform will enhance the visualization. A log transformation preserves the order of the observations while making outliers less extreme. (It also spreads out values that are in [0,1], but that doesn't apply for these data.) For these data, both the X and the Y variables span two orders of magnitude, so let's try a log transform on both variables.

The XAXIS and YAXIS statements in the SGPLOT procedure support the TYPE=LOG option, which specifies that an axis should use a logarithmic scale. However, if you use that option on these data, the following message is printed to the SAS Log:

NOTE: Log axis cannot support zero or negative values in the data range.
      The axis type will be changed to LINEAR.

The note informs you that some people (for example, Mike) have never responded to a comment. Other people (Bradley, label not shown) have only posted responses, but have never initiated a comment. Because the logarithm of 0 is undefined, the plot refuses to use a logarithmic scale.

If you are willing to leave out these individuals, you can use a WHERE clause to subset the data:

title "Automatic Log Transformation";
title2 "Comment>0 and Response>0";
proc sgplot data=Comments;
   where Comment > 0  &  Response > 0;
   scatter x=Comment y=Response / datalabel=NickName;
   xaxis grid type=log minor offsetmin=0.01;
   yaxis grid type=log minor offsetmin=0.05;
run;
title2;

The graph shows all individuals who have initiated and responded at least once. The log transformation has spread out the data so that it is possible to label all markers by using first names. The tick marks on the axes show counts in the original scale of the data. It is easy to see who has about 10 or about 100 responses. With a little more effort, the minor tick marks enable you to discover who has 3 or 50 responses. I used the OFFSETMIN= option to add a little extra space for the data labels.

The SGPLOT procedure does not support using the LINEPARM statement with logarithmic axes, so there is no diagonal line. However, the grid lines enable you to see at a glance that Michael, Jim, and Shelly initiate comments as often as they respond. Individuals that appear in the lower right grid square are those who initiate more than they respond. (If you really want a diagonal line, use the VECTOR statement, as sketched below.)
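Here is a sketch of the VECTOR approach, which draws a line segment between two points on the log-scaled axes; the endpoint (600, 600) is an arbitrary value that is larger than the data:

/* sketch: use the VECTOR statement to draw a diagonal reference line */
data Comments2;
set Comments;
x0 = 1; y0 = 1; xEnd = 600; yEnd = 600;   /* endpoints of the diagonal */
run;
 
proc sgplot data=Comments2 noautolegend;
   where Comment > 0 & Response > 0;
   scatter x=Comment y=Response / datalabel=NickName;
   vector x=xEnd y=yEnd / xorigin=x0 yorigin=y0 noarrowheads lineattrs=(pattern=dash);
   xaxis grid type=log minor;
   yaxis grid type=log minor;
run;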

Although the log transformation has successfully spread out the data, this graph does not show the seven people who were dropped by the WHERE clause. It is undesirable to omit certain observations just because the log scale is restricted to positive counts.

The log-of-x-plus-one transformation

There is an alternative: Rather than using the automatic log scale that PROC SGPLOT provides, you can write your own data transformation. Within the DATA step, you have complete control over the transformation and you can handle zero counts in any mathematically consistent way. I have previously written about how to use a log transformation on data that contain zero or negative values. The idea is simple: instead of the standard log transformation, use the modified transformation x → log(x+1). In this transformation, the value 0 is transformed into 0. The transformed data will be spread out but will show all observations.

The drawback of the "log-of-x-plus-one" transformation is that it is harder to read the values of the observations from the tick marks on the axes. For example, under the standard log transformation, a transformed value of 1 represents an individual that has 10 comments, since log(10) = 1. Under the transformation x → log(x+1), a transformed value of 1 represents an individual that has 9 comments. You can use the IMAGEMAP option on the ODS GRAPHICS statement to add tooltips to the graph, but of course that won't help someone who is trying to read a printed (static) version of the graph. Nevertheless, let's carry out this nonstandard log transformation:

data LogComments;
set Comments;
label logCommentP1  = "log10(1 + Number of Original Comments)"
      logResponseP1 = "log10(1 + Number of Responses)";
logCommentP1 = log10(1 + Comment);
logResponseP1 = log10(1 + Response);
run;
 
ods graphics / imagemap=ON;
title "Custom Log Transformation";
proc sgplot data=LogComments noautolegend;
   scatter x=logCommentP1 y=logResponseP1 / datalabel=NickName 
                                            tip=(Comment Response Total);
   lineparm x=0 y=0 slope=1; 
   xaxis grid offsetmin=0.01;
   yaxis grid offsetmin=0.05 offsetmax=0.05;
run;

This graph is pretty good: the observations are spread out and all the data are displayed. You can easily show the diagonal reference line by using the LINEPARM statement. A disadvantage of this plot is that it is harder to determine the original counts for the individuals, in part because the tick marks on the axes are displayed on the log scale. Although many analytical professionals will have no problem recalling that the value 2.0 corresponds to a count of 10² = 100, the label on the axes might confuse those who are less facile with logarithms. In my next blog post I will show how to customize the tick marks to show counts on the original scale of the data.

The point of this article is that the log transformation can help you to visualize data that span several orders of magnitudes. However, the log function is properly restricted to positive data, which means that it is more complicated to create and interpret a log transformation on non-positive data.

Do you have suggestions for how to visualize these data? Download the data and let me know what you come up with.

tags: Data Analysis, Statistical Graphics