Data Analysis

January 9, 2013
 

Sometimes a categorical variable has many levels, but you are only interested in displaying the levels that occur most frequently. For example, if you are interested in the number of times that a song was purchased on iTunes during the past week, you probably don't want a bar chart with thousands of songs. Instead, you probably want a "Top 40" or "Top 20" chart that shows the number of purchases for the most popular songs.

These kinds of charts arise frequently, and you can create them in SAS by using PROC FREQ to compute the counts for the categories, followed by a call to PROC SGPLOT to display the bar chart for the top categories.

As an example, suppose that you are interested in plotting the number of different models of vehicles that are manufactured by each car company. The Sashelp.Cars data set has a Make variable, which identifies the company, and there is one observation for each different vehicle that a company manufactures. The following call to PROC FREQ counts the number of models that each company makes, sorts the counts in decreasing order, and writes the counts to a SAS data set:

proc freq data=sashelp.cars ORDER=FREQ noprint;
  tables make / out=FreqOut;
run;

The key syntax is the ORDER= option, which sorts the counts so that the first observation in the FreqOut data set is the company that makes the most vehicles, the second observation is the company that makes the second most, and so on. The FreqOut data set also has a Count variable that contains the number of vehicles for each manufacturer.
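
If you want to check the result, you can print the first few observations of the output data set. For example:

proc print data=FreqOut(obs=5) noobs;
  var Make Count;
run;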

Because the observations are sorted by the Count variable, you can easily use the OBS= data set option to restrict the observations that appear in a plot:

%let NumCats = 20;   /* limit number of categories in bar chart */
 
proc sgplot data=FreqOut(OBS=&NumCats);   /* restrict to top companies */
  Title "Vehicle Manufacturers That Produce the Most Models";
  Title2 "Top &NumCats Shown";
  hbar make / freq=Count;      /* bar lengths given by Count */
  yaxis discreteorder=data;    /* order of bars same as data set order */
  xaxis label = "Number of Models";
run;

If, instead, you were interested in showing all companies that manufacture at least 12 models, you could use a WHERE clause:

where count >= 12;
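
In context, the full call might look like the following sketch, which reuses the options from the earlier plot:

proc sgplot data=FreqOut;
  Title "Manufacturers That Produce at Least 12 Models";
  where Count >= 12;           /* keep only the larger manufacturers */
  hbar make / freq=Count;
  yaxis discreteorder=data;
  xaxis label = "Number of Models";
run;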

That was pretty easy. However, there is a second kind of bar chart that is also useful, but is more difficult to create. It requires aggregating all of the smaller categories into an "Other" category, and creating a bar whose length represents the sum of the counts for the smaller categories. I'll show how to create that kind of a bar chart in my next post.

tags: Data Analysis, Statistical Graphics
January 3, 2013
 

It's the start of a new year. Have you made a resolution to be a better data analyst? A better SAS statistical programmer? To learn more about multivariate statistics? What better way to start the New Year than to read (or re-read!) the top 12 articles for statistical programmers from my blog in 2012? Each contains tips and techniques to make you a better programmer. (Sorry, but I can't promise that they will help you to lose weight!)

I've organized the 12 tips into four categories: multivariate statistics, simulation, matrix computations, and data analysis. Each of these categories is an essential area of knowledge for statistical programmers. The articles that made the Top 12 list were among my most popular blog posts of 2012.

Multivariate statistics

Multivariate data are often correlated. Therefore multivariate analysis, simulation, and outlier detection must account for correlation. These articles describe techniques for understanding and analyzing correlated data:

Simulation

I've written many articles on simulation, but these two articles describe how to implement efficient simulation algorithms in SAS:

Matrix computations

The SAS/IML language makes it easy to compute with matrices and vectors, and to compute quantities such as eigenvalues:

Data analysis

Data analysis requires knowledge of statistics, software, programming, and a lot of common sense. No wonder the "data scientist" is the current hot job!

  • Fitting a Poisson distribution to data in SAS: Some people ask why the UNIVARIATE procedure doesn't support fitting a Poisson distribution. It's because the Poisson distribution is discrete, whereas the UNIVARIATE procedure fits continuous distributions. To fit Poisson data, use PROC GENMOD.
  • Compute a running mean and variance: In matrix-vector languages, it is important to vectorize computations to maximize the efficiency of your program. This article describes a vectorized algorithm for computing the running mean and the running variance.
  • For each observation, find the variable that contains the minimum value: In SAS software, there are usually many ways to compute a quantity. This article describes how to carry out a common task by using PROC IML, PROC SQL, and the DATA step.

What articles will be popular in 2013? I don't know, but I am committed to bringing you efficient tips and techniques for statistical programming, statistical graphics, and data analysis in SAS. Subscribe to this blog so that you don't miss a single article!

tags: Data Analysis, Getting Started, Statistical Programming
December 5, 2012
 

In a recent article on efficient simulation from a truncated distribution, I wrote some SAS/IML code that used the LOC function to find and exclude observations that satisfy some criterion. Some readers came up with an alternative algorithm that uses the REMOVE function instead of subscripts. I remarked in a comment that it is interesting to compare the relative efficiency of matrix subscripts and the REMOVE function, and that Chapter 15 of my book Statistical Programming with SAS/IML Software provides a comparison. This article uses the TIME function to measure the relative performance of each method.

The task we will examine is as follows: given a vector x and a value x0, return a row vector that contains the values of x that are greater than x0. The following two SAS/IML functions return the same result, but one uses the REMOVE function whereas the other uses the LOC function and subscript extraction:

proc iml;
/* 1. Use REMOVE function, which supports empty indices */
start Test_Remove(x, x0);
   return( (remove(x, loc(x<=x0))) );
finish;
 
/* 2. Use subscripts, which do not support empty indices */
start Test_Substr(x, x0); 
   idx = loc(x>x0);          /* find values greater than x0 */
   if ncol(idx)>0 then 
      return( T(x[idx]) );   /* return values greater than x0 */
   else return( idx );       /* return empty matrix */
finish;

To test the relative performance of the two functions, create a vector x that contains one million normal variates. By using the QUANTILE function, you can obtain various cut points so that approximately 1%, 5%, 10%,...,95%, and 99% of the values will be truncated when you call the functions:

/* simulate 1 million random normal variates */
call randseed(1);
N = 1e6;
x = j(N,1); 
call randgen(x, "Normal");
 
/* compute normal quantiles so fncs return about 1%, 5%,...,95%, 99% of x */
prob = 0.01 || do(0.05,0.95,0.05) || 0.99;
q = quantile("Normal", 1-prob);

The following loop calls each function 50 times and computes the average time for the function to return the truncated data. The results of using the REMOVE function are stored in the first column of the results matrix. The second column stores the results of using index subscripts.

NRepl = 50;      /* repeat computations this many times */
results = j(ncol(prob),2);
 
do i = 1 to ncol(prob);
   t0 = time();  
   do k = 1 to NRepl;   
      y = Test_Remove(x, q[i]);   
   end;  
   results[i,1] = time()-t0;
 
   t0 = time();
   do k = 1 to NRepl;
      z = Test_Substr(x, q[i]);
   end;
   results[i,2] = time()-t0;
end;
results = results / NRepl; /* average time */

You can use the SGPLOT procedure to plot the time that each function takes to extract 1%, 5%, ..., 95%, and 99% of data. In addition, you can smooth the performance by using a loess curve:

/* stack vertically and add categorical variable */
Method = j(ncol(Prob), 1, "Remove") // j(ncol(Prob), 1, "Subscripts"); 
Time = results[,1] // results[,2]; 
Prob = Prob` // Prob`;
create Timing var {"Prob" "Method" "Time"};  append;  close Timing;
 
title "Average Time to Truncate One Million Observations";
proc sgplot data=Timing;
   loess x=Prob y=Time / group = Method;
   yaxis label="Average Time (s)";
   xaxis label="Probability of Truncation";
run;

What does this exercise tell us? First, notice that the average time for either algorithm is a few hundredths of a second, and this is for a vector that contains a million elements. This implies that, in practical terms, it doesn't matter which algorithm you choose! However, for people who care about algorithmic performance, the LOC-subscript algorithm outperforms the REMOVE function provided that you truncate less than 80% of the observations. When you truncate 80% or more, the REMOVE function has a slight advantage.

Notice the interesting shapes of the two curves. The REMOVE algorithm performs worst when about 50% of the points are truncated. The LOC-subscript function is faster when fewer than 50% of the points are truncated, and then its performance levels out: there is essentially no difference in performance between 50% truncation and 99% truncation.

The value in this article is not in analyzing the performance of these particular algorithms. Rather, the value is in knowing how to use the TIME function to compare the performance of whatever algorithms you encounter. That programming skill comes in handy time and time again.
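
For reference, here is a minimal timing template (a sketch; the computation inside the loop is a placeholder) that you can adapt to time your own algorithms:

proc iml;
x = j(1e5, 1);
call randgen(x, "Normal");        /* sample data to operate on */
NRepl = 50;                       /* number of repetitions */
t0 = time();
do k = 1 to NRepl;
   y = x##2 + sin(x);             /* <== replace with the computation to time */
end;
avgTime = (time() - t0) / NRepl;  /* average seconds per call */
print avgTime;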

tags: Data Analysis, Efficiency, Tips and Techniques
October 17, 2012
 

Sometimes a graph is more interpretable if you assign specific colors to categories. For example, if you are graphing the number of Olympic medals won by various countries at the 2012 London Olympics, you might want to assign the colors gold, silver, and bronze to represent first-, second-, and third-place medals. A good choice of colors can reduce the time that a reader spends studying the legend, and increase the time spent studying the data. In this example, I assign a "traffic light" scheme to visualize data about gains and losses: green means positive gains, gray indicates little or no change, and red indicates negative change.

The SGPLOT procedure in SAS makes it easy to display a different color for each level of a grouping variable. By default, when you specify the GROUP= option for a graph, colors are assigned to groups based on the current ODS style. But what can you do if you want to assign specific colors for group categories?

There are several ways to assign colors to categories, including using PROC TEMPLATE to write your own ODS style. However, in SAS 9.3, the easiest option is to use an "attribute map." An attribute map is a small data set that describes how each category level should be rendered. Dan Heath wrote a blog post about the attribute map. In this article I give a simple example of using an attribute map with the SGPLOT procedure, and I show that you can use a format to help you create an attribute map.

In my last article, I created a scatter plot that shows the gains made by women in the workplace. This graph is reproduced below, and you can download the SAS program that contains the data and that creates the plot. The colors make it clear that the proportion of women has increased in most job categories.

The program begins by using PROC format to define a mapping between a continuous variable (the difference between the proportion of women in 1980 and the proportion in 2010) and five discrete levels:

proc format;
value Gain  low -< -10 ="Large Loss"  /* [low,-10) */
            -10 -<  -5 ="Loss"        /* [-10, -5) */
             -5 -    5 ="No Change"   /* [ -5,  5] */
              5 <-  10 ="Gains"       /* (  5, 10] */
             10 <- high="Large Gains";/* ( 10, high] */
run;

Notice that because I use a format, there is not an explicit categorical variable in the data set. Instead, there is a continuous variable (the difference) and a format on that variable.
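
For example, assuming that the difference is stored in a numeric variable named Diff (the variable that the PROC SGPLOT call later in this post specifies for the GROUP= option), attaching the format is a one-line step. You could also place the FORMAT statement directly inside PROC SGPLOT:

data Jobs;
set Jobs;
format Diff Gain.;   /* Diff stays numeric, but displays as one of five categories */
run;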

Nevertheless, you can assign a color to each category by creating an attribute map that assigns formatted values to colors. To do this, define two variables, one named Value and the other MarkerColor. The values of the Value variable define the categories; the corresponding values of the MarkerColor variable define the colors of markers in the scatter plot. (There are other columns that you could define, such as LineColor and MarkerSymbol.) You could hard-code the values "Large Loss," "Loss", and so forth, but why not use the fact that there is a format? That way, if the labels in the format change, the graph will pick up the new labels. The following DATA step uses the PUTN function to apply the user-defined format to values that are representative of each category:

data Attrs;
length Value $20 MarkerColor $20;
ID = "Jobs"; 
Value = putn(-15,"Gain."); MarkerColor = "DarkRed     "; output;
Value = putn( -8,"Gain."); MarkerColor = "DarkOrange  "; output;
Value = putn(  0,"Gain."); MarkerColor = "MediumGray  "; output;
Value = putn(  8,"Gain."); MarkerColor = "GrayishGreen"; output;
Value = putn( 15,"Gain."); MarkerColor = "DarkGreen   "; output;
run;

Notice also that the attribute map includes an ID variable that is used to identify the map, because you can define multiple attribute maps in a single data set.

Equivalently, you could use arrays to store the "cut points" for each category, and use a loop to iterate over all possible categories. This is more general, and I think that the cut points make the program easier to understand than the "strange" numbers (15 and 8) in the previous attribute map.

data Attrs2;
length Value $20 MarkerColor $20;
ID = "Jobs";              
array cutpts{6} _temporary_(-100 -10 -5 5 10 100);
array colors{5} $20 _temporary_ 
      ("DarkRed" "DarkOrange" "MediumGray" "GrayishGreen" "DarkGreen");
drop i v;
do i = 1 to dim(cutpts)-1;
   v = (cutpts[i+1] + cutpts[i]) / 2; /* midpoint of interval */
   Value = putn(v,"Gain.");           /* formatted value for this interval */
   MarkerColor = colors[i];           /* color for this interval */
   output;
end;
run;

To use an attribute map, you have to specify two pieces of information: the data set (Attrs) and the ID variable (Jobs). You specify the data set by using the DATTRMAP= option on the PROC SGPLOT statement. You specify the map by using the ATTRID= option on the SCATTER statement, as follows:

proc sgplot data=Jobs DATTRMAP=Attrs; 
scatter x=PctFemale1980 y=PctFemale2010 / group=Diff ATTRID=Jobs; 
run;

Attribute maps are a useful feature of the SG procedures in SAS 9.3. Most of the time, the default colors that are defined in ODS styles are sufficient for visualizing grouped data. However, you can use attribute maps when you need to assign specific colors to categories. As I've shown in this article, you can also assign a color to a category that is defined by a format. In either case, an attribute map can make it easier for your audience to understand the data.

tags: Data Analysis, SAS Programming, Statistical Graphics
October 15, 2012
 
New York Times graphic

The New York Times has an excellent staff that produces visually interesting graphics for the general public. However, because their graphs need to be understood by all Times readers, the staff sometimes creates a complicated infographic when a simpler statistical graph would show the data in a clearer manner.

A recent graphic was discussed by Kaiser Fung in his article, "When Simple Is too Simple." Kaiser argued that the Times made a poor choice of colors in a graphic (shown at right) that depicts the proportion of women in certain jobs and how that proportion has changed between 1980 and 2010.

I agree with Kaiser that the colors should change, and I will discuss colors in a subsequent blog post. However, I think that the graph itself suffers from some design problems. First, it is difficult to see overall trends and relationships in the data. Second, in order to understand the points on the left side of the graph, you have to follow a line to the right side of the graph in order to find the label. Third, the graph is too tall. At a scale in which you can read the labels, only half the graph appears on my computer monitor. If printed on a standard piece of paper, the labels would be quite small.

You can overcome these problems by redesigning the graph as a scatter plot. I presume that the Times staff rejected the scatter plot design because they felt it would not be easily interpretable by a general Times reader. Another problem, as we will see, is that a scatter plot requires shorter labels than the ones that are used in the Times graphic.

A scatter plot of the same data is shown in the next figure. (Click to enlarge.) The graph shows the proportion of women in 35 job categories. The horizontal axis shows the proportion in 1980 and the vertical axis shows the proportion in 2010. Jobs such as secretary, hygienist, nurse, and housekeeper are primarily held by women (both in 1980 and today) and appear in the upper right. Jobs such as auto mechanic, electrician, pilot, and welder are primarily held by men (both in 1980 and today) and appear in the lower left. Jobs shown in the middle of the graph (bus driver, reporter, and real estate agents) are held equally by men and women.

The diagonal line shows the 1980 baseline. Points that are displayed above that line are jobs for which the proportion of women has increased between 1980 and 2010. This graph clearly shows that the proportion of women in most jobs has increased or stayed the same since 1980. (I classify a deviation of a few percentage points as "staying the same," because this is the margin of error for most surveys.) Only the job of "welfare aide worker" has seen a substantial decline in the proportion of women.
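
The baseline itself is just the identity line y = x. Here is a sketch of how it might be drawn (using the variable names from the attached program, and omitting the colors, which are discussed below):

proc sgplot data=Jobs;
scatter x=PctFemale1980 y=PctFemale2010;
lineparm x=0 y=0 slope=1;   /* 1980 baseline: y = x */
run;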

The markers are colored on a red-green scale according to the gains made by women. Large gains (more than 10%) are colored dark green. Lesser gains (between 5% and 10%) are colored a lighter green. Similarly, jobs for which the proportion of women declined are shown in red. Small changes are colored gray.

This graph shows the trend more clearly than the Times graphic. You can see at a glance that women have made strides in workplace equity across many fields. In many high-paying jobs such as doctor, lawyer, dentist, and various managerial positions, women have made double-digit gains.

The shortcomings of the graph include using shorter labels for the job categories, and losing some of the ability to know the exact percentages in each category. For example, what is the proportion of females that work as a welfare aide in 2010? Is it 68%, 67%, or 66%? This static graph does not give the answer, although if the graph is intended for an online display you can create tooltips for each point.

For those readers who are interested in the details, you can download the SAS program that creates this plot. The program uses three noteworthy techniques:
  1. A user-defined SAS format is used to display the differences in proportions into a categorical variable with levels "Large Gains," "Gains," "No Change," and so on.
  2. An attribute map is used to map each category to a specified shade of red, gray, and green. This feature was introduced in SAS 9.3. My next blog post will discuss this step in more detail.
  3. The DATALABEL= option is used to specify a variable that should be used to label each point. The labels are arranged automatically. The SAS research and development staff spent a lot of time researching algorithms for the automatic placement of labels, and for this example the default algorithm does an excellent job of placing each label near its marker while avoiding overlap between labels.

What do you think? Which graphic would you prefer to use to examine the gains made by women in the workplace? Do you think that the New York Times readers are sophisticated enough to read a scatter plot, or should scatter plots be reserved for scientific communication?

tags: Data Analysis, Statistical Graphics
September 24, 2012
 

Sometimes it is useful to group observations based on the values of some variable. Common schemes for grouping include binning and using quantiles.

In the binning approach, a variable is divided into k equal intervals, called bins, and each observation is assigned to a bin. In this scheme, the size of the groups is proportional to the density of the grouping variable. In the SAS/IML language, you can use the BIN function to assign observations to bins.
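
For example, the following sketch assigns five values to five equal-width bins on [0, 5]:

proc iml;
x = {1.2, 3.5, 0.4, 2.8, 4.9};
b = bin(x, do(0, 5, 1));   /* cut points 0,1,...,5 define five bins */
print x b;                 /* b contains the bin number (1-5) for each value */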

However, sometimes it is useful to have approximately the same number of observations in each group. In this case, you can group the observations into k quantiles. One way to do this is to use PROC RANK and to specify the GROUPS= option, as follows:

%let NumGroups = 4;
proc rank data=Sashelp.class out=class groups=&NumGroups ties=high;
  var Height;   /* variable on which to group */
  ranks Group;  /* name of variable to contain groups 0,1,...,k-1 */
run;

A scatter plot of the data (click to enlarge) is shown with markers colored by group membership. There are 19 observations in this data set. The RANK procedure puts four observations into the first group and five observations into each of the next three groups, which is an excellent distribution of the 19 observations into four groups. (In Group 4, there are two markers with the coordinates (15, 66.5).) The TIES= option specifies how to handle tied values. I specified TIES=HIGH because that option is consistent with the definition of a quantile for a discrete distribution.
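
The following sketch shows the kind of display described in the previous paragraph (the actual plot might use different options):

proc sgplot data=class;
  title "Quartiles of Height in Sashelp.Class";
  scatter x=Age y=Height / group=Group markerattrs=(symbol=CircleFilled size=12);
run;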

In the SAS/IML language, you can use the QNTL subroutine and the BIN function to construct a function that groups observations by quantiles, as follows:

proc iml;
/* Assign group to each observation based on quantiles. 
   For example, g=GroupRanks(x,4) assigns each observation 
   to a quartile. Assume x is a column vector and k is less
   than the number of unique values of x */
start GroupRanks(x, k);
   if k<2 then return(j(nrow(x), 1, 1));
   dx = 1/k;
   p = do(dx, 1-dx/2, dx);  /* 4 groups ==> p={0.25 0.50 0.75} */
   call qntl(q, x, p);
   Group = bin(x, .M // q // .I);
   return(Group);
finish;
 
/* run example */
use Sashelp.Class;  read all var {Height};  close;
Group = GroupRanks(Height, 4);

The call to the GroupRanks function results in the same assignment of observations to groups as PROC RANK, except that the GroupRanks function uses the identifiers 1,2,...,k. The GroupRanks function uses two tricks:

  • The QNTL function finds k–1 "cut points" that divide the values of x into k groups, with about the same number of observations in each group.
  • The special missing values .M and .I correspond to "minus infinity" and "plus infinity," respectively. These special values are appended to the other cut points and supplied as parameters to the BIN function. The BIN function returns a vector with the values 1, 2, ...,k that classifies each observation into a bin.
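
As a quick sanity check, you can tabulate how many observations fall into each group. The following sketch continues the PROC IML session above (the TABULATE subroutine counts the observations for each distinct group identifier):

call tabulate(levels, freq, Group);   /* distinct group IDs and their counts */
print (levels // freq)[rowname={"Group" "Count"}];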

There are several statistical uses for creating k groups with roughly the same number of observations in each group. One is when assigning colors to a choropleth map or a heat map. Rather than chop the range of the response variable into k equal intervals, it is often better to choose 5 or 7 quantiles and color the regions by quantiles of the response. This technique is especially valuable when the response variable has a long-tailed distribution.

tags: Data Analysis, Statistical Programming
September 19, 2012
 

With the US presidential election looming, all eyes are on the Electoral College. In the presidential election, each state gets as many votes in the Electoral College as it has representatives in both congressional houses. (The District of Columbia also gets three electors.) Because every state has two senators, it is the number of representatives in the House that determines how many electors are awarded to each state.

The 2012 election is the first presidential election since the 2010 census and subsequent reapportionment of representatives. Usually the apportionment of the Electoral College is represented as a static choropleth map that shows the number of electors by state at a single point in time. Recently, however, the US Census Bureau's Data Visualization Gallery published a visualization of "Ranks of States by Congressional Representation" over time. This graphic is reproduced below (click to enlarge):

The graph shows the rank of the top seven most populous states after each US census. You can see a few interesting facts such as:

  • New York and Pennsylvania have always been among the most populous states.
  • California, which became a state in 1850, popped onto the list in 1930 and rapidly became the most populous state.
  • Texas, which became a state in 1845, entered the list in 1890 and is now the second most populous state.
  • Florida, which became a state in 1845, broke into the top seven in 1980 and currently has the same number of representatives as New York.
  • Virginia and Massachusetts were once among the populous states, but have since dropped off the list.

The reason that the US Census Bureau used ranks on the vertical scale is that the number of representatives has changed over the years: from 106 in 1790, to 243 at the time of the Civil War, to the modern value of 435, which was adopted in 1960. It occurred to me that it might be more interesting to see the percentage of the representatives that were apportioned to each state throughout US history. I also thought the graph might look better if states don't "pop on" and "pop off" the way that Tennessee and Michigan do. I would prefer to see a state's entire history of representation.

With that goal in mind, I downloaded the data from the Census Bureau in CSV format. Instead of computing the ranks of the states for each decade, I grouped the states into quintiles according to their current number of representatives. Thus there are five groups, where each group contains 10 states. The first quintile contains the least populous states (Wyoming, Vermont, and so forth) and the fifth quintile contains the most populous states (California, New York, and so forth). The time series plot of the most populous states follows. The label for Ohio is partially hidden under the label for Georgia.
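
For the curious, the quintile assignment can be computed with PROC RANK, much as in my earlier post on grouping observations by quantiles. The following sketch uses hypothetical names: a data set StateReps with one observation per state and the current number of representatives in a variable Reps:

proc rank data=StateReps out=StateQuintiles groups=5 ties=high;
  var Reps;          /* current number of representatives for each state */
  ranks Quintile;    /* quintile identifiers 0, 1, ..., 4 */
run;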

You can download the complete SAS program that manipulates, rearranges, and plots the data. The program produces plots for all 50 states. I tried different approaches, such as assigning states to quintiles according to the historical average of the representation percentage, but the graph that I show is the one that compares best with the US Census graph. The graph is different from a graph that shows the population of each state over time, because this graph shows relative growth rather than absolute growth.

tags: Data Analysis, Statistical Graphics
September 10, 2012
 

Robert Allison posted a map that shows the average commute times for major US cities, along with the proportion of the commute that is attributed to traffic jams and other congestion. The data are from a CEOs for Cities report (Driven Apart, 2010, p. 45).

Robert used SAS/GRAPH software to improve upon a somewhat confusing graph in the report. I can't resist creating two other visualizations of the same data. Like Robert, I have included the complete SAS program that creates the graphs in this post.

First graph: A tall bar chart

The following bar chart (click to enlarge) shows the mean commute time for the cities (in hours per year), sorted by commute time. You can see that the commute times in Chicago and New Orleans are short, on average, whereas commutes in cities like Nashville and Oklahoma City (and my own Raleigh!) are longer. The median of the 45 cities is about 200 hours per year, and is represented by a vertical line.

Overlaid on the graph is a second bar chart that shows the proportion of the commute that is attributed to congestion. In some sense, this is the "wasted" portion of the commute. The median wasted time is 39 hours per year. You can identify cities with low congestion (Buffalo, Kansas City) as well as those with high congestion (Los Angeles, Washington, DC).

The plot was created by using the SGPLOT procedure. The ODS GRAPHICS statement is used to set the dimensions of the plot so that all of the vertical labels show, and the vertical axis is offset to make room for the labels for the reference lines. Thanks to Sanjay Matange who told me about the NOSCALE option, which enables you to change the size of a graph without changing the size of the fonts.

ods graphics / height=9in width=5in noscale;    /* make a tall graph */
title "Mean Commute Time for Major US Cities";
footnote "Data Source: 2010 CEOsForCities.org";
 
proc sgplot data=traffic;
hbar City / response=total_hours CategoryOrder=RespAsc transparency=0.2 
           name="total" legendlabel="Total";
hbar City / response=hours_of_delay
           name="delay" legendlabel="Due to Congestion";
keylegend "delay" "total";
xaxis grid label="Cumulative Commute Time (hours per year)";
yaxis ValueAttrs=(size=8) offsetmax=0.05; /* make room for labels */
refline 39 199 / axis=x label=("Median Delay" "Median Commute") 
           lineattrs=(color=black) labelloc=inside;
run;

Second graph: A scatter plot

Although the previous graph is probably the one that I would choose for a report that is intended for general readers, you can use a scatter plot to create a more statistical presentation of the same data.

In the bar chart, it is hard to compare cities. For example, I might want to compare the average commute and congestion in multiple cities. Because the cities might be far apart on the bar chart, it is difficult to compare them. An alternative visualization is to create a scatter plot of the commute time versus the time spent in traffic for the 45 cities. In the scatter plot, you can see that Nashville and Oklahoma City have moderate congestion, even though their mean commute times are large (presumably due to long commutes). In contrast, the congestion in cities such as Los Angeles, Washington, and Atlanta is evident.

The plot was created by using the SGPLOT procedure. The DATALABEL= option is used to label each marker by the name of the city.

ods graphics / reset; /* reset graphs to default size */
title "Commute Time versus Time Spent in Congestion";
 
proc sgplot data=traffic;
scatter x=total_hours y=hours_of_delay / 
        MarkerAttrs=(size=12 symbol=CircleFilled) transparency=0.3 
        datalabel=City datalabelattrs=(size=8);
xaxis grid label="Cumulative Commute Time (hours per year)"
        values=(125 to 300 by 25);
yaxis grid label="Time Lost to Congestion (hours per year)";
run;

For me, the surprising aspect of this data (and Rob's graph) is that the average commute times are more similar than I expected. Another interesting aspect is that my notion of "cities with bad commutes" seems to be based on congestion (the cities that are near the top of the scatter plot) rather than on long commutes (the cities to the right of the scatter plot).

What aspects of these data do you find interesting?

tags: Data Analysis, Statistical Graphics
August 22, 2012
 

The other day I was using PROC SGPLOT to create a box plot and I ran a program that was similar to the following:

proc sgplot data=sashelp.cars;
title "Box Plot: Category = Origin";
vbox Horsepower / category=origin;
run;

An hour or so later I had a need for another box plot. Instead of copying the previous statements, I retyped the PROC SGPLOT code. However, I wasn't paying attention and I typed GROUP= instead of CATEGORY= as the option in the VBOX statement:

proc sgplot data=sashelp.cars ;
title "Box Plot: Group = Origin";
vbox Horsepower / group=Origin;
run;

When I saw the second graph, I noticed that it is more colorful and also has a legend instead of an axis with tick marks and labels. I wondered, "What is the difference between the CATEGORY= option and the GROUP= option?" I started making a set of notes, which I share in this article.

The CATEGORY= option defines a categorical variable

The CATEGORY= syntax defines the discrete variable for the plot. As such, the values of the categorical variable appear as tick marks on an axis. All graphical elements have the same graphical styles, such as color, line pattern, marker shapes, and so forth.

The values of the categorical variable appear in alphabetical or numerical order, although some graphs support options for sorting the categories. For example, to order the categories in a bar chart in ascending or descending order, use the CATEGORYORDER= option in the VBAR statement.

For the box plot, specifying a categorical variable is optional and therefore the CATEGORY= option is specified after the slash (/). For most other plots (for example, the bar chart), the categorical variable is NOT optional, so the variable is specified before the slash.
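
For example, the following minimal sketch creates a bar chart in which the categorical variable appears before the slash and the bars are sorted by using the CATEGORYORDER= option that was mentioned earlier:

proc sgplot data=sashelp.cars;
title "Bar Chart: Categories Sorted by Frequency";
vbar Origin / CategoryOrder=RespDesc;   /* sort bars by descending frequency */
run;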

As of SAS 9.3, the CATEGORY= variable can be numeric, which means that you can create box plots that are arranged on a continuous axis, as follows:

proc sgplot data=sashelp.cars;
title "Box Plot: Category = Cylinders, Linear Scale";
vbox horsepower / category=cylinders; /* SAS 9.3 example */
xaxis type=linear;
run;

In this graph, the XAXIS statement is used to specify that the number of cylinders should not be treated as a discrete (nominal) variable, but should be spaced according to their values on a continuous scale. (Notice the presence of vehicles with three and five cylinders in the data.) This can be a useful feature. For example, I use it to visualize the performance of algorithms. The X axis might be the number of variables in an analysis, and the box plot might represent the distribution of times for 10 runs of the algorithm.

The syntax for this "continuous variable box plot" is a bit contradictory: the CATEGORY= option specifies the X variable (which, from the syntax, you expect to be categorical!) and the TYPE=LINEAR option specifies that the X variable is continuous. However, this is a powerful syntax. It is very useful for clinical trials data in which you plot the distribution of the response variable versus time for patients in experimental and control groups.

The GROUP= option defines an auxiliary variable

The GROUP= option defines an auxiliary classification variable. I like to think of the GROUP= variable as defining an overlay of various "mini plots." In most cases, you get k mini plots for every one that you have without the GROUP= option, where k is the number of levels in the grouping variable. For example, in the line plot you get k overlaid lines, one for each group.

The main difference between the CATEGORY= and GROUP= options is that the GROUP= option results in graphical elements that have varying attributes. By default, each unique value of the grouping variable is drawn in a separate style element, GraphData1 through GraphDatak. The association between graphical styles and groups is shown in a legend.

The SGPLOT procedure supports many options that control the appearance of the grouped data. You can use the GROUPDISPLAY= option to specify that the grouped elements be clustered, overlaid, or (for bar charts) stacked. You can use the GROUPORDER= option to specify how you want the group elements to be ordered.
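
For example, the following sketch displays the grouped bars in side-by-side clusters:

proc sgplot data=sashelp.cars;
where Type in ('SUV' 'Truck' 'Sedan');
title "Bar Chart: Group = Type, Clustered";
vbar Origin / group=Type groupdisplay=cluster grouporder=ascending;
run;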

Combining the two options

You can combine the two options to visualize a group variable nested within a categorical variable. The following statements create a graph that contains box plots for several types of vehicles, nested within the Origin variable:

proc sgplot data=sashelp.cars;
where Type in ('SUV' 'Truck' 'Sedan');
title "Box Plot: Category = Origin, Group = Type";
vbox horsepower / category=Origin Group=Type;
run;

The plot shows that the data set does not contain any trucks that are made in Europe. It also shows that sedans tend to have lower horsepower than SUVs, when you account for the Origin variable.

In summary, the VBOX (and HBOX) statements in the SGPLOT procedure support several options that arrange the boxes. The CATEGORY= option defines the variable to use for the X axis, whereas the GROUP= option defines an auxiliary discrete variable whose values and graphical attributes are displayed in a legend. You can use the options to visualize the distribution of one response variable with respect to one or two other variables.

tags: Data Analysis, Statistical Graphics