data analysis

3月 212013
 
A big part of  "winning" these days (be it sports or a business) is performing analytics better than your competition.  This is demonstrated in awe-inspiring fashion in the book (and movie) "Moneyball."  And on that topic, I'd like to show you a few ways SAS can be used to analyze sports data [...]
3月 202013
 

Someone recently asked a question on the SAS Support Communities about estimating parameters in ridge regression. I answered the question by pointing to a matrix formula in the SAS documentation. One of the advantages of the SAS/IML language is that you can implement matrix formulas in a natural way. The SAS/IML expressions can often be written almost verbatim from the formula. This article uses a ridge regression formula from the PROC REG documentation to illustrate this feature.

When to use ridge regression?

Ridge regression is a variant to least squares regression that is sometimes used when several explanatory variables are highly correlated. The "usual" ordinary least squares (OLS) regression produces unbiased estimates for the regression coefficients (in fact, the Best Linear Unbiased Estimates). However, when the explanatory variables are correlated, the OLS parameter estimates have large variance. It might be desirable to use a different regression technique, such as ridge regression, in order to obtain parameter estimates that have smaller variance. The trade-off is that the estimates for the ridge regression method are biased.

If X is the centered and scaled data matrix, then the crossproduct matrix X`X is nearly singular when columns of X are highly correlated. Mathematically, ridge regression adds a multiple (called the ridge parameter, k) of the identity matrix to the X`X matrix. The nonsingular matrix (X`X + kI) is then used to compute the parameter estimates.

Each value of k results in a different set of parameter estimates. There have been many papers written about how to choose the best value of k. In this article, I merely want to show how to compute the parameters for ridge regression when the ridge parameter is given.

How to compute ridge regression in SAS

In SAS software, you can compute ridge regression by using the REG procedure. The following example from the PROC REG documentation is used to illustrate ridge regression. The RIDGE= option specifies the value(s) of the ridge parameter, k. The OUTEST= option is used to create an output data set that contains the parameter estimates for each value of k.

/* Ridge regression example from PROC REG documentation */
data acetyl;
input x1-x4 @@;
x1x2 = x1 * x2;
x1x1 = x1 * x1;
datalines;
1300  7.5 0.012 49   1300  9   0.012  50.2 1300 11 0.0115 50.5
1300 13.5 0.013 48.5 1300 17   0.0135 47.5 1300 23 0.012  44.5
1200  5.3 0.04  28   1200  7.5 0.038  31.5 1200 11 0.032  34.5
1200 13.5 0.026 35   1200 17   0.034  38   1200 23 0.041  38.5
1100  5.3 0.084 15   1100  7.5 0.098  17   1100 11 0.092  20.5
1100 17   0.086 29.5
;
 
proc reg data=acetyl outest=b ridge=0.02 noprint;
   model x4=x1 x2 x3 x1x2 x1x1;
run;
 
proc print data=b(where=(_TYPE_="RIDGE")) noobs;
   var _RIDGE_ Intercept x1 x2 x3 x1x2 x1x1;
run;

The parameter estimates for the ridge regression are shown for the ridge parameter k = 0.02.

Implementing a matrix formula for ridge regression by using SAS/IML software

The question that was asked on the SAS Discussion Forum was about where to find the matrix formula for estimating the ridge regression coefficients. The documentation for the REG procedure includes a section that provides the formula that PROC REG uses to compute the ridge regression coefficients. The accompanying text says:

Let X be the matrix of the independent variables after centering [and scaling] the data, and let Y be a vector corresponding to the [centered] dependent variable. Let D be a diagonal matrix with diagonal elements as in X`X. The ridge regression estimate corresponding to the ridge constant k can be computed as D-1/2(Z`Z + kI)-1Z`Y.

The terms in brackets do not appear in the original documentation, but I included them for clarity. Since this is a matrix formula, let's use the SAS/IML language to implement the formula. The following SAS/IML program uses the formula to compute the ridged parameter estimates:

/* Ridge regression coefficients computed in SAS/IML */
proc iml;
use acetyl;        
read ALL var {x1 x2 x3 x1x2 x1x1} into X; 
read all var {x4} into y;
close acetyl;        
 
/* ridge regression */
xBar = mean(X);       /* save row vector of means */
yBar = mean(y);       /* save mean of Y */
X = X - xBar;         /* center X and y; do not add intercept */
y = y - yBar;                      
D = vecdiag(X`*X);
Z = X / sqrt(D`);     /* Z`Z = corr(X) */
k = 0.02;             /* choose a ridge parameter */
b =  (1/sqrt(D)) # inv(Z`*Z + k*I(ncol(X))) * (Z`*y); /* formula in REG doc */
print (b`)[colname={"x1" "x2" "x3" "x1x2" "x1x1"}];

So that the formula and the SAS/IML statements match each other, I have written the computation in a "natural" way, rather than in a more efficient way. See the article "Do you really need to compute that matrix inverse?" In any case, you can see that the parameter estimates that are compute from the formula agree with the estimates that are computed by PROC REG.

However, notice that the PROC IML computations do not include an intercept term. This is because the independent and dependent variables were all centered, so the intercept in these coordinates is exactly zero. To obtain the intercept in the original (uncentered) coordinates, you can use a simple formula that recovers the intercept estimate:

   /* The ridge regression uses centered x and y. Recover the intercept:
      y-yBar = b1(x1-x1Bar) + ... + bn(xn-xnBar)
   so      y = yBar-Sum(bi*xibar) + b1 x1 + ... + bn xn
   Consequently, b0 = yBar-Sum(bi*xibar). */
   b0 = ybar - xbar * b;
   print b0[label="Intercept"];

Got a matrix formula? Use SAS/IML software

As this article shows, the SAS/IML language enables you to implement matrix formulas in a natural way. Although this article implements a formula that is already built into a SAS procedure, the same ideas apply for formulas and algorithms that are not available in any SAS procedure.

So the next time that you need to implement a matrix formula that appears in a textbook, in documentation, or in a journal article, reach for the SAS/IML language!

tags: Data Analysis, Matrix Computations, Statistical Programming
3月 132013
 

Argh! I've just spilled coffee on output that shows the least squares coefficients for a regression model that I was investigating. Now the parameter estimate for the intercept is completely obscured, although I can still see the parameter estimates for the coefficients of the continuous explanatory variable. What can I do? Do I need to rerun the regression in order to recover the estimate for the intercept term?

No, it turns out that I do not. If I know (or can compute) the mean values of the response variable and the explanatory variables, then that is enough information to recreate the intercept estimate.

A formula for the intercept of a regression model

In high school or college, you might have learned a simple formula for computing the intercept for a one-variable regression model. If my is the mean of the response variable, mx is the mean of the explanatory variable, and b is the slope of the least square regression line, then the intercept is computed as b0 = myb mx.

For multivariate regression with several continuous variables, the formula is similar: b0 = my − Σibi mi, where mi is the mean of the ith explanatory variable. In vector notation, if b is the vector of coefficients for the explanatory variables and mx is the corresponding vector of means, then you can use the inner product operator to compute the intercept: b0 = myb*mx.

Math saves the day

Let's see how this works on my coffee-stained example. The original analysis was created by using this regression model in PROC REG:

proc reg data=sashelp.cars;
   model Mpg_city = Weight Cylinders EngineSize;
   ods select ParameterEstimates;
quit;

Rather than rerun the regression, I'll use the SAS/IML language to compute the means of the relevant variables, and use the inner product formula to obtain the estimate for the intercept term. In practice, you might also need to take care of two additional details:

  • The REG procedure excludes observations for which any explanatory variable is missing. If your data contain missing values, remove the rows of your data that have missing values. One way to do it is to use the COUNTMISS function in the SAS/IML language to exclude the rows with missing values.
  • The ParameterEstimates table displays the parameter estimates in a format that aligns the decimal points, but does not necessarily represent the full precision of the coefficients. Because I want the following PROC IML statements to give exactly the same estimate for the intercept term as reported by PROC REG, I used additional digits of precision for the coefficient of the WEIGHT variable.

proc iml;
use sashelp.cars;
read all var {Weight Cylinders EngineSize} into X;  /* read explanatory vars */
read all var {Mpg_city} into Y;                     /* read response var */
close sashelp.cars;
 
/* include only complete cases (exclude missing) */
idx = loc(countmiss(X, "row")=0);
X = X[idx, ];
Y = Y[idx, ];
 
xBar = mean(X);                         /* row vector of means for X */
yBar = mean(Y);                         /* mean of Y */
b = {-.003165747, -0.55569, -0.93885};  /* regression coefficients for X */
b0 = yBar - xBar * b;                   /* recover intercept term */
print b0;

The output shows that the intercept term for this multivariable regression is about 37.6. If you run the REG procedure, you obtain the same estimate.

Why should you care?

Okay, you might have guessed by now that I didn't really spill coffee on a piece of paper, so why did I go through this exercise? It is faster and easier to simply rerun PROC REG. Why go to the trouble of computing the interept estimate in terms of the means of the variables?

The reason is that some regression problems are easier to solve if you not only center the explanatory variables but also center the response variable. The centering operation results in a solution that passes through (0,0). In other words, the y-intercept term is zero. However, by using the trick in this article, you can "recover" the intercept term for the uncentered data by using the simple linear operation that I've described here. In a future article, I'll apply this trick to solving a problem that uses ridge regression. That article won't feature any spilled coffee, but the trick will be useful anyway.

tags: Data Analysis, Statistical Programming, Tips and Techniques
3月 062013
 

There is something for everyone at SAS Global Forum 2013. I like to attend presentations in the Statistics and Data Analysis track and talk with SAS customers in the SAS Support and Demo Area. But one activity that I enjoy the most is to stroll through the poster area and to chat informally with the poster presenters. In contrast to the formal lectures and presentations, posters promote a two-way dialog between the presenter and the listener. If a poster interests me, I might stay and talk for 20 minutes. If a poster does not interest me, I can move on to the next poster.

This year, several posters feature SAS/IML computations, so I am keen to see those and to talk with the presenters. This year the presenters will be at their posters on Monday, April 29, 2013, 2:00 pm–3:30 pm. Here are a few posters whose abstracts mention the SAS/IML language:

  • A SAS Macro for Generating a Set of All Possible Samples with Unequal Probabilities without Replacement
  • Optimize SAS/IML Software Codes for Big Data Simulation
  • A Flexible Method to Apply Multiple Imputation Using SAS/IML Studio
  • Analyzing Partially Confounded Factorial Conjoint Choice Experiments Using SAS/IML
  • What Score Should Johnny Get? Missing_Items SAS Macro for Analyzing Missing Item Responses on Summative Scales
  • Utilizing SAS for the Construction of Preassembled Parallel, Computerized Fixed-Test Forms under Item Response Theory Framework

The poster on how to "Optimize SAS/IML [for] Simulation" interests me because I am currently putting the finishing touches on my newest book, Simulating Data with SAS. There will be an advanced copy of my book at the SAS Press booth, in case you want to browse it.

Of course, I am also excited to be giving a free Hands-On Workshop, "Getting Started with the SAS/IML Language," on Tuesday, April 30, 2013, at 1:30 pm–3:10 pm (Room 2024 in the Moscone West Convention Center).

I am also interested in the paper "Using SAS to Measure Airport Connectivity: An Analysis of Airport Centrality in the US Network with SAS/IML Studio," which will be presented on Wednesday, May 1, 2013, at 9:00 am–9:50 am. Because SAS/IML Studio supports dynamically linked statistical graphs, I think it will be an interesting talk and demo. Also, I've done an analysis of US airports using SAS/IML Studio, so it will be fun to see a related analysis.

What talks are you looking forward to at SAS Global Forum 2013? I hope to see you there!

tags: Data Analysis
2月 272013
 

A SAS user asked an interesting question on the SAS/GRAPH and ODS Graphics Support Forum. The question is: Does PROC SGPLOT support a way to display the slope of the regression line that is computed by the REG statement? Recall that the REG statement in PROC SGPLOT fits and displays a line through points in a scatter plot.

In SAS 9.3, you cannot obtain this information directly from PROC SGPLOT. Instead, you need to use PROC REG to compute this information. You can use the following steps to create a plot that displays the parameter estimates:

  1. Use PROC REG to compute the parameter estimates (slope and intercept). Save this information to a SAS data set.
  2. Use a DATA step to create macro variables that contain the parameter estimates.
  3. Use the INSET statement in PROC SGPLOT to add this information to the fitted scatter plot\.

Step 1: Save the parameter estimates

You can use the OUTEST= option or the ODS OUPUT statements to save the parameter estimates to a SAS data set. In the following example, the ODS OUTPUT statement saves the ParameterEstimates table to the PE data set:

ods graphics off;
proc reg data=sashelp.class;
   model weight = height;
   ods output ParameterEstimates=PE;
run;

Step 2: Create macro variables

In the PE data set, the ESTIMATE variable contains the parameter estimates. The first row contains the estimate for the intercept term; the second row contains the estimate for the slope. The following DATA step saves these into macro variables:

data _null_;
   set PE;
   if _n_ = 1 then call symput('Int', put(estimate, BEST6.));    
   else            call symput('Slope', put(estimate, BEST6.));  
run;

Step 3: Use the INSET Statement to display the parameter estimates

You can now create the plot by using PROC SGPLOT. Use the INSET statement to display the parameter estimates in a text box:

proc sgplot data=sashelp.class noautolegend;
   title "Regression Line with Slope and Intercept";
   reg y=weight x=height;
   inset "Intercept = &Int" "Slope = &Slope" / 
         border title="Parameter Estimates" position=topleft;
run;

Of course, you can use a similar strategy to display any other relevant statistics on the scatter plot. There is an example in the SAS/STAT User's Guide that shows other fit statistics, as well as how to use Greek letters and superscripts in the inset text. You can also display a formula that shows the equation of a displayed line.

tags: Data Analysis, Statistical Graphics
2月 062013
 

Last week the SAS Training Post blog posted a short article on an easy way to find variables in common to two data sets. The article used PROC CONTENTS (with the SHORT option) to print out the names of variables in SAS data sets so that you can visually determine whether the data sets have any variables in common. The article also mentioned using the COMPARE procedure or writing a PROC SQL query that interrogates DICTIONARY tables.

But what if you want to find variable names that are common to many data sets?

The PROC SQL approach is a programming solution, so it might be up to the challenge. A quick internet search reveals one way to use PROC SQL to find common variables in two data sets (see p. 4 of the linked paper). I am not a PROC SQL expert, but the approach in that paper seems difficult to generalize to the case of multiple data sets.

Because I like the SAS/IML language, this article shows how to find all variables that are common to multiple data sets. The following statements define six SAS data sets:

data D1 D2;
A=1; b=2; C=3; D=4; E=5; F=6; g=7; h=8; I=9; J=10;
run;
data D3 D4;
j=1; f=2; h=3; a=4; N=7; L=6; c=7;
run;
data D5;
J=1; D=2; A=3; g=4; h=5; P=6; q=7;
run;
data D6;
C=1; M=2; F=3; a=4; j=5; H=6; B=7; R=8; K=9;
run;

I would have a hard time visually determining which variables are common to all of the data sets, so I'm going to write a program. I will use two SAS/IML functions to help:

  • The CONTENTS function returns a sorted list of the variables in a SAS data set. Use the UPCASE function in Base SAS to get the names in uppercase format so that you can perform case-insensitive comparisons.
  • The XSECT function returns the intersection between two or more arrays of values.

With those two functions, you can obtain the variables names that are common to the data sets D1–D6, as follows:

proc iml;
DSNames = "D1":"D6";
InCommon = upcase(contents(DSNames[1]));     /* get all vars in D1      */
do i = 2 to ncol(DSNames);                   /* loop over data sets     */
   varNames = upcase(contents(DSNames[i]));  /* get variable names      */
   InCommon = xsect(InCommon, varNames);     /* intersect with previous */
end;
print InCommon;

The variables that are common to all the SAS data sets are A, H, and J. If you want to generalize the problem even more, you can use the SAS/IML DATASETS function to get the names of all data sets in a library. For example, you could use DSNames = T(datasets("work")) instead of hard-coding the data set names in this example.

I invite you to submit your own solution in the comments.

tags: Data Analysis, Statistical Programming, Tips and Techniques
1月 142013
 

When a categorical variable has dozens or hundreds of categories, it is often impractical and undesirable to create a bar chart that shows the counts for all categories. Two alternatives are popular:

  • Display only the Top 10 or Top 20 categories. As I showed last week, to do this in SAS you can use PROC FREQ to tabulate and order the data, and then use the OBS= data set option to truncate the data. See the article "Create a bar chart with only a few categories" for details and SAS code.
  • Display the Top 10 or Top 20 categories, but also aggregate the counts of the remaining categories to create a new "Others" category. This merging of small categories is done by default in some software, such as in SAS/IML Studio bar charts, and is very useful at the exploratory and model-building phases of data analysis. This bar chart was also available in the now-defunct SAS/INSIGHT product.
This article describes how to create an "Others" category in PROC SGPLOT. It assumes that the categorical variable is a character variable. If your variable is numeric, a short DATA step that uses the PUT function will create a character variable.

Preliminary step: Count and order the categories

As described in the previous article, you can use PROC FREQ to tabulate the counts of observations. The following statements count the number of different vehicles for car manufactures in the Sashelp.Cars data set. The FreqOut data set is sorted in descending order by the Count variable.

/* Combine small categories into an "Other" category; sum the counts */
%let VName = Make;                 /* character categorical variable */
 
proc freq data=sashelp.cars ORDER=FREQ noprint;
tables &VName / out=FreqOut;
run;

Main step: Create the "Others" category by aggregating small categories

In order to create the new category, just sum the counts of the small categories. For example, if you intend to show the 20 largest categories, sum the counts of categories 21, 22, 23, and so forth. There are several ways to do this. For me, the SAS/IML language is to most compact:

/* aggregate counts; append new level "Others" to truncated categorical variable */
%let NumCats = 20; /* number of categories in bar chart */
proc iml;
varNames = {"&VName" "Count"};
/* read categories and counts */
use FreqOut; read all var varNames; close FreqOut;
 
/* sum up the counts for the smaller categories */
OtherCount = sum(Count[&NumCats+1:nrow(Count)]);
&VName = &VName[1:&NumCats] // "Others";  /* truncate and append "Others" */
Count  = Count[1:&NumCats] // OtherCount; /* truncate and append sum of small counts */
 
/* write new data set with the Top categories and "Others" */
create Other var varNames; append; close Other;

The vertical concatenation operator (//) was used to append the "Others" category and counts to the Top 20. You can now create the bar chart by calling the SGPLOT procedure, as follows:

/* create bar chart that includes the "Others" category */
proc sgplot data=Other;
  title "Top &NumCats Vehicle Manufacturers and 'Others' Category";
  hbar &VName / freq=Count;
  yaxis discreteorder=data;
  xaxis label = "Count" grid;
run;

The graph shows that Toyota, Chevrolet, and other companies produce many different models, but that collectively the companies that manufacture 10 or fewer models produce more than 100 models. Advantages of the "Others" category include:

  • You can see the relative size ("market share") of the Top 20 categories.
  • You can plot percentages instead of counts, and the percentages will be correctly interpretable as a percentage of the whole.
  • The graph shows all the data, so it is more honest than a graph that selectively shows only a portion of the data.

Other ways to compute the "Others" category

You could argue that using SAS/IML to add a series of numbers is overkill. However, when I tried to use the DATA step, I realized that I potentially need to change the length of the categorical variable. If the length of the categorical variable is less than six, the word "Others" gets truncated. I am not aware of any way to change the length of a character variable "dynamically" in the DATA step, whereas in PROC IML the concatenation operator ensures that c = c // "Others" always has space for at least six characters.

In the following code, I follow a suggestion made by Chris Hemedinger and create a macro variable that is always greater than or equal to six. That macro variable is then used in a LENGTH statement to ensure that the categorical variable can hold the string "Others":

/* DATA step approach */
/* 1. Check length of the category variable. Create macro variable . */
data _null_;
   dsid = open('sashelp.cars');
   cur_len = varlen(dsid,varnum(dsid,"&VName")); /* length of categorical var */
   rc = close(dsid);
   call symputx('len',max(cur_len, 6));          /* new length is always >= 6 */
run;
 
/* 2. Keep first few obs; accumulate count of Others */
data Other2(keep=&VName Count);
length &VName $ &len;                     /* make sure length >= 6 */
retain OtherCount;
set FreqOut END=DONE;
if _N_>&NumCats & ^DONE then do;          /* sum up the counts for the smaller categories */ 
   OtherCount + Count;
   delete;                                /* get rid of this observation */
end;
if DONE then do;
   OtherCount + Count; 
   &VName = "Others"; Count = OtherCount; /* append last obs */
end;
run;

The Others2 data set is identical to the Others data set that was created by PROC IML. However, my Base SAS implementation is not as compact as the SAS/IML version. Perhaps you can do better? Let me know in the comments.

tags: Data Analysis, Statistical Graphics
1月 142013
 

When a categorical variable has dozens or hundreds of categories, it is often impractical and undesirable to create a bar chart that shows the counts for all categories. Two alternatives are popular:

  • Display only the Top 10 or Top 20 categories. As I showed last week, to do this in SAS you can use PROC FREQ to tabulate and order the data, and then use the OBS= data set option to truncate the data. See the article "Create a bar chart with only a few categories" for details and SAS code.
  • Display the Top 10 or Top 20 categories, but also aggregate the counts of the remaining categories to create a new "Others" category. This merging of small categories is done by default in some software, such as in SAS/IML Studio bar charts, and is very useful at the exploratory and model-building phases of data analysis. This bar chart was also available in the now-defunct SAS/INSIGHT product.
This article describes how to create an "Others" category in PROC SGPLOT. It assumes that the categorical variable is a character variable. If your variable is numeric, a short DATA step that uses the PUT function will create a character variable.

Preliminary step: Count and order the categories

As described in the previous article, you can use PROC FREQ to tabulate the counts of observations. The following statements count the number of different vehicles for car manufactures in the Sashelp.Cars data set. The FreqOut data set is sorted in descending order by the Count variable.

/* Combine small categories into an "Other" category; sum the counts */
%let VName = Make;                 /* character categorical variable */
 
proc freq data=sashelp.cars ORDER=FREQ noprint;
tables &VName / out=FreqOut;
run;

Main step: Create the "Others" category by aggregating small categories

In order to create the new category, just sum the counts of the small categories. For example, if you intend to show the 20 largest categories, sum the counts of categories 21, 22, 23, and so forth. There are several ways to do this. For me, the SAS/IML language is to most compact:

/* aggregate counts; append new level "Others" to truncated categorical variable */
%let NumCats = 20; /* number of categories in bar chart */
proc iml;
varNames = {"&VName" "Count"};
/* read categories and counts */
use FreqOut; read all var varNames; close FreqOut;
 
/* sum up the counts for the smaller categories */
OtherCount = sum(Count[&NumCats+1:nrow(Count)]);
&VName = &VName[1:&NumCats] // "Others";  /* truncate and append "Others" */
Count  = Count[1:&NumCats] // OtherCount; /* truncate and append sum of small counts */
 
/* write new data set with the Top categories and "Others" */
create Other var varNames; append; close Other;

The vertical concatenation operator (//) was used to append the "Others" category and counts to the Top 20. You can now create the bar chart by calling the SGPLOT procedure, as follows:

/* create bar chart that includes the "Others" category */
proc sgplot data=Other;
  title "Top &NumCats Vehicle Manufacturers and 'Others' Category";
  hbar &VName / freq=Count;
  yaxis discreteorder=data;
  xaxis label = "Count" grid;
run;

The graph shows that Toyota, Chevrolet, and other companies produce many different models, but that collectively the companies that manufacture 10 or fewer models produce more than 100 models. Advantages of the "Others" category include:

  • You can see the relative size ("market share") of the Top 20 categories.
  • You can plot percentages instead of counts, and the percentages will be correctly interpretable as a percentage of the whole.
  • The graph shows all the data, so it is more honest than a graph that selectively shows only a portion of the data.

Other ways to compute the "Others" category

You could argue that using SAS/IML to add a series of numbers is overkill. However, when I tried to use the DATA step, I realized that I potentially need to change the length of the categorical variable. If the length of the categorical variable is less than six, the word "Others" gets truncated. I am not aware of any way to change the length of a character variable "dynamically" in the DATA step, whereas in PROC IML the concatenation operator ensures that c = c // "Others" always has space for at least six characters.

In the following code, I follow a suggestion made by Chris Hemedinger and create a macro variable that is always greater than or equal to six. That macro variable is then used in a LENGTH statement to ensure that the categorical variable can hold the string "Others":

/* DATA step approach */
/* 1. Check length of the category variable. Create macro variable . */
data _null_;
   dsid = open('sashelp.cars');
   cur_len = varlen(dsid,varnum(dsid,"&VName")); /* length of categorical var */
   rc = close(dsid);
   call symputx('len',max(cur_len, 6));          /* new length is always >= 6 */
run;
 
/* 2. Keep first few obs; accumulate count of Others */
data Other2(keep=&VName Count);
length &VName $ &len;                     /* make sure length >= 6 */
retain OtherCount;
set FreqOut END=DONE;
if _N_>&NumCats & ^DONE then do;          /* sum up the counts for the smaller categories */ 
   OtherCount + Count;
   delete;                                /* get rid of this observation */
end;
if DONE then do;
   OtherCount + Count; 
   &VName = "Others"; Count = OtherCount; /* append last obs */
end;
run;

The Others2 data set is identical to the Others data set that was created by PROC IML. However, my Base SAS implementation is not as compact as the SAS/IML version. Perhaps you can do better? Let me know in the comments.

tags: Data Analysis, Statistical Graphics
1月 092013
 

Sometimes a categorical variable has many levels, but you are only interested in displaying the levels that occur most frequently. For example, if you are interested in the number of times that a song was purchased on iTunes during the past week, you probably don't want a bar chart with thousands of songs. Instead, you probably want a "Top 40" or "Top 20" chart that shows the number of purchases for the most popular songs.

These kinds of charts arise frequently, and you can create them in SAS by using PROC FREQ to compute the counts for the categories, followed by a call to PROC SGPLOT to display the bar chart for the top categories.

As an example, suppose that you are interested in plotting the number of different models of vehicles that are manufactured by each car company. The Sashelp.Cars data set has a Make variable, which identifies the company, and there is one observation for each different vehicle that a company manufactures. The following call to PROC FREQ counts the number of models that each company makes, sorts the counts in decreasing order, and writes the counts to a SAS data set:

proc freq data=sashelp.cars ORDER=FREQ noprint;
  tables make / out=FreqOut;
run;

The key syntax is the ORDER= option, which sorts the counts so that the first observation in the FreqOut data set is the company that makes the most vehicles, the second observation is the company that makes the second most, and so on. The FreqOut data set also has a Count variable that contains the number of vehicles for each manufacturer.

Because the observations are sorted by the Count variable, you can easily use the OBS= data set option to restrict the observations that appear in a plot:

%let NumCats = 20;   /* limit number of categories in bar chart */
 
proc sgplot data=FreqOut(OBS=&NumCats);   /* restrict to top companies */
  Title "Vehicle Manufacturers That Produce the Most Models";
  Title2 "Top &NumCats Shown";
  hbar make / freq=Count;      /* bar lengths given by Count */
  yaxis discreteorder=data;    /* order of bars same as data set order */
  xaxis label = "Number of Models";
run;

If, instead, you were interested in showing all companies that manufacture at least 12 vehicles, you could use a WHERE clause:

where count >= 12;

That was pretty easy. However, there is a second kind of bar chart that is also useful, but is more difficult to create. It requires aggregating all of the smaller categories into an "Other" category, and creating a bar whose length represents the sum of the counts for the smaller categories. I'll show how to create that kind of a bar chart in my next post.

tags: Data Analysis, Statistical Graphics
1月 092013
 

Sometimes a categorical variable has many levels, but you are only interested in displaying the levels that occur most frequently. For example, if you are interested in the number of times that a song was purchased on iTunes during the past week, you probably don't want a bar chart with thousands of songs. Instead, you probably want a "Top 40" or "Top 20" chart that shows the number of purchases for the most popular songs.

These kinds of charts arise frequently, and you can create them in SAS by using PROC FREQ to compute the counts for the categories, followed by a call to PROC SGPLOT to display the bar chart for the top categories.

As an example, suppose that you are interested in plotting the number of different models of vehicles that are manufactured by each car company. The Sashelp.Cars data set has a Make variable, which identifies the company, and there is one observation for each different vehicle that a company manufactures. The following call to PROC FREQ counts the number of models that each company makes, sorts the counts in decreasing order, and writes the counts to a SAS data set:

proc freq data=sashelp.cars ORDER=FREQ noprint;
  tables make / out=FreqOut;
run;

The key syntax is the ORDER= option, which sorts the counts so that the first observation in the FreqOut data set is the company that makes the most vehicles, the second observation is the company that makes the second most, and so on. The FreqOut data set also has a Count variable that contains the number of vehicles for each manufacturer.

Because the observations are sorted by the Count variable, you can easily use the OBS= data set option to restrict the observations that appear in a plot:

%let NumCats = 20;   /* limit number of categories in bar chart */
 
proc sgplot data=FreqOut(OBS=&NumCats);   /* restrict to top companies */
  Title "Vehicle Manufacturers That Produce the Most Models";
  Title2 "Top &NumCats Shown";
  hbar make / freq=Count;      /* bar lengths given by Count */
  yaxis discreteorder=data;    /* order of bars same as data set order */
  xaxis label = "Number of Models";
run;

If, instead, you were interested in showing all companies that manufacture at least 12 vehicles, you could use a WHERE clause:

where count >= 12;

That was pretty easy. However, there is a second kind of bar chart that is also useful, but is more difficult to create. It requires aggregating all of the smaller categories into an "Other" category, and creating a bar whose length represents the sum of the counts for the smaller categories. I'll show how to create that kind of a bar chart in my next post.

tags: Data Analysis, Statistical Graphics