data analysis

4月 032013
 

I was recently asked how to compute the difference between two density estimates in SAS. The person who asked the question sent me a link to a paper from The Review of Economics and Statistics that contains several examples of this technique (for example, see Figure 3 on p. 16 and Figure 4 on p. 17). The author of the paper uses the technique to demonstrate that the wage of a Mexican citizen influences his decision to emigrate to the US. Briefly, the paper plots a kernel density estimate of the income distribution of Mexicans who emigrated and those who did not, and shows that the incomes of those who decided to emigrate tend to be less than for those who decided to stay.

In SAS, you can form the difference between two density estimates by doing the following:

  1. Use PROC KDE to compute kernel density estimates of the two groups. Use the GRIDL= and GRIDU= options so that the two kernel densities are evaluated on the same grid of points. Use the OUT= option on the UNIVAR statement to write the density estimates to a SAS data set.
  2. Form the difference of the densities by subtracting one from the other.
  3. Use the SGPLOT procedure to plot the difference.

However, just because you can do something doesn't mean that you should do it! After I produce some graphs, I present some questions about the technique.

The difference of densities for subpopulations

For the sake of an example, I will use data in the Sashelp.Cars data set. I will compare the MPG_CITY for cars that are manufactured in the USA with the MPG_CITY for cars that are manufactured in Asia. The cars from the USA tend to get fewer miles per gallon than the cars from Asia, so these data are a convenient proxy for the Mexican incomes that are shown in the paper.

The following DATA step creates two MPG variables. The call to PROC KDE overlays both densities on a single plot and writes the kernel density estimates to the KerOut data set.

/* create example data: MPG_City for cars built in USA vs. Asia */
data cars;
 keep MPG_Asia MPG_USA;
 merge sashelp.cars(where=(origin="Asia") rename=(MPG_City=MPG_Asia))
       sashelp.cars(where=(origin="USA") rename=(MPG_City=MPG_USA));
run;
 
/* optional: use BWM= option to adjust kernel bandwidth */
proc kde data=cars;
univar MPG_Asia MPG_USA / plots=DensityOverlay out=KerOut GridL=5 GridU=65;
run;

The GRIDL= and GRIDU= options are necessary. Without them, the density estimate would not be evaluated at a common set of points. Then you would need to use an interpolation scheme to obtain the KDEs at a common set of locations. I have previously shown how to use SAS/IML functions for linear interpolation of data.

However, because we used the GRIDL= and GRIDU= options, no interpolation is necessary. You can just use the DATA step to merge the two KDEs and to compute the difference:

data Diff(drop=Var Count);
merge KerOut(where=(Var="MPG_Asia") rename=(Density=Density_Asia))
      KerOut(where=(Var="MPG_USA")  rename=(Density=Density_USA));
by Value;
Diff_Density = Density_USA - Density_Asia;
run;
 
/* display difference of group densities */
proc sgplot data=Diff;
   title "Difference between Densitites";
   series x=Value y=Diff_Density;
   refline 0 / axis=y;
   xaxis grid values=(5 to 65 by 5) label="MPG (City)";
   yaxis label="Difference in Density Estimates (USA - Asia)";
run;

You can interpret the graph as follows. Given that a vehicle was made in the USA, its fuel efficiency is generally less than for vehicles that were manufactured in Asia. The "difference plot" shows the extent of the difference in densities for the two subpopulations. This plot is interesting because American cars that get approximately 27 mpg seem to be an exception to the general rule.

Notice that the difference of densities is not itself a density. Rather, the difference plot is a way to visualize when the density of one distribution is greater than for a second distribution.

A critique of this method

I have never seen this "difference in density" technique before, and I have some questions about it. I am not familiar with a theoretical reference, so my criticisms are based on ignorance rather than on a careful study of the method.

One concern I have is that the scale of the "density difference plot" is clearly important, but the method seems to ignore it. Suppose that I have two normal populations, N(0, 1) and N(ε, 1). When I compute the difference in densities, the difference plot will look like the following plot.

The difference plot shows that the N(0, 1) distribution is to the left of the other distribution, but the plot doesn't warn the reader that the two distributions are essentially identical. No matter how small ε is, the difference plot will look similar...except for the scale of the vertical axis. Can the method can be extended to provide confidence limits? I don't know. Perhaps the integral of the absolute value of the difference (or the square of the difference) is an appropriate measure for how close the densities are.

Another concern I have is that kernel density estimation requires choosing a bandwidth parameter. There are various ways to choose a bandwidth when the goal is to estimate the density. However, it is not clear to me that the bandwidth that best estimates the density is also the bandwidth that best estimates the difference between the densities of two subpopulations.

The authors of the paper acknowledge some of these concerns (p. 17 and the bottom of p. 16), and they perform a nonparametric test, such as can be computed by using PROC NPAR1WAY. Nevertheless, I wonder when it makes sense to display the difference between two densities.

How would you analyze data like these? How would you visualize the results?

tags: Data Analysis
3月 272013
 

In statistics, distances between observations are used to form clusters, to identify outliers, and to estimate distributions. Distances are used in spatial statistics and in other application areas.

There are many ways to define the distance between observations. I have previously written an article that explains Mahalanobis distance, which is used often in multivariate analysis, and I have showed how to compute the Mahalanobis distance in SAS. Today's article is simpler: how do you compute the usual Euclidean distance in SAS?

Recall that the squared Euclidean distance between the point p = (p1, p2, ..., pn) and the point q = (q1, q2, ..., qn) is the sum of the squares of the differences between the components: Dist2(p,q) = Σi (piqi)2. The Euclidean distance is then the square root of Dist2(p,q). This article shows three ways to compute the Euclidean distance in SAS:

  1. By using the DISTANCE procedure in SAS/STAT software.
  2. By using the DISTANCE function in SAS/IML software.
  3. By writing your own SAS/IML function.

The following DATA step creates 27 observations that are arranged on an integer lattice in three dimensions. Each row of the data set contains an observation in three variables. The 27 points are (0,0,0), (0,0,1), (0,0,2), (0,1,0), ..., (2,2,2).

data Obs;
do x=0 to 2;
   do y=0 to 2;
      do z = 0 to 2;
         output;
      end;
   end;
end;
run;

Compute distance by using SAS/STAT software

PROC DISTANCE can compute many kinds of distance, and can also standardize the data variables, which is useful when your variable represent different quantities (such as height, weight, and age). In the simple case of Euclidean distance without any standardization, specify the METHOD=EUCLID option and the NOSTD option on the PROC DISTANCE statement, as follows:

proc distance data=Obs out=Dist method=Euclid nostd;
   var interval(x y z);
run;
 
proc print data=Dist(obs=4);
   format Dist: 8.6;
   var Dist1-Dist4;
run;

The output data set has 27 variables and 27 observations. The preceding table shows the first four observations and the first four variables. You can see that the output data set is the lower-triangular portion of the distance matrix. The ith row gives the distance between the ith observation and the jth observation for ji. For example, the distance between the fourth observation (0,1,0) and the second observation (0,0,1) is sqrt(02 + 12 + 12)= sqrt(2) = 1.414.

If you prefer to output the full, dense, symmetric matrix of distances, use the SHAPE=SQUARE option on the PROC DISTANCE statement.

Computing distance in SAS/IML Software

In SAS/IML software, you can use the DISTANCE function in SAS/IML to compute a variety of distance matrices. The DISTANCE function was introduced in SAS/IML 12.1 (SAS 9.3M2).

By default, the DISTANCE function computes the Euclidean distance, and the output is always a square matrix. Therefore, the following statements compute the Euclidean pairwise distances between the 27 points in the Obs data set:

proc iml;
use Obs;
read all var _NUM_ into X;
close Obs;
 
D = distance(X);
print (D[1:4, 1:4])[format=8.6 c=("Dist1":"Dist4")];

The output shows that the values in the upper-left portion of the distance matrix are the same as were computed by PROC DISTANCE.

A SAS/IML Module for computing Euclidean distance

Sometimes you do not want to compute the complete matrix of pairwise distances between observations. Sometimes you only need the distances between observations that belong to different groups. For example, you might have one set of locations that represent warehouses and another set that represents the locations of retail stores. You might be interested in computing the distances from each store to the warehouses, so that you can efficiently ship goods to each store from the closest warehouse.

The following SAS/IML module computes the distances between observations in the matrix X and the matrix Y. The (i,j)th element of the result is the distance between the ith row of X and the jth row of Y.

/* compute Euclidean distance between points in x and points in y.
   x is a n x d matrix, where each row is a point in d dimensions.
   y is a m x d matrix.
   The function returns the n x q matrix of distances, D, such that
   D[i,j] is the distance between x[i,] and y[j,]. */
start PairwiseDist(x, y);
   if ncol(x)^=ncol(y) then return (.);       /* different dimensions */
   n = nrow(x);  m = nrow(y);
   idx = T(repeat(1:n, m));                   /* index matrix for x   */
   jdx = shape(repeat(1:m, n), n);            /* index matrix for y   */
   diff = X[idx,] - Y[jdx,];
   return( shape( sqrt(diff[,##]), n ) );     /* sqrt(sum of squares) */
finish;
 
p = { 1  0, 
      1  1,
     -1 -1};
q = { 0  0,
     -1  0};
PD = PairwiseDist(p,q); 
print PD;

If you are still running SAS/IML 9.3 or earlier, you can use the PairwiseDist function to define your own function that computes the Euclidean distance between rows of a matrix:

/* compute Euclidean distance between points in x.
   x is a p x d matrix, where each row is a point in d dimensions.
   Use the DISTANCE function in SAS/IML 12.1 and later releases. */
start EuclideanDistance(x);  
   y=x;
   return( PairwiseDist(x,y) );
finish;
 
D = EuclideanDistance(X);
print (D[1:4, 1:4])[format=8.6 c=("Dist1":"Dist4")];

The output is the same as for the built-in DISTANCE function, and is not printed.

tags: Data Analysis, Matrix Computations, Statistical Programming
3月 212013
 
A big part of  "winning" these days (be it sports or a business) is performing analytics better than your competition.  This is demonstrated in awe-inspiring fashion in the book (and movie) "Moneyball."  And on that topic, I'd like to show you a few ways SAS can be used to analyze sports data [...]
3月 202013
 

Someone recently asked a question on the SAS Support Communities about estimating parameters in ridge regression. I answered the question by pointing to a matrix formula in the SAS documentation. One of the advantages of the SAS/IML language is that you can implement matrix formulas in a natural way. The SAS/IML expressions can often be written almost verbatim from the formula. This article uses a ridge regression formula from the PROC REG documentation to illustrate this feature.

When to use ridge regression?

Ridge regression is a variant to least squares regression that is sometimes used when several explanatory variables are highly correlated. The "usual" ordinary least squares (OLS) regression produces unbiased estimates for the regression coefficients (in fact, the Best Linear Unbiased Estimates). However, when the explanatory variables are correlated, the OLS parameter estimates have large variance. It might be desirable to use a different regression technique, such as ridge regression, in order to obtain parameter estimates that have smaller variance. The trade-off is that the estimates for the ridge regression method are biased.

If X is the centered and scaled data matrix, then the crossproduct matrix X`X is nearly singular when columns of X are highly correlated. Mathematically, ridge regression adds a multiple (called the ridge parameter, k) of the identity matrix to the X`X matrix. The nonsingular matrix (X`X + kI) is then used to compute the parameter estimates.

Each value of k results in a different set of parameter estimates. There have been many papers written about how to choose the best value of k. In this article, I merely want to show how to compute the parameters for ridge regression when the ridge parameter is given.

How to compute ridge regression in SAS

In SAS software, you can compute ridge regression by using the REG procedure. The following example from the PROC REG documentation is used to illustrate ridge regression. The RIDGE= option specifies the value(s) of the ridge parameter, k. The OUTEST= option is used to create an output data set that contains the parameter estimates for each value of k.

/* Ridge regression example from PROC REG documentation */
data acetyl;
input x1-x4 @@;
x1x2 = x1 * x2;
x1x1 = x1 * x1;
datalines;
1300  7.5 0.012 49   1300  9   0.012  50.2 1300 11 0.0115 50.5
1300 13.5 0.013 48.5 1300 17   0.0135 47.5 1300 23 0.012  44.5
1200  5.3 0.04  28   1200  7.5 0.038  31.5 1200 11 0.032  34.5
1200 13.5 0.026 35   1200 17   0.034  38   1200 23 0.041  38.5
1100  5.3 0.084 15   1100  7.5 0.098  17   1100 11 0.092  20.5
1100 17   0.086 29.5
;
 
proc reg data=acetyl outest=b ridge=0.02 noprint;
   model x4=x1 x2 x3 x1x2 x1x1;
run;
 
proc print data=b(where=(_TYPE_="RIDGE")) noobs;
   var _RIDGE_ Intercept x1 x2 x3 x1x2 x1x1;
run;

The parameter estimates for the ridge regression are shown for the ridge parameter k = 0.02.

Implementing a matrix formula for ridge regression by using SAS/IML software

The question that was asked on the SAS Discussion Forum was about where to find the matrix formula for estimating the ridge regression coefficients. The documentation for the REG procedure includes a section that provides the formula that PROC REG uses to compute the ridge regression coefficients. The accompanying text says:

Let X be the matrix of the independent variables after centering [and scaling] the data, and let Y be a vector corresponding to the [centered] dependent variable. Let D be a diagonal matrix with diagonal elements as in X`X. The ridge regression estimate corresponding to the ridge constant k can be computed as D-1/2(Z`Z + kI)-1Z`Y.

The terms in brackets do not appear in the original documentation, but I included them for clarity. Since this is a matrix formula, let's use the SAS/IML language to implement the formula. The following SAS/IML program uses the formula to compute the ridged parameter estimates:

/* Ridge regression coefficients computed in SAS/IML */
proc iml;
use acetyl;        
read ALL var {x1 x2 x3 x1x2 x1x1} into X; 
read all var {x4} into y;
close acetyl;        
 
/* ridge regression */
xBar = mean(X);       /* save row vector of means */
yBar = mean(y);       /* save mean of Y */
X = X - xBar;         /* center X and y; do not add intercept */
y = y - yBar;                      
D = vecdiag(X`*X);
Z = X / sqrt(D`);     /* Z`Z = corr(X) */
k = 0.02;             /* choose a ridge parameter */
b =  (1/sqrt(D)) # inv(Z`*Z + k*I(ncol(X))) * (Z`*y); /* formula in REG doc */
print (b`)[colname={"x1" "x2" "x3" "x1x2" "x1x1"}];

So that the formula and the SAS/IML statements match each other, I have written the computation in a "natural" way, rather than in a more efficient way. See the article "Do you really need to compute that matrix inverse?" In any case, you can see that the parameter estimates that are compute from the formula agree with the estimates that are computed by PROC REG.

However, notice that the PROC IML computations do not include an intercept term. This is because the independent and dependent variables were all centered, so the intercept in these coordinates is exactly zero. To obtain the intercept in the original (uncentered) coordinates, you can use a simple formula that recovers the intercept estimate:

   /* The ridge regression uses centered x and y. Recover the intercept:
      y-yBar = b1(x1-x1Bar) + ... + bn(xn-xnBar)
   so      y = yBar-Sum(bi*xibar) + b1 x1 + ... + bn xn
   Consequently, b0 = yBar-Sum(bi*xibar). */
   b0 = ybar - xbar * b;
   print b0[label="Intercept"];

Got a matrix formula? Use SAS/IML software

As this article shows, the SAS/IML language enables you to implement matrix formulas in a natural way. Although this article implements a formula that is already built into a SAS procedure, the same ideas apply for formulas and algorithms that are not available in any SAS procedure.

So the next time that you need to implement a matrix formula that appears in a textbook, in documentation, or in a journal article, reach for the SAS/IML language!

tags: Data Analysis, Matrix Computations, Statistical Programming
3月 132013
 

Argh! I've just spilled coffee on output that shows the least squares coefficients for a regression model that I was investigating. Now the parameter estimate for the intercept is completely obscured, although I can still see the parameter estimates for the coefficients of the continuous explanatory variable. What can I do? Do I need to rerun the regression in order to recover the estimate for the intercept term?

No, it turns out that I do not. If I know (or can compute) the mean values of the response variable and the explanatory variables, then that is enough information to recreate the intercept estimate.

A formula for the intercept of a regression model

In high school or college, you might have learned a simple formula for computing the intercept for a one-variable regression model. If my is the mean of the response variable, mx is the mean of the explanatory variable, and b is the slope of the least square regression line, then the intercept is computed as b0 = myb mx.

For multivariate regression with several continuous variables, the formula is similar: b0 = my − Σibi mi, where mi is the mean of the ith explanatory variable. In vector notation, if b is the vector of coefficients for the explanatory variables and mx is the corresponding vector of means, then you can use the inner product operator to compute the intercept: b0 = myb*mx.

Math saves the day

Let's see how this works on my coffee-stained example. The original analysis was created by using this regression model in PROC REG:

proc reg data=sashelp.cars;
   model Mpg_city = Weight Cylinders EngineSize;
   ods select ParameterEstimates;
quit;

Rather than rerun the regression, I'll use the SAS/IML language to compute the means of the relevant variables, and use the inner product formula to obtain the estimate for the intercept term. In practice, you might also need to take care of two additional details:

  • The REG procedure excludes observations for which any explanatory variable is missing. If your data contain missing values, remove the rows of your data that have missing values. One way to do it is to use the COUNTMISS function in the SAS/IML language to exclude the rows with missing values.
  • The ParameterEstimates table displays the parameter estimates in a format that aligns the decimal points, but does not necessarily represent the full precision of the coefficients. Because I want the following PROC IML statements to give exactly the same estimate for the intercept term as reported by PROC REG, I used additional digits of precision for the coefficient of the WEIGHT variable.

proc iml;
use sashelp.cars;
read all var {Weight Cylinders EngineSize} into X;  /* read explanatory vars */
read all var {Mpg_city} into Y;                     /* read response var */
close sashelp.cars;
 
/* include only complete cases (exclude missing) */
idx = loc(countmiss(X, "row")=0);
X = X[idx, ];
Y = Y[idx, ];
 
xBar = mean(X);                         /* row vector of means for X */
yBar = mean(Y);                         /* mean of Y */
b = {-.003165747, -0.55569, -0.93885};  /* regression coefficients for X */
b0 = yBar - xBar * b;                   /* recover intercept term */
print b0;

The output shows that the intercept term for this multivariable regression is about 37.6. If you run the REG procedure, you obtain the same estimate.

Why should you care?

Okay, you might have guessed by now that I didn't really spill coffee on a piece of paper, so why did I go through this exercise? It is faster and easier to simply rerun PROC REG. Why go to the trouble of computing the interept estimate in terms of the means of the variables?

The reason is that some regression problems are easier to solve if you not only center the explanatory variables but also center the response variable. The centering operation results in a solution that passes through (0,0). In other words, the y-intercept term is zero. However, by using the trick in this article, you can "recover" the intercept term for the uncentered data by using the simple linear operation that I've described here. In a future article, I'll apply this trick to solving a problem that uses ridge regression. That article won't feature any spilled coffee, but the trick will be useful anyway.

tags: Data Analysis, Statistical Programming, Tips and Techniques
3月 062013
 

There is something for everyone at SAS Global Forum 2013. I like to attend presentations in the Statistics and Data Analysis track and talk with SAS customers in the SAS Support and Demo Area. But one activity that I enjoy the most is to stroll through the poster area and to chat informally with the poster presenters. In contrast to the formal lectures and presentations, posters promote a two-way dialog between the presenter and the listener. If a poster interests me, I might stay and talk for 20 minutes. If a poster does not interest me, I can move on to the next poster.

This year, several posters feature SAS/IML computations, so I am keen to see those and to talk with the presenters. This year the presenters will be at their posters on Monday, April 29, 2013, 2:00 pm–3:30 pm. Here are a few posters whose abstracts mention the SAS/IML language:

  • A SAS Macro for Generating a Set of All Possible Samples with Unequal Probabilities without Replacement
  • Optimize SAS/IML Software Codes for Big Data Simulation
  • A Flexible Method to Apply Multiple Imputation Using SAS/IML Studio
  • Analyzing Partially Confounded Factorial Conjoint Choice Experiments Using SAS/IML
  • What Score Should Johnny Get? Missing_Items SAS Macro for Analyzing Missing Item Responses on Summative Scales
  • Utilizing SAS for the Construction of Preassembled Parallel, Computerized Fixed-Test Forms under Item Response Theory Framework

The poster on how to "Optimize SAS/IML [for] Simulation" interests me because I am currently putting the finishing touches on my newest book, Simulating Data with SAS. There will be an advanced copy of my book at the SAS Press booth, in case you want to browse it.

Of course, I am also excited to be giving a free Hands-On Workshop, "Getting Started with the SAS/IML Language," on Tuesday, April 30, 2013, at 1:30 pm–3:10 pm (Room 2024 in the Moscone West Convention Center).

I am also interested in the paper "Using SAS to Measure Airport Connectivity: An Analysis of Airport Centrality in the US Network with SAS/IML Studio," which will be presented on Wednesday, May 1, 2013, at 9:00 am–9:50 am. Because SAS/IML Studio supports dynamically linked statistical graphs, I think it will be an interesting talk and demo. Also, I've done an analysis of US airports using SAS/IML Studio, so it will be fun to see a related analysis.

What talks are you looking forward to at SAS Global Forum 2013? I hope to see you there!

tags: Data Analysis
2月 272013
 

A SAS user asked an interesting question on the SAS/GRAPH and ODS Graphics Support Forum. The question is: Does PROC SGPLOT support a way to display the slope of the regression line that is computed by the REG statement? Recall that the REG statement in PROC SGPLOT fits and displays a line through points in a scatter plot.

In SAS 9.3, you cannot obtain this information directly from PROC SGPLOT. Instead, you need to use PROC REG to compute this information. You can use the following steps to create a plot that displays the parameter estimates:

  1. Use PROC REG to compute the parameter estimates (slope and intercept). Save this information to a SAS data set.
  2. Use a DATA step to create macro variables that contain the parameter estimates.
  3. Use the INSET statement in PROC SGPLOT to add this information to the fitted scatter plot\.

Step 1: Save the parameter estimates

You can use the OUTEST= option or the ODS OUPUT statements to save the parameter estimates to a SAS data set. In the following example, the ODS OUTPUT statement saves the ParameterEstimates table to the PE data set:

ods graphics off;
proc reg data=sashelp.class;
   model weight = height;
   ods output ParameterEstimates=PE;
run;

Step 2: Create macro variables

In the PE data set, the ESTIMATE variable contains the parameter estimates. The first row contains the estimate for the intercept term; the second row contains the estimate for the slope. The following DATA step saves these into macro variables:

data _null_;
   set PE;
   if _n_ = 1 then call symput('Int', put(estimate, BEST6.));    
   else            call symput('Slope', put(estimate, BEST6.));  
run;

Step 3: Use the INSET Statement to display the parameter estimates

You can now create the plot by using PROC SGPLOT. Use the INSET statement to display the parameter estimates in a text box:

proc sgplot data=sashelp.class noautolegend;
   title "Regression Line with Slope and Intercept";
   reg y=weight x=height;
   inset "Intercept = &Int" "Slope = &Slope" / 
         border title="Parameter Estimates" position=topleft;
run;

Of course, you can use a similar strategy to display any other relevant statistics on the scatter plot. There is an example in the SAS/STAT User's Guide that shows other fit statistics, as well as how to use Greek letters and superscripts in the inset text. You can also display a formula that shows the equation of a displayed line.

tags: Data Analysis, Statistical Graphics
2月 062013
 

Last week the SAS Training Post blog posted a short article on an easy way to find variables in common to two data sets. The article used PROC CONTENTS (with the SHORT option) to print out the names of variables in SAS data sets so that you can visually determine whether the data sets have any variables in common. The article also mentioned using the COMPARE procedure or writing a PROC SQL query that interrogates DICTIONARY tables.

But what if you want to find variable names that are common to many data sets?

The PROC SQL approach is a programming solution, so it might be up to the challenge. A quick internet search reveals one way to use PROC SQL to find common variables in two data sets (see p. 4 of the linked paper). I am not a PROC SQL expert, but the approach in that paper seems difficult to generalize to the case of multiple data sets.

Because I like the SAS/IML language, this article shows how to find all variables that are common to multiple data sets. The following statements define six SAS data sets:

data D1 D2;
A=1; b=2; C=3; D=4; E=5; F=6; g=7; h=8; I=9; J=10;
run;
data D3 D4;
j=1; f=2; h=3; a=4; N=7; L=6; c=7;
run;
data D5;
J=1; D=2; A=3; g=4; h=5; P=6; q=7;
run;
data D6;
C=1; M=2; F=3; a=4; j=5; H=6; B=7; R=8; K=9;
run;

I would have a hard time visually determining which variables are common to all of the data sets, so I'm going to write a program. I will use two SAS/IML functions to help:

  • The CONTENTS function returns a sorted list of the variables in a SAS data set. Use the UPCASE function in Base SAS to get the names in uppercase format so that you can perform case-insensitive comparisons.
  • The XSECT function returns the intersection between two or more arrays of values.

With those two functions, you can obtain the variables names that are common to the data sets D1–D6, as follows:

proc iml;
DSNames = "D1":"D6";
InCommon = upcase(contents(DSNames[1]));     /* get all vars in D1      */
do i = 2 to ncol(DSNames);                   /* loop over data sets     */
   varNames = upcase(contents(DSNames[i]));  /* get variable names      */
   InCommon = xsect(InCommon, varNames);     /* intersect with previous */
end;
print InCommon;

The variables that are common to all the SAS data sets are A, H, and J. If you want to generalize the problem even more, you can use the SAS/IML DATASETS function to get the names of all data sets in a library. For example, you could use DSNames = T(datasets("work")) instead of hard-coding the data set names in this example.

I invite you to submit your own solution in the comments.

tags: Data Analysis, Statistical Programming, Tips and Techniques
1月 142013
 

When a categorical variable has dozens or hundreds of categories, it is often impractical and undesirable to create a bar chart that shows the counts for all categories. Two alternatives are popular:

  • Display only the Top 10 or Top 20 categories. As I showed last week, to do this in SAS you can use PROC FREQ to tabulate and order the data, and then use the OBS= data set option to truncate the data. See the article "Create a bar chart with only a few categories" for details and SAS code.
  • Display the Top 10 or Top 20 categories, but also aggregate the counts of the remaining categories to create a new "Others" category. This merging of small categories is done by default in some software, such as in SAS/IML Studio bar charts, and is very useful at the exploratory and model-building phases of data analysis. This bar chart was also available in the now-defunct SAS/INSIGHT product.
This article describes how to create an "Others" category in PROC SGPLOT. It assumes that the categorical variable is a character variable. If your variable is numeric, a short DATA step that uses the PUT function will create a character variable.

Preliminary step: Count and order the categories

As described in the previous article, you can use PROC FREQ to tabulate the counts of observations. The following statements count the number of different vehicles for car manufactures in the Sashelp.Cars data set. The FreqOut data set is sorted in descending order by the Count variable.

/* Combine small categories into an "Other" category; sum the counts */
%let VName = Make;                 /* character categorical variable */
 
proc freq data=sashelp.cars ORDER=FREQ noprint;
tables &VName / out=FreqOut;
run;

Main step: Create the "Others" category by aggregating small categories

In order to create the new category, just sum the counts of the small categories. For example, if you intend to show the 20 largest categories, sum the counts of categories 21, 22, 23, and so forth. There are several ways to do this. For me, the SAS/IML language is to most compact:

/* aggregate counts; append new level "Others" to truncated categorical variable */
%let NumCats = 20; /* number of categories in bar chart */
proc iml;
varNames = {"&VName" "Count"};
/* read categories and counts */
use FreqOut; read all var varNames; close FreqOut;
 
/* sum up the counts for the smaller categories */
OtherCount = sum(Count[&NumCats+1:nrow(Count)]);
&VName = &VName[1:&NumCats] // "Others";  /* truncate and append "Others" */
Count  = Count[1:&NumCats] // OtherCount; /* truncate and append sum of small counts */
 
/* write new data set with the Top categories and "Others" */
create Other var varNames; append; close Other;

The vertical concatenation operator (//) was used to append the "Others" category and counts to the Top 20. You can now create the bar chart by calling the SGPLOT procedure, as follows:

/* create bar chart that includes the "Others" category */
proc sgplot data=Other;
  title "Top &NumCats Vehicle Manufacturers and 'Others' Category";
  hbar &VName / freq=Count;
  yaxis discreteorder=data;
  xaxis label = "Count" grid;
run;

The graph shows that Toyota, Chevrolet, and other companies produce many different models, but that collectively the companies that manufacture 10 or fewer models produce more than 100 models. Advantages of the "Others" category include:

  • You can see the relative size ("market share") of the Top 20 categories.
  • You can plot percentages instead of counts, and the percentages will be correctly interpretable as a percentage of the whole.
  • The graph shows all the data, so it is more honest than a graph that selectively shows only a portion of the data.

Other ways to compute the "Others" category

You could argue that using SAS/IML to add a series of numbers is overkill. However, when I tried to use the DATA step, I realized that I potentially need to change the length of the categorical variable. If the length of the categorical variable is less than six, the word "Others" gets truncated. I am not aware of any way to change the length of a character variable "dynamically" in the DATA step, whereas in PROC IML the concatenation operator ensures that c = c // "Others" always has space for at least six characters.

In the following code, I follow a suggestion made by Chris Hemedinger and create a macro variable that is always greater than or equal to six. That macro variable is then used in a LENGTH statement to ensure that the categorical variable can hold the string "Others":

/* DATA step approach */
/* 1. Check length of the category variable. Create macro variable . */
data _null_;
   dsid = open('sashelp.cars');
   cur_len = varlen(dsid,varnum(dsid,"&VName")); /* length of categorical var */
   rc = close(dsid);
   call symputx('len',max(cur_len, 6));          /* new length is always >= 6 */
run;
 
/* 2. Keep first few obs; accumulate count of Others */
data Other2(keep=&VName Count);
length &VName $ &len;                     /* make sure length >= 6 */
retain OtherCount;
set FreqOut END=DONE;
if _N_>&NumCats & ^DONE then do;          /* sum up the counts for the smaller categories */ 
   OtherCount + Count;
   delete;                                /* get rid of this observation */
end;
if DONE then do;
   OtherCount + Count; 
   &VName = "Others"; Count = OtherCount; /* append last obs */
end;
run;

The Others2 data set is identical to the Others data set that was created by PROC IML. However, my Base SAS implementation is not as compact as the SAS/IML version. Perhaps you can do better? Let me know in the comments.

tags: Data Analysis, Statistical Graphics
1月 142013
 

When a categorical variable has dozens or hundreds of categories, it is often impractical and undesirable to create a bar chart that shows the counts for all categories. Two alternatives are popular:

  • Display only the Top 10 or Top 20 categories. As I showed last week, to do this in SAS you can use PROC FREQ to tabulate and order the data, and then use the OBS= data set option to truncate the data. See the article "Create a bar chart with only a few categories" for details and SAS code.
  • Display the Top 10 or Top 20 categories, but also aggregate the counts of the remaining categories to create a new "Others" category. This merging of small categories is done by default in some software, such as in SAS/IML Studio bar charts, and is very useful at the exploratory and model-building phases of data analysis. This bar chart was also available in the now-defunct SAS/INSIGHT product.
This article describes how to create an "Others" category in PROC SGPLOT. It assumes that the categorical variable is a character variable. If your variable is numeric, a short DATA step that uses the PUT function will create a character variable.

Preliminary step: Count and order the categories

As described in the previous article, you can use PROC FREQ to tabulate the counts of observations. The following statements count the number of different vehicles for car manufactures in the Sashelp.Cars data set. The FreqOut data set is sorted in descending order by the Count variable.

/* Combine small categories into an "Other" category; sum the counts */
%let VName = Make;                 /* character categorical variable */
 
proc freq data=sashelp.cars ORDER=FREQ noprint;
tables &VName / out=FreqOut;
run;

Main step: Create the "Others" category by aggregating small categories

In order to create the new category, just sum the counts of the small categories. For example, if you intend to show the 20 largest categories, sum the counts of categories 21, 22, 23, and so forth. There are several ways to do this. For me, the SAS/IML language is to most compact:

/* aggregate counts; append new level "Others" to truncated categorical variable */
%let NumCats = 20; /* number of categories in bar chart */
proc iml;
varNames = {"&VName" "Count"};
/* read categories and counts */
use FreqOut; read all var varNames; close FreqOut;
 
/* sum up the counts for the smaller categories */
OtherCount = sum(Count[&NumCats+1:nrow(Count)]);
&VName = &VName[1:&NumCats] // "Others";  /* truncate and append "Others" */
Count  = Count[1:&NumCats] // OtherCount; /* truncate and append sum of small counts */
 
/* write new data set with the Top categories and "Others" */
create Other var varNames; append; close Other;

The vertical concatenation operator (//) was used to append the "Others" category and counts to the Top 20. You can now create the bar chart by calling the SGPLOT procedure, as follows:

/* create bar chart that includes the "Others" category */
proc sgplot data=Other;
  title "Top &NumCats Vehicle Manufacturers and 'Others' Category";
  hbar &VName / freq=Count;
  yaxis discreteorder=data;
  xaxis label = "Count" grid;
run;

The graph shows that Toyota, Chevrolet, and other companies produce many different models, but that collectively the companies that manufacture 10 or fewer models produce more than 100 models. Advantages of the "Others" category include:

  • You can see the relative size ("market share") of the Top 20 categories.
  • You can plot percentages instead of counts, and the percentages will be correctly interpretable as a percentage of the whole.
  • The graph shows all the data, so it is more honest than a graph that selectively shows only a portion of the data.

Other ways to compute the "Others" category

You could argue that using SAS/IML to add a series of numbers is overkill. However, when I tried to use the DATA step, I realized that I potentially need to change the length of the categorical variable. If the length of the categorical variable is less than six, the word "Others" gets truncated. I am not aware of any way to change the length of a character variable "dynamically" in the DATA step, whereas in PROC IML the concatenation operator ensures that c = c // "Others" always has space for at least six characters.

In the following code, I follow a suggestion made by Chris Hemedinger and create a macro variable that is always greater than or equal to six. That macro variable is then used in a LENGTH statement to ensure that the categorical variable can hold the string "Others":

/* DATA step approach */
/* 1. Check length of the category variable. Create macro variable . */
data _null_;
   dsid = open('sashelp.cars');
   cur_len = varlen(dsid,varnum(dsid,"&VName")); /* length of categorical var */
   rc = close(dsid);
   call symputx('len',max(cur_len, 6));          /* new length is always >= 6 */
run;
 
/* 2. Keep first few obs; accumulate count of Others */
data Other2(keep=&VName Count);
length &VName $ &len;                     /* make sure length >= 6 */
retain OtherCount;
set FreqOut END=DONE;
if _N_>&NumCats & ^DONE then do;          /* sum up the counts for the smaller categories */ 
   OtherCount + Count;
   delete;                                /* get rid of this observation */
end;
if DONE then do;
   OtherCount + Count; 
   &VName = "Others"; Count = OtherCount; /* append last obs */
end;
run;

The Others2 data set is identical to the Others data set that was created by PROC IML. However, my Base SAS implementation is not as compact as the SAS/IML version. Perhaps you can do better? Let me know in the comments.

tags: Data Analysis, Statistical Graphics