September 16, 2013
 

SAS has supported calling R from the SAS/IML language since 2009. The interface to R is part of the SAS/IML language. However, there have been so many versions of SAS and R since 2009 that it is hard to remember which SAS release supports which versions of R. The following table lists recent SAS releases and the versions of R that each supports:

To date, three releases of R broke backwards compatibility with prior SAS releases:

  • R 2.12.0, which introduced 64-bit R, changed the locations of certain libraries. To use R 2.12.0 or later, you must use SAS 9.3 or later.
  • R 3.0.0 changed certain aspects of the external API to R. To use R 3.0.0 or later you must use SAS 9.4 or later.
  • R 3.0.2 changed an internal detail that SAS was using. SAS/IML 13.1 will support R 3.0.2.
An error message that you might see because of incompatible versions is "An installed version of R could not be found," although this message can also occur for other reasons.

Each SAS release maintains backward compatibility with as many previous R releases as possible. However, after a version of SAS ships, it is impossible to ensure forward compatibility, as evidenced by the R 3.0.0 release. That version of R was released in April 2013; it is supported in SAS 9.4 software, which shipped a few months later.

Some software companies distribute their own versions of R, but SAS does not. Consequently, if the interface to R changes, SAS customers need to use a compatible version of R until they can upgrade to a more recent version of SAS. To reiterate:

  • If you are running a version of SAS prior to SAS 9.4, you should use R 2.15.x or earlier.
  • You can call R 3.0.1 by using SAS/IML 12.3 or later. SAS/IML 12.3 shipped in July 2013 as part of SAS 9.4.
  • You can call R 3.0.2 and later by using SAS/IML 13.1 or later. SAS/IML 13.1 is scheduled to ship in December 2013 as part of SAS 9.4m1.

I'll close with a few comments about 32-bit and 64-bit versions of SAS and R.

  • If you are using SAS software on a 32-bit edition of Windows, you must install the 32-bit edition of R.
  • If you are using SAS software on a 64-bit edition of Windows, you can install either the 32-bit or the 64-bit edition of R. The 32-bit edition of SAS looks first for the 32-bit edition of R and then for the 64-bit edition. The 64-bit edition of SAS looks first for the 64-bit edition of R and then for the 32-bit edition.
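One way to verify which R installation SAS/IML has found is to submit a statement to R and print the version string. This is a minimal sketch; it assumes that SAS was started with the RLANG system option, which is required in order to call R:

```sas
proc iml;
/* send statements to the R engine; requires the RLANG system option */
submit / R;
   print(R.version.string)   # ask R to report its own version
endsubmit;
```

If SAS cannot locate a compatible R installation, this is the point at which you will see the "An installed version of R could not be found" message.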

tags: Data Analysis, SAS/IML Studio
August 21, 2013
 
Radar chart of word categories used in debates

A common visualization is to compare characteristics of two groups. This article emphasizes two tips that will help make the comparison clear. First, consider graphing the differences between the groups. Second, in any plot that has a categorical axis, sort the categories by a meaningful quantity.

This article is motivated by a radar plot that compares the performance of Barack Obama and Mitt Romney in the 2012 US presidential debates. The radar plot (shown at left) displays 12 characteristics of the candidates' words during the debates. Characteristics included "directness", "talkativeness", and "sophistication." The radar chart accompanies an interesting article about using text analytics to investigate the candidates' styles and performance during the debates.

Unfortunately, I found it difficult to use the radar chart to compare the candidates. I wondered whether a simpler chart might be more effective. I couldn't find the exact numbers used to create the radar chart, but I could use the radar chart to estimate the values. The following DATA step defines the data. The numbers seem to be the relative percentages that each candidate displayed a given characteristic. (Possibly weighted by the amount of time that each candidate spoke?) Each row adds to 100%. For example, of the words and phrases that were classified as exhibiting "individualism," 47% of the phrases were uttered by Obama, and 53% by Romney.

data Debate2012;
input Category $15. Obama Romney;
datalines;
Individualism  47 53
Directness     49 51
Talkativeness  49 51
Achievement    45 55
Perceptual     44 56
Quantitative   36 64
Insight        47 53
Causation      59 41
Thinking       52 48
Certainty      56 44
Sophistication 51 49
Collectivism   56 44
;

You can use the data in this form to display a series plot (line plot) of the percentages versus the categories, but the main visual impact of the line plot is that Obama's values are low when Romney's are high, and vice versa. Since that is just a consequence of the fact that the percentages add to 100%, I do not display the series plot.

To display a bar chart, you can create a grouping variable with the values "Obama" and "Romney." You can then use the GROUP= option on the HBAR statement in PROC SGPLOT to create the following stacked bar chart:

data Long;        /* transpose the data from wide to long */
set Debate2012;
Candidate = "Obama "; Value = Obama;  output;
Candidate = "Romney"; Value = Romney; output;
drop Obama Romney;
run;
 
title "2012 US Presidential Debates";
title2 "Linguistic Style Indexed Across Three Debates";
footnote justify=left "Data from http://blog.odintext.com/?p=179";
proc sgplot data=Long;
   hbar Category / response=Value group=Candidate groupdisplay=stack;
   refline 50 / axis=x lineattrs=(color=black);
   yaxis discreteorder=data;
   xaxis label="Percentage" values=(0 to 100 by 10);
run;

The stacked bar chart makes it easy to compare the styles of the two candidates. In some categories (directness, talkativeness, and sophistication), the candidates are similar. In others (quantitative, causation), the candidates differ more dramatically. A prominent reference line at 50% provides a useful dividing line for comparing the candidates' styles.

However, I see two problems with the stacked bar chart. First, the categories do not seem to be sorted in any useful order. (This order was used for the radar chart.) Second, if you want to contrast two groups, it often makes sense to plot the difference between the groups rather than the absolute quantities. This results in the following changes:
  • The reference line becomes zero, which represents no difference.
  • The chart can use a smaller scale, so that it is easier to see small differences.
  • You can sort the categories by the difference, so that the order of the categories becomes useful.
  • The difference is a single quantity, which is easier to visualize than two separate quantities. (Although, since the percentages sum to 100%, there is only one independent quantity no matter how you visualize the data.)

The following DATA step uses the original data to compute the percentage difference between the candidates for each style category. Although it is not necessary to use two colors to visualize a single quantity, in US politics it is a convention to use red to represent Republicans (Romney) and blue to represent Democrats (Obama).

data DebateDiff;
set Debate2012;
label Difference = "Percentage Difference: Romney - Obama";
Difference = Romney - Obama;
if Difference>0 then Advantage = "Romney";
else                 Advantage = "Obama ";
run;
 
proc sort data=DebateDiff;  /* sort categories by difference */
   by Difference;
run;
 
title2 "Linguistic Style Differences Indexed Across Three Debates";
proc sgplot data=DebateDiff;
   hbar Category / response=Difference group=Advantage;
   refline 0 / axis=x;
   yaxis discreteorder=data;
   xaxis grid;
   keylegend / position=topright location=inside title="Candidate" across=1; 
run;

I think that the "difference plot" is superior to the original radar chart and to the stacked bar chart. It makes a clear contrast of the candidates' styles. You can clearly see that Obama's words evoked causation, certainty, and collectivism much more than Romney's did. In contrast, Romney's words were perceptual and quantitative more often than Obama's. In a statistical analysis that includes uncertainties, you could even include a margin of error on this chart. For example, it might be the case that the candidates' styles were not significantly different for the sophistication, directness, and talkativeness categories. If so, those bars might be colored white or grey.
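For example, if an analysis provided lower and upper limits for each difference, you could display the margin of error by using the XERRORLOWER= and XERRORUPPER= options of the SCATTER statement in PROC SGPLOT. This is only a sketch: the Lower and Upper variables are hypothetical and would have to come from your statistical analysis:

```sas
/* sketch: display hypothetical margins of error for the differences */
proc sgplot data=DebateDiff;
   scatter y=Category x=Difference /
           xerrorlower=Lower xerrorupper=Upper  /* hypothetical variables */
           group=Advantage;
   refline 0 / axis=x;
   yaxis discreteorder=data;
   xaxis grid;
run;
```

Differences whose error bars cross the reference line at zero would then be visually flagged as not significant.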

Although radar charts are useful for plotting periodic categories such as months and wind directions, I think a simple bar chart is more effective for this analysis. The main idea that I want to convey is that if you want to contrast two groups, consider displaying the difference between the groups. Also, order categories in a meaningful way, rather than default to an arbitrary order such as alphabetical order. By using these two simple tips, you can create a graph that makes a clear comparison between two groups.

What do you think? For these data, which plot would you choose to display differences between the candidates?

tags: Data Analysis, Statistical Graphics
August 14, 2013
 

Earlier this week I posted a "guest blog" in which my 8th grade son described a visualization of data for the 2013 ASA Poster Competition. The purpose of today's blog post is to present a higher-level statistical analysis of the same data. I will use a t test and a regression analysis with a classification variable—two concepts that are not generally known to elementary school students. (If you like analyzing kids' experiments with SAS, see also Chris Hemedinger's 2011 analysis of his daughter's "Mentos and Coke" science fair project.)

The experiment measured the drying times for 40 loads of laundry. Each load was weighed upon emerging from the washing machine. For 20 loads, my son inserted two plastic "dryer balls" into the dryer with the laundry. The other 20 loads were dried without using dryer balls. Consequently, the variables are as follows:

  • Weight: the weight, in grams, of the wet clothes
  • Time: the time, in minutes, required to dry the clothes, as measured by an automatic "dampness sensor" in the dryer
  • Balls: an indicator variable that has the value 1 if dryer balls were included in the dryer, and 0 otherwise

The data are available for download if you'd like to analyze these data on your own.

A t-test analysis

On average, does the presence of dryer balls result in shorter drying times? One way to answer this question is to ask whether the mean drying time is the same for the dryer-ball group as compared to the control loads:

proc ttest data=DryerBalls;
   class Balls;
   var Time;
run;

The comparative histograms and side-by-side box plots indicate that the distributions of drying times are similar for the two groups. Notice that the 95% confidence interval for the difference of the means includes 0. A t test produces a large p-value (circled), which indicates that the means are not significantly different.
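If you want to use the p-value programmatically rather than read it from the displayed table, you can capture the t-test results in a data set by using ODS OUTPUT. A sketch, where TTests is the name of the ODS table that PROC TTEST produces:

```sas
/* sketch: capture the t-test table in a data set */
ods output TTests=TTestResults;
proc ttest data=DryerBalls;
   class Balls;
   var Time;
run;
proc print data=TTestResults; run;   /* includes the t value and p-value */
```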

A regression analysis

How does the drying time depend on the weight of the wet clothes and on the presence or absence of dryer balls? You can use the GLM procedure to model the relationship:

ods html style=htmlbluecml;
proc glm data=DryerBalls plots=all;
   class Balls;
   model Time = Weight Balls / solution;
run;

The fit plot tells the story. There are actually two lines in the plot, one for Balls=1 and the other for Balls=0. However, the lines are right on top of each other, so it looks like there is only one line. The type III sum of squares and the parameter estimates confirm that the Balls variable is not significant to the model. In other words, the presence or absence of dryer balls does not significantly affect the model for drying time when controlling for the weight of the clothes. The parameter estimates indicate that each additional kilogram of clothes adds about 3.5 minutes to the drying time.

I also fit a model that includes a quadratic term for the Weight variable, but the conclusion is the same: dryer balls do not affect drying time.
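That quadratic model can be sketched as follows. The Weight*Weight term adds the quadratic effect; everything else is the same as the earlier model:

```sas
/* sketch: include a quadratic term for the Weight variable */
proc glm data=DryerBalls plots=all;
   class Balls;
   model Time = Weight Weight*Weight Balls / solution;
run;
```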

I am pleased that these more sophisticated statistical analyses give the same conclusions as my son observed while using more elementary methods. Can you think of additional analyses that would be appropriate? Download the data and let me know what you discover.

tags: Data Analysis
August 12, 2013
 

Editor's Note: My 8th grade son, David, created a poster that he submitted to the 2013 ASA Poster Competition. The competition encourages students to display "two or more related graphics that summarize a set of data, look at the data from different points of view, and answer specific questions about the data." In this guest blog, David explains his experiment and analysis. All prize-winning posters are featured in the August 2013 issue of Amstat News.
-- Rick

Hi! My name is David Wicklin. This blog post describes a research project and statistical graphs that I created for the 2013 ASA Poster Competition.

My poster began one day when my sister saw a small box on a store shelf that contained two spiked plastic balls. The balls were about the size of tennis balls. The box said that you should put these "dryer balls" in your dryer with a load of wet laundry. The box claimed that the balls would "reduce drying time by up to 25%." I was skeptical. Could putting plastic balls in the dryer really reduce drying time? I was looking for a project for my school science fair, so I decided to conduct an experiment to find out.

My family's dryer has a "dampness sensor" that beeps when a load of laundry is dry. I decided to use this feature and a stop watch to determine the drying time for a load of laundry. For half the loads, I'd toss in two dryer balls. For others I would not.

Experimental Method

The first issue was how to design the experiment. One option was a controlled experiment in which I would select clothes of different materials (jeans, sweatshirts, tee shirts, and so forth) and different quantities (small, medium, and large loads). I would wash and dry each load twice: once with dryer balls and once without. That would make comparing drying times easy because the only varying factor would be the dryer balls; the clothes would be exactly the same.

However, from a practical point of view, a controlled experiment was not ideal. It would require wasting time and energy to wash and dry items twice. Plus, my family had real laundry that needed washing and drying! My dad suggested that we wash the regular family clothes and compare the mean drying times. If I washed enough laundry, I should be able to account for the random differences between the loads, such as the materials being dried. I would weigh the clothes to keep track of the size of the load.

I got my mom to agree to let me dry our family's laundry over the next few months. As you might suspect, she was very much in favor of this activity! I dried a total of 40 loads of laundry. For each load, I weighed the wet clothes by using a spring scale. For 20 randomly selected loads, I added two dryer balls to the dryer. The other 20 loads had no balls, just the clothes. Each load was dried using the same dryer setting. All types of clothing were used to randomize the samples. For each load, I recorded the weight and the drying time in a notebook.

Analyzing and graphing the data

I created two different graphs. The first is a side-by-side box and whisker diagram. This shows that the average drying times are not separated by much. For loads without dryer balls, the average drying time was 28.4 minutes. The median drying time was 27.5 minutes, and the first and third quartiles were 22.25 and 34 minutes. For loads with dryer balls added, the average drying time was 28.3 minutes. The median drying time was 28.75 minutes, and the first and third quartiles were 23 and 33 minutes.

The box and whisker diagram shows that it does not matter if you add dryer balls or not. If dryer balls really "reduce drying time by up to 25%," I would expect the red box plot to be shifted down by 5–8 minutes as compared to the blue box. It isn't. The two boxes are essentially the same.

But maybe dryer balls only "work" for big loads (or small loads). To find out, I created a scatter plot that displays the drying time and the weight of the wet clothes. I plotted blue squares for drying times that did not use dryer balls. I plotted red circles for times that included dryer balls.

The scatter plot shows that dryer balls do not affect drying time whether it is a heavy load or a light one. For light loads, it takes about 22 minutes to dry the laundry. For heavy loads it takes about 35 minutes. The red and blue markers are mixed together. If dryer balls actually reduce drying time, I would expect the red markers to be mostly below the blue markers.

The scatter plot enables me to estimate how long it will take to dry clothes, which is pretty cool! If I know that a load of laundry weighs 4 kg, I can predict that the load will take about 27 minutes to dry. I would be very confident to predict that the load will dry between 24 and 34 minutes.

The drying time obviously increases with the weight of the wet clothes. However, it is not clear if the drying time increases in a straight line or maybe a curve.

Conclusions and Reflections

A limitation of my experiment is that I only used one washer and dryer. Our family's dryer is fairly new and our washer is "high efficiency," which means that it spins out a lot of the water. I have not done the experiment with an older, conventional washer.

Based on my results, I conclude that the mean drying time does not depend on whether dryer balls are used. In particular, dryer balls do not “reduce drying time by up to 25%.” The observed difference was about 1%, but that could be due to randomness. The time required to dry clothes depends on the weight of the wet clothes, but using dryer balls does not make any difference in the time it takes to dry the clothes, whether the load is small or large.

I am very happy to report that this poster was awarded second place in the nation for the 7–9 grade level! I am excited to be chosen for this honor. I have to thank my dad because without him I never would have been motivated enough to make this poster or record and plot 40 loads of laundry.

I was presented with the award at a very special ceremony that featured the 2013 president of the ASA, Marie Davidian, a professor in the Statistics Department at NC State University. Also at the ceremony was the 2012 president, Bob Rodriguez, a senior director at SAS. They gave me a plaque and I told them all about my project. They were really nice and very friendly. It really meant a lot to me that they would take time out of their busy day to show interest in my poster and to encourage me to pursue science and math in the future.

tags: Data Analysis, Just for Fun
July 31, 2013
 

Do you have dozens (or even hundreds) of SAS data sets that you want to read into SAS/IML matrices? In a previous blog post, I showed how to iterate over a series of data sets and analyze each one. Inside the loop, I read each data set into a matrix X. I computed some quantity, and then reused X for the next data set. This technique works well, but it assumes that you only need to analyze one matrix at a time. How can you read the data sets into different matrices so that all the matrices reside in memory concurrently?

The trick is to use the SAS/IML VALSET routine to assign the contents of each data set to a unique matrix name. As I showed in a previous article, the VALSET call can dynamically create matrices whose names are specified at run time. This enables you to automate the reading of data sets into SAS/IML matrices.

The following program creates three data sets and reads the data into SAS/IML matrices whose names are identical to the data set names:

/* create square matrices in data sets */
data A;
  x=1; y=2; output;  x=2; y=3; output;
run;
data B;  set A;  x = x + 2; run;
data C;  set A;  y = y + 1; run;
 
/* read data into SAS/IML matrices of the same name */
proc iml;
dsNames = {A B C};                /* specify names of data sets */
MatNames = dsNames;               /* specify names of matrices */ 
do i = 1 to ncol(dsNames);
   use (dsNames[i]);
   read all var _NUM_ into X;    /* read data into X */
   call valset(MatNames[i], X);  /* copy data to matrix with specified name */
   close (dsNames[i]);
end;
free i X;
show names;

The READ statement reads all numerical data into the matrix X. The VALSET call copies data from X into a matrix whose name is stored in MatNames[i]. The SHOW NAMES statement displays the names of all matrices that have values. From the output, you can see that the program creates matrices named A, B, and C.

For this example, the names of the matrices are the same as the names of the data sets. Of course, the matrices could have different names. You can assign MatNames to be any array of valid names. For example, the following statement specifies the matrix names as X1, X2, and X3:

MatNames = "x1":("x"+strip(char(ncol(dsNames))));

You can use these ideas to read hundreds of data sets into SAS/IML matrices. For example, suppose that you need to create SAS/IML matrices for all of the numerical data in a library. You can use the DATASETS function to obtain the names of all data sets in a library. There are two special cases that you potentially need to handle:

  • Some data sets might not have any numeric variables. To protect against that case, use the _ALL_ keyword to read the data. If a data set contains even one numerical variable, it will be read into the X matrix. If not, the matrix X will be a character matrix, and you can use the TYPE function to check whether X is numeric before you call the VALSET subroutine.
  • Some data sets might be empty. To protect against this case, use the FREE statement to free the X matrix after you read each data set. If the next data set is empty, then the X matrix will be undefined. Again, the TYPE function can check for this situation.

The following SAS/IML program reads all of the data sets in the SASHELP library. On my version of SAS, there are 129 data sets in the library. Two are empty and 10 others do not contain any numerical variables. Consequently, the program defines 117 numerical matrices:

proc iml;
libref = "Sashelp";                /* read from this library */
dsNames = T( datasets(libref) );   /* 129 data sets */
MatNames = dsNames;                /* names of matrices */ 
do i = 1 to ncol(dsNames);
   ds = libref + "." + strip(dsNames[i]);
   use (ds);                       /* open sashelp.filename */
   read all var _ALL_ into X;      /* read data into X */
   if type(X)="N" then             /* exclude data with zero rows or all char */
      call valset(MatNames[i], X); /* copy data to matrix with specified name */
   close (ds);
   free X;                         /* in case of empty data sets */
end;
show names;                        /* 117 numerical matrices defined */

The SHOW NAMES statement shows the names of a few of the matrices that were created from the numerical data in the Sashelp library. These matrices reside in memory concurrently. This technique would be useful if you have a SAS program that creates a bunch of matrices (as data sets) that you want to multiply together.
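For that purpose, the VALUE function is the counterpart of the VALSET routine: it returns the value of a matrix whose name is stored in a string. A sketch that forms the product of all the matrices, assuming (as in the A, B, and C example) that every matrix is square and has the same dimension:

```sas
/* sketch: multiply the matrices that were created by VALSET */
P = value(MatNames[1]);           /* retrieve the first matrix by name */
do i = 2 to ncol(MatNames);
   P = P * value(MatNames[i]);    /* accumulate the matrix product */
end;
print P;
```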

tags: Data Analysis
July 17, 2013
 

In a previous article I discussed how to bin univariate observations by using the BIN function, which was added to the SAS/IML language in SAS/IML 9.3. You can generalize that example and bin bivariate or multivariate data. Over two years ago I wrote a blog post on 2D binning in the SAS/IML language, but that post was written before the SAS/IML 9.3 release, so it does not take advantage of the BIN function.

The image to the left shows a scatter plot of data from the Sashelp.BWeight data set. For 50,000 pregnancies, the birth weight of the baby (in grams) is plotted against the weight gain of the mother (in pounds), adjusted so that the median weight gain is centered at zero. There are 11 subintervals in the horizontal direction and seven subintervals in the vertical direction. Consequently, there are 77 cells, which you can enumerate in row-major order beginning with the upper-left cell.

A two-dimensional binning of these data enables you to find the bin for each observation. Many statistics use binned data. For example, you can do a chi-square test for association, estimate density, and test for spatial randomness. The following SAS/IML statements associate a bin number with each observation. The Bin2D function calls the BIN function twice.

proc iml;
use sashelp.bweight;
read all var {m_wtgain weight} into Z;
close;
 
/* define two-dimensional binning function */
start Bin2D(u, cutX, cutY);
   bX = bin(u[,1], cutX);   /* bins in X direction: 1,2,...,kx */
   bY = bin(u[,2], cutY);   /* bins in Y direction: 1,2,...,ky */
   bin = bX + (ncol(cutX)-1)*(bY-1);     /* bins 1,2,...,kx*ky */
   return(bin);
finish;
 
cutX = do(-35, 75, 10);            /* cut points in X direction */
cutY = do(0, 7000, 1000);          /* cut points in Y direction */
b = Bin2D(Z, cutX, cutY);          /* bin numbers 1-77 */

I wrote the Bin2D function to return an index, but you can convert the index into a subscript if you prefer. (For example, the index 77 corresponds to the subscript (11, 7).) The index is useful for tabulating the number of observations in each bin. The TABULATE call counts the number of observations in each bin, which you can visualize by forming a frequency table:

call tabulate(binNumber, Freq, b);           /* count in each bin */
Count = j(ncol(cutY)-1, ncol(cutX)-1, 0);    /* allocate matrix */
Count[binNumber] = Freq;                     /* fill nonzero counts */
print Count[c=('bX1':'bX11') r=('bY1':'bY7')];

Notice that the Y axis is reversed on the scatter plot. This makes it easy to compare the scatter plot and the tabular frequencies.

If you want to overlay the frequencies on the scatter plot, you can write the cell counts to a SAS data set and use PROC SGPLOT. The result is shown in the following figure. You can download the program used to create the figures in this blog post.

In summary, you can use the BIN function in the SAS/IML language to bin data in many dimensions. The BIN function is simpler than using the DATA step or PROC FORMAT. It handles unevenly spaced intervals and semi-infinite intervals. It's also fast. So next time you you want to group continuous data into intervals—or higher-dimensional cells!— give it a try.

tags: Data Analysis
7月 172013
 

In a previous article I discussed how to bin univariate observations by using the BIN function, which was added to the SAS/IML language in SAS/IML 9.3. You can generalize that example and bin bivariate or multivariate data. Over two years ago I wrote a blog post on 2D binning in the SAS/IML language, but that post was written before the SAS/IML 9.3 release, so it does not take advantage of the BIN function.

The image to the left shows a scatter plot of data from the Sashelp.BWeight data set. For 50,000 pregnancies, the birth weight of the baby (in grams) is plotted against the weight gain of the mother (in pounds), adjusted so that the median weight gain is centered at zero. There are 11 subintervals in the horizontal direction and seven subintervals in the vertical direction. Consequently, there are 77 cells, which you can enumerate in row-major order beginning with the upper-left cell.

A two-dimensional binning of these data enables you to find the bin for each observation. Many statistics use binned data. For example, you can do a chi-square test for association, estimate density, and test for spatial randomness. The following SAS/IML statements associate a bin number with each observation. The Bin2D function calls the BIN function twice.

proc iml;
use sashelp.bweight;
read all var {m_wtgain weight} into Z;
close;
 
/* define two-dimensional binning function */
start Bin2D(u, cutX, cutY);
   bX = bin(u[,1], cutX);   /* bins in X direction: 1,2,...,kx */
   bY = bin(u[,2], cutY);   /* bins in Y direction: 1,2,...,ky */
   bin = bX + (ncol(cutX)-1)*(bY-1);     /* bins 1,2,...,kx*ky */
   return(bin);
finish;
 
cutX = do(-35, 75, 10);            /* cut points in X direction */
cutY = do(0, 7000, 1000);          /* cut points in Y direction */
b = Bin2D(Z, cutX, cutY);          /* bin numbers 1-77 */
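If you want to sanity-check the row-major index arithmetic outside of SAS, the same logic can be sketched in a few lines of Python. (This is an illustration, not SAS code; `bisect_right` stands in for the BIN function, and the treatment of the maximum cut point is my assumption about BIN's behavior.)

```python
from bisect import bisect_right

def bin1d(x, cuts):
    """1-based bin number of x for sorted cut points (analogue of BIN)."""
    if x == cuts[-1]:                  # assume the maximum cut point
        return len(cuts) - 1           # belongs to the last bin
    return bisect_right(cuts, x)       # 1..len(cuts)-1 for in-range x

def bin2d(x, y, cutX, cutY):
    """Row-major 1-based cell index, mirroring the Bin2D module above."""
    bX = bin1d(x, cutX)                # 1, 2, ..., kx
    bY = bin1d(y, cutY)                # 1, 2, ..., ky
    kx = len(cutX) - 1                 # number of bins in the X direction
    return bX + kx * (bY - 1)          # 1, 2, ..., kx*ky

cutX = list(range(-35, 76, 10))        # 12 cut points -> 11 bins
cutY = list(range(0, 7001, 1000))      #  8 cut points ->  7 bins
print(bin2d(-35, 0, cutX, cutY))       # first cell
print(bin2d(74, 6999, cutX, cutY))     # last cell
```

The point (74, 6999) lands in bin 11 of 11 horizontally and bin 7 of 7 vertically, so its cell index is 11 + 11*(7-1) = 77, which matches the bin numbering described above.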

I wrote the Bin2D function to return an index, but you can convert the index into a subscript if you prefer. (For example, the index 77 corresponds to the subscript (11, 7).) The index is useful for tabulating the number of observations in each bin. The TABULATE call counts the number of observations in each bin, which you can visualize by forming a frequency table:

call tabulate(binNumber, Freq, b);           /* count in each bin */
Count = j(ncol(cutY)-1, ncol(cutX)-1, 0);    /* allocate matrix */
Count[binNumber] = Freq;                     /* fill nonzero counts */
print Count[c=('bX1':'bX11') r=('bY1':'bY7')];
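The fill-in step (allocate a zero matrix, then scatter the nonzero frequencies into it) can be mimicked in Python with a `Counter`; the bin numbers below are made-up values for a small 3x2 grid, not the birth-weight data:

```python
from collections import Counter

# Hypothetical bin numbers for ten observations on a 3-column, 2-row grid
b = [1, 1, 2, 6, 6, 6, 3, 5, 5, 1]
kx, ky = 3, 2                                # bins in X and Y

freq = Counter(b)                            # bin number -> count
# Build a ky x kx table, leaving zeros for empty cells
count = [[freq.get(row*kx + col + 1, 0) for col in range(kx)]
         for row in range(ky)]
print(count)
```

As in the IML version, cells with no observations keep their zero from the allocation step rather than being dropped from the table.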

Notice that the Y axis is reversed on the scatter plot. This makes it easy to compare the scatter plot and the tabular frequencies.

If you want to overlay the frequencies on the scatter plot, you can write the cell counts to a SAS data set and use PROC SGPLOT. The result is shown in the following figure. You can download the program used to create the figures in this blog post.

In summary, you can use the BIN function in the SAS/IML language to bin data in many dimensions. The BIN function is simpler than using the DATA step or PROC FORMAT. It handles unevenly spaced intervals and semi-infinite intervals. It's also fast. So the next time you want to group continuous data into intervals (or higher-dimensional cells!), give it a try.

tags: Data Analysis
7月 152013
 

It is often useful to partition the observations of a continuous variable into a small number of intervals, called bins. This familiar process occurs every time that you create a histogram, such as the one on the left. In SAS you can create this histogram by calling the UNIVARIATE procedure. Optionally, if you want to create a data set that contains the count of the observations in each bin, you can use the OUTHIST= option as shown in the following statements:

ods select Histogram;
proc univariate data=sashelp.cars;
   var Weight;
   histogram Weight / endpoints=(0 to 8000 by 1000) 
                      barlabel=count outhist=BinCount;
run;
/* proc print data=BinCount; run; */ /* optional print */

The end points of the bins are called the cut points. In the histogram, the cut points are evenly spaced: 0, 1000, 2000, ..., 8000. The nine cut points define eight bins of equal width. For the Sashelp.Cars data, no vehicle weighs less than 1000 pounds, so the first bin is empty.

Two related tasks are not supported by the UNIVARIATE procedure: finding the bin number for each observation, and using unequally spaced bins. You could write a DATA step for both of these tasks, but the SAS/IML language provides the BIN function, which is simpler to use. (The BIN function was introduced in SAS/IML 9.3.) The following example uses the same Sashelp.Cars data:

proc iml;
use sashelp.cars;
read all var {Weight} into x;
close sashelp.cars;
 
cutpts = do(0, 8000, 1000);
b = bin(x, cutpts);         /* i_th element is 1-8 to indicate bin */

The b vector indicates to which bin each observation belongs. The value of b[i] is j when x[i] is in the jth bin. If desired, you can use that information to tabulate the counts in each bin, thus creating a tabular version of the histogram:

call tabulate(BinNumber, Freq, b);
print Freq[colname=(char(BinNumber,2))];
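For evenly spaced cut points, the bin number reduces to integer arithmetic, which you can verify in any language. Here is a rough Python sketch of that idea (the weights are illustrative values, not pulled from Sashelp.Cars, and the endpoint handling is my assumption):

```python
def bin_even(x, lo, hi, width):
    """1-based bin number for cut points lo, lo+width, ..., hi.
       Out-of-range values get None; hi itself goes into the last bin."""
    if x < lo or x > hi:
        return None
    k = (hi - lo) // width                  # number of bins
    return min((x - lo) // width + 1, k)

weights = [2778, 3230, 4451, 7190]          # illustrative vehicle weights
print([bin_even(w, 0, 8000, 1000) for w in weights])
```

For example, a 2778-pound vehicle falls in the third bin, [2000, 3000), just as in the PROC UNIVARIATE histogram above.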

The HISTOGRAM statement in PROC UNIVARIATE does not permit you to use unevenly spaced bins, but the BIN function does. In fact, the BIN function supports two special missing values. The special SAS missing value .M is interpreted as "negative infinity" and the missing value .I is interpreted as "positive infinity." An interval of the form (.M 3000) means "all observations less than 3000" and an interval of the form [4000 .I) means "all observations greater than or equal to 4000." For example, the following cut points define two semi-infinite intervals and three other unevenly spaced bins. A call to the BIN function assigns each observation to the correct bin.

cutpts = {.M 3000 3400 3600 4000 .I};
r = bin(x, cutpts);
 
call tabulate(BinNumber, Freq, r);
lbls = {"< 3000" "3000-3400" "3400-3600" "3600-4000" "> 4000"};
print Freq[colname=lbls];
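The semi-infinite intervals translate naturally to floating-point infinities in other languages. Here is a rough Python analogue of the cut points above (my own translation for illustration, not SAS):

```python
from bisect import bisect_right
from math import inf

cutpts = [-inf, 3000, 3400, 3600, 4000, inf]   # mimics {.M ... .I}

def bin_uneven(x, cuts):
    """1-based bin number; every finite x lands in exactly one bin here."""
    return bisect_right(cuts, x)

weights = [2500, 3000, 3500, 3999, 6000]       # illustrative values
print([bin_uneven(w, cutpts) for w in weights])
```

Note that a value equal to an interior cut point (such as 3000) goes into the bin to its right, consistent with half-open intervals of the form [a, b).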

As I have shown previously, you can use the BIN function in conjunction with the QNTL function to group observations based on quantiles.
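To sketch the quantile idea outside of SAS, `statistics.quantiles` can stand in for the QNTL call and `bisect_right` for BIN; the data below are made up for illustration:

```python
from bisect import bisect_right
from statistics import quantiles

data = [12, 5, 7, 21, 3, 18, 9, 14, 2, 25, 11, 6]
cuts = quantiles(data, n=4)        # three interior quartile cut points
groups = [bisect_right(cuts, x) + 1 for x in data]   # quartile group 1..4
print(groups)
```

With quartile cut points, the twelve observations split evenly into four groups of three, which is the usual goal of quantile binning.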

tags: Data Analysis, Getting Started