data analysis

February 2, 2011
 
One of my New Year's resolutions is to learn a new area of statistics. I'm off to a good start, because I recently investigated an issue which started me thinking about spatial statistics—a branch of statistics that I have never formally studied.

During the investigation, I asked myself: Given a configuration of points in a rectangle, how can I compute a statistic that indicates whether the configuration is uniformly spread out in the region?

One way to investigate this question is to look at the density of points in the configuration. Are the points spread out or clumped together?

Binning the Points

In the one-dimensional case, you can use a histogram to visually examine the density of univariate data. If the bars of the histogram have roughly the same height, then you have visual evidence that the data are spread uniformly.

Of course, you don't really need the histogram: the counts in each bin suffice. You can divide the range of the data into N equally spaced subintervals and write SAS/IML statements that count the number of points that fall into each bin.
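For example, the following statements are a minimal sketch of that computation; the sample size and the number of bins are arbitrary choices:

proc iml;
call randseed(1);
x = j(100, 1);
call randgen(x, "uniform");      /** univariate sample on [0,1] **/

N = 10;                          /** number of bins **/
cutpts = do(0, 1, 1/N);          /** N equally spaced subintervals **/
count = j(1, N, 0);
do i = 1 to N;                   /** count points in each bin **/
   count[i] = sum((x >= cutpts[i]) & (x < cutpts[i+1]));
end;
print count;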

It is only slightly harder to compute the number of points that fall into bins of a two-dimensional histogram or, equivalently, into cells of a two-dimensional grid:

  1. For each row in the grid...
  2. Extract the points whose vertical coordinates indicate that they are in this row. (Use the LOC function for this step!)
  3. For these extracted points, apply the one-dimensional algorithm: count the number of points in each horizontal subdivision (column in the grid).
The following SAS/IML module uses this algorithm to count the number of points in each cell. Notice that the algorithm loops over the subdivisions of the grid, rather than over the observations.

proc iml;
/** Count the number of points in each cell of a 2D grid.
    Input: p    - a kx2 matrix of 2D points
           XDiv - a subdivision of the X coordinate direction
           YDiv - a subdivision of the Y coordinate direction
    Output: The number of points in p that are contained in 
            each cell of the grid defined by XDiv and YDiv. **/
start Bin2D(p, XDiv, YDiv);
   x = p[,1];       y = p[,2];
   Nx = ncol(XDiv); Ny = ncol(YDiv);
   
   /** Loop over the cells. Do not loop over observations! **/
   count = j(Ny-1, Nx-1, 0); /** initialize counts for each cell **/
   do i = 1 to Ny-1; /** for each row in grid **/
      idx = loc((y >= YDiv[i]) & (y < YDiv[i+1]));
      if ncol(idx) > 0 then do;
         t = x[idx]; /** extract points in this row (if any) **/
         do j = 1 to Nx-1; /** count points in each column **/
            count[i,j] = sum((t >= XDiv[j]) & (t < XDiv[j+1]));
         end;
      end;
   end;
   return( count );
finish;

Testing the Binning Module

A convenient way to test the module is to generate 100 random uniform points in a rectangle and print out the counts. The following example counts the number of points in each of 20 bins. There are five bins in each horizontal row and four bins in each vertical column.

call randseed(1234);
p = j(100,2);
call randgen(p, "uniform"); /** 100 random uniform points **/

/** define bins: 5 columns and 4 rows on [0,1]x[0,1] **/
XMin = 0; XMax = 1; dx = 1/5;
YMin = 0; YMax = 1; dy = 1/4;
XDiv = do(XMin, XMax, dx);
YDiv = do(YMin, YMax, dy);

count = Bin2D(p, XDiv, YDiv);

A visualization of the random points and the grid is shown in the scatter plot. From the graph, you can quickly estimate that each bin has between 1 and 10 points. The exact counts are contained in the count matrix.

Notice that the Y axis of the scatter plot has large values of Y at the top of the image and small values at the bottom of the image. In order to compare the values of the count matrix with the scatter plot, it is helpful to reverse the order of the rows in count:


/** reverse rows to conform to scatter plot **/
count = count[nrow(count):1,];
c = char(XDiv, 3); /** form column labels **/
r = char(YDiv[ncol(YDiv)-1:1], 4); /** row labels (reversed) **/
print count[colname=c rowname=r];

count       0    0.2    0.4    0.6    0.8
0.75        2      3      6      7      7
0.5         5      9      7      4      1
0.25        9      2      7      2      7
0           7      5      2      2      6

You can use these bin counts to investigate the density of the point pattern. I'll present that idea on Friday.

January 28, 2011
 
I sing in the SAS-sponsored VocalMotion show choir. It's like an adult version of Glee, except we have more pregnancies and fewer slushie attacks.

For many musical numbers, the choreographer arranges the 20 performers on stage in an orderly manner, such as four rows of five singers. But every once in a while the director wants to create a disorganized look for a portion of a song (for example, a free-form '60's segment). She tells the performers to "go to random positions."

The choir always has problems achieving the look that the director wants. We end up in positions that are equally spaced from one another, but to the audience it often looks like we tried to arrange ourselves in rows, but failed miserably!

Since I know what a random point pattern looks like, I try to help. I whisper "clump more" and "vary the sizes of the gaps between performers." I tell my fellow performers, "random uniform does not imply uniformly spaced," but mostly they just ignore me or roll their eyes.

This year I took it upon myself to solve the problem, and I discovered that it led to some interesting statistics.

The Experiment

My first attempt to help the director was to generate many examples of random point patterns. Our stage is approximated by the rectangle S = [0, 2] x [0, 1]. It is easy to generate 20 points that are uniformly distributed in S. My idea was that the director would have 10–12 of these patterns. When she wants a random configuration, she can hold up a pattern and say, "go to this configuration." I sent the director the patterns on the adjacent image.
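For the record, generating one such pattern takes only a few SAS/IML statements. Here is a minimal sketch (the seed value is arbitrary):

proc iml;
call randseed(321);           /** seed is arbitrary **/
p = j(20, 2);                 /** 20 performers, 2 coordinates **/
call randgen(p, "uniform");   /** uniform on [0,1] x [0,1] **/
p[,1] = 2 * p[,1];            /** stretch x to the stage [0,2] x [0,1] **/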

I was quite proud of myself. But guess what?

She didn't like them.


Random Uniform versus Evenly Spaced

The director has been telling us to go to random positions, so how can she not like the random configurations that I sent her? She explained that there are two characteristics that she doesn't like:

  1. Large areas of the stage are devoid of performers, which I translate to mean "the point patterns have too many gaps."
  2. Some of the performers are too close to one another, which I translate to mean "the point patterns have too many clusters."
From her comments, I conclude that I have been wrong all these years. She does not actually want the performers in random uniform positions on stage! The characteristics that she mentions are intrinsic to random uniform point patterns, so if she doesn't want those characteristics, she doesn't want uniform random.

So what does she want?

A Perturbation Algorithm

It was time for Plan B. I talked to the director again and heard her say that she wants the performers in a configuration that is close to a regular grid, but not too close. It has to "look random." To me, this suggests a perturbation approach to the problem: start by arranging the performers on a regular grid, but then perturb that arrangement by a small amount.

I thought about how to physically do this on stage and came up with the following algorithm:

  1. Arrange the performers on a regular grid, such as four rows of five performers.
  2. Ask the performers to turn in place for 5–10 seconds. Each performer chooses his or her own speed and direction. When the director calls "stop turning," each performer should be pointing in a random direction.
  3. Each performer takes a step forward in whatever direction he or she is facing.
This algorithm results in a perturbation of a regular grid in which each point is perturbed (jittered) by a small amount in an arbitrary direction. But does this perturbation give the look that my director wants? There's only one way to find out: implement the algorithm and look at some pictures.

Random Jittering of a Regular Grid

Forming a regular grid with SAS/IML software is no problem: I can use the Define2DGrid module, which I have previously described. It is also easy to use the RANDSEED and RANDGEN subroutines to generate angles that are distributed uniformly in [0, 2 pi]. The following SAS/IML module perturbs each position on a regular grid:

proc iml;
/** Module to define ordered pairs that are a random jittering
    of a uniform grid of points.
    The first six input parameters are the same as for Define2DGrid.
    The last parameter, delta, is the size of the jitter.

    Output: Ordered pairs are returned in a matrix with
         (Nx x Ny) rows and two columns.
**/
start GridJitter(XMin, XMax, Nx, YMin, YMax, Ny, delta);
   /** initial placement (grid) **/
   Lx = XMax - XMin; Ly = YMax - YMin;
   dx = Lx / Nx;     dy = Ly / Ny;
   load module=Define2DGrid;
   xy = Define2DGrid(dx/2, Lx-dx/2,Nx, dy/2, Ly-dy/2,Ny);
   x0 = xy[,1];  y0 = xy[,2];

   /** generate random directions in [0, 2 pi] **/
   NPts = Nx * Ny;
   pi = constant('PI');
   angle = j(NPts, 1);
   call randgen(angle, "uniform");  /** U[0,1] **/
   angle = 2 * pi * angle;          /** scale to [0, 2 pi] **/

   /** each person takes one step in some direction **/
   x = x0 + delta * cos(angle);
   y = y0 + delta * sin(angle);

   /** don't let anyone fall off the stage! **/
   x = choose(x<0, 0, choose(x>Lx, Lx, x));
   y = choose(y<0, 0, choose(y>Ly, Ly, y));
   return ( x || y );
finish;

call randseed(654321);
r = 0.1; /** size of a small step **/
/** jitter 4x5 grid on the "stage" [0,2]x[0,1] **/
Jitter = GridJitter(0,2,5, 0,1,4, r); 

The results of several runs of this algorithm are shown in the following image. I showed the point patterns to the director. "Yes!" she exclaimed. "These are the kinds of configurations that I'm looking for."


Conclusions and Next Steps

One conclusion is obvious. In the initial stages of this project, I violated the Fundamental Rule of Consulting, which is "listen to a client in order to truly understand the problem." I assumed that my notion of "random placement" was the same as my director's, but it wasn't.

I am now excited to begin the next step of this process: a statistical analysis of point patterns.

Regular readers of my blog will recall that one of my New Year's resolutions is to learn something about an area of statistics that is new to me. I've already read a few articles and book chapters that will undoubtedly spark future blog posts about spatial point patterns.

For example, my director emphasized two features of point patterns: clustering and distance between adjacent points. In future blog posts, I will analyze statistics that are associated with these characteristics.

But—shhh!—don't tell the choir that I'm computing these statistics. I'd prefer not to get doused with a slushie.

January 26, 2011
 
Next week's blog entry will build on this one, so I want you to take notes, OK?

It's not headline news that in most cases, the best way to handle a repeated measures analysis is with a mixed models approach, especially for Normal responses (for other distributions in the exponential family, GLIMMIX also has some advantages over GEEs. But that's another blog post for another day). You have more flexibility in modeling the variances and covariances among the time points, data do not need to be balanced, and a small amount of missingness doesn't spell the end of your statistical power.

But sometimes the data you have are living in the past: arranged as if you were going to use a multivariate repeated measures analysis. This multivariate data structure arranges the data with one row per subject, each time-based response a column in the data set. This enables the GLM procedure to set up H matrices for the model effects, and E and T matrices for testing hypotheses about those effects. It also means that if a subject has any missing time points, you lose the entire subject's data. I've worked on many repeated measures studies in my pre-SAS years, and I tell you, I've been on the phones, email, snail mail, and even knocked on doors to try to bring subjects back to the lab for follow-ups. I mourned over every dropout. To be able to use at least the observations you have for a subject before dropout would be consolation to a weary researcher's broken heart.

Enter the mixed models approach to repeated measures. But, your data need to be restructured before you can use MIXED for repeated measures analysis. This is, coincidentally, the same data structure you would use for a univariate repeated measures, like in the old-olden days of PROC ANOVA with hand-made error terms (well, almost hand-made). Remember those? The good old days. But I digress.

The MIXED and GLIMMIX procedures require that the data be in the univariate structure, with one row per measurement. Notice that these procedures still use complete case analysis (CCA), but now the "case" is different. Instead of a subject, which in the context of a mixed model can be many things at once (a person, a clinic, a network...), the "case" is one measurement occurrence.

How do you put your wide (multivariate) data into the long (univariate) structure? Well, there are a number of ways, and to some extent it depends on how you have organized your data. If the multivariate response variable names share a prefix, then this macro will convert your data easily.
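If you would rather see the idea in code, here is a minimal sketch of the wide-to-long restructuring using PROC TRANSPOSE rather than the macro itself; the data set WIDE, the responses Y1-Y4, and the SUBJECT variable are hypothetical names:

/* wide data: one row per subject, responses Y1-Y4 */
proc sort data=wide; by subject; run;

proc transpose data=wide out=long(rename=(col1=Y)) name=Time;
   by subject;      /* requires a subject identifier */
   var Y1-Y4;       /* each response becomes one row  */
run;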

What if you want to go back to the wide structure (for example, to create graphs to profile subjects over time)? There's a macro for that as well.

What if your variables do not share a prefix, but instead have different names (such as SavBal, CheckBal, and InvestAmt)? Then you will need an alternative strategy. For example:


This needs some rearrangement, but there are two issues. First, there is no subject identifier, and I will want this in next week's blog when I fit a mixed model. Second, the dependent variables are not named with a common prefix. In fact, they aren't even measured over time! They are three variables measured for one person at a given time. (I'll explain why in next week's blog).
So, my preference is to use arrays to handle this:
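A minimal sketch of that DATA step (the input and output data set names are assumptions) creates a subject identifier and writes one row per measurement:

data long;
   set wide;                                 /* one row per person, wide structure */
   length VarName $ 32;
   Subject = _N_;                            /* create a subject identifier */
   array resp{3} SavBal CheckBal InvestAmt;  /* the three response variables */
   do i = 1 to dim(resp);
      VarName = vname(resp{i});              /* which response this row holds */
      Value   = resp{i};                     /* the measured value */
      output;                                /* write one row per measurement */
   end;
   keep Subject VarName Value;
run;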
Which results in a long data set that has one row per measurement: three rows per person, one each for SavBal, CheckBal, and InvestAmt.
I tip my hat to SAS Tech Support, who provide the %MAKELONG and %MAKEWIDE macros and to Gerhard Svolba, who authored them. If someone wants to turn my arrays into a macro, feel free. I'll tip my hat to you, too.

Tune in next week for the punchline to the joke:
"Three correlated responses walk into a bar..."
January 18, 2011
 
A colleague posted some data on his internal SAS blog about key trends in the US Mobile phone industry, as reported by comScore. He graciously shared the data so that I could create a graph that visualizes the trends.


The plot visualizes trends in the data: the Android phone is gaining market share, primarily at the expense of the Blackberry, while the iPhone is holding steady.

Plotting the essential features of the data requires only two statements in the SGPLOT procedure:

title "Top Smartphone Platforms";
title2 "Total US Smartphone Subscribers, Ages 13+";
footnote "Source: comScore MobiLens service";

proc sgplot data=Mobile;
series x=Date y=Pctval / group=Type break curvelabel;
xaxis interval=Month;
run;

  1. The SERIES statement plots the lines.
    • The GROUP= option groups the lines according to the platform and automatically assigns unique colors for each platform.
    • The BREAK option is used because the data contain a missing value for March of 2010. (The missing data is what makes this plot unusual!) If you omit the BREAK option, each line continues from the February value to the April value without interruption.
    • The CURVELABEL option labels each curve in the plot. Without the option, the SGPLOT procedure would automatically insert a legend.
  2. The XAXIS statement causes the horizontal axis to display tick marks for each month. This statement is optional, but without the statement the plot displays ticks every three months, which I don't like.
A comment that I received about a previous data analysis reminded me that some SAS customers are just now upgrading to SAS 9.2, and might not have much experience with PROC SGPLOT. For me, PROC SGPLOT has become a favorite tool for creating static images with minimal effort. Of course, for dynamically linked graphics from SAS/IML programs, I use SAS/IML Studio.
January 14, 2011
 
A colleague related the following story: He was taking notes at a meeting that was attended by a fairly large group of people (about 20). As each person made a comment or presented information, he recorded the two-letter initials of the person who spoke. After the meeting was over, he was surprised to discover that all of the initials of the people in the room were unique! Nowhere in his notes did he write "JS said..." and later wonder "Was that Jim Smith or Joyce Simpson?"

My colleague asked, "If 20 random people are in a room, do they usually have different initials or is it common for two people to share a pair of initials?" In other words, was his experience typical or a rare occurrence?

The Distribution of Initials at a Large US Software Company

In order to answer that question, it is necessary to know the distribution of initials in his workplace.

Clearly, the distribution of initials depends on the population of the people in the workplace. In some cultures, names that begin with X or Q are rare, whereas in other cultures names that begin with those letters (when phonetically translated into English) are more common.

SAS is a large US software company with a diverse base of employees, so I decided to download the names of 4,502 employees that work with me in Cary, NC, and write a DATA step program that extracts the first and last initials of each name.
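A minimal sketch of such a DATA step is shown below; the Roster data set and its FirstName and LastName variables are assumptions about how the names are stored:

data Employees;
   set Roster;                                   /* hypothetical input data set */
   length I1 I2 $ 1;
   I1 = upcase(substr(strip(FirstName), 1, 1));  /* first initial */
   I2 = upcase(substr(strip(LastName),  1, 1));  /* last initial  */
   keep I1 I2;
run;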

You can use the FREQ procedure to compute the frequencies of the first initial (I1), the last initial (I2), and the frequency of the initials taken as a pair. The following statements output the frequency of the initials in decreasing order:

proc freq data=Employees order=freq;
tables I1 / out=I1Freq;
tables I2 / out=I2Freq;
tables I1*I2 / out=InitialFreq missing sparse noprint;
run;

As an example, I can display the relevant frequencies for my initials (RW) as well as the initials of SAS cofounders Jim Goodnight and John Sall:

data SASUSER.InitialFreq; set InitialFreq; 
Initials = I1 || I2;
run;

proc print data=SASUSER.InitialFreq
   (where=(Initials="RW" | Initials="JG" | Initials="JS"));
run;

Obs    I1    I2    COUNT    PERCENT    Initials
  1    J     S        61    1.35495    JS
 10    J     G        21    0.46646    JG
214    R     W        18    0.39982    RW

The initials "JS" are the most frequent initials in my workplace, with 61 employees (1.35%) having those initials. The initials "JG" are also fairly common; they are the 10th most popular initials. My initials are less common and are shared by only 0.4% of my colleagues.

If you want to conduct your own analysis, you can download a comma-separated file that contains the initials and frequencies.

You can use PROC SGPLOT to display bar charts for the first and last initials.

The bar charts show that J, M, S, D, and C are the most common initials for first names, whereas S, B, H, M, and C are the most common initials for last names.

In contrast, U, Q, and X are initials that do not appear often for either first or last names. For first initials, the 10 least popular initials cumulatively occur less than 5% of the time. For last initials, the 10 least popular initials cumulatively occur about 8% of the time.

Clearly, the distribution of initials is far from uniform. However, for the note-taker, the important issue is the distribution of pairs of initials.


The Distribution of Two-Letter Initials

By using the PROC FREQ output, you can analyze the distribution at my workplace of the frequencies of the 26² = 676 pairs of initials:

  • More than 30% of the frequencies are zero. For example, there is no one at my workplace with initials YV, XU, or QX.
  • If you ignore the initials that do not appear, then the quantiles of the remaining observations are as follows (see the sketch after this list):
    • The lower quartile is 0.044.
    • The median is 0.133.
    • The upper quartile is 0.333.
  • Three pairs are much more prevalent than the others. The initials JM, JB, and JS each occur more than 1% of the time.
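Here is a minimal sketch of how those quartiles can be computed; it assumes the InitialFreq data set created by the earlier PROC FREQ step:

proc univariate data=InitialFreq(where=(COUNT > 0)) noprint;
   var PERCENT;     /* percentages for the nonzero two-letter frequencies */
   output out=PairQuartiles q1=Q1 median=Median q3=Q3;
run;

proc print data=PairQuartiles; run;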
The distribution of two-letter initials is summarized by the following box plot:


Visualizing the Proportions of Two-Letter Initials


With the help of a SAS Global Forum paper that shows how to use PROC SGPLOT to create a heat map, I created a plot that shows the distribution of two-letter initials in my workplace.

When I create a heat map, I often use the quartiles of the response variable to color the cells in the heat map. For these data, I used five colors: white to indicate pairs of initials that are not represented at my workplace, and a blue-to-red color scheme (obtained from colorbrewer.org) to indicate the quartiles of the remaining pairs. Blue indicates pairs of initials that are uncommon, and red indicates pairs that occur frequently.

In terms of counts, blue indicates pairs of initials that are shared by either one or two individuals, and red indicates 18 or more individuals.


The heat map shows several interesting features of the distribution of pairs of initials:

  • Although W and N are not unusual first initials (1.7% and 1.4%, respectively) and D and F are not unusual last initials (5.0% and 3.2%, respectively), there is no one at my workplace with the initials ND or WF.
  • There are 89 individuals at my workplace who have a unique pair of initials, including YX, XX, and QZ.

You can download the SAS program that is used to produce the analysis in this article.

The Probability of Matching Initials

Computing the probability that a group of people have similar characteristics is called a "birthday-matching problem" because the most famous example is "If there are N people in a room, what is the chance that two of them have the same birthday?"

In Chapter 13 of my book, Statistical Programming with SAS/IML Software, I examine the birthday-matching problem. I review the well-known solution under the usual assumption that birthdays are uniformly distributed throughout the year, but then go on to compare that solution to the more realistic case in which birthdays are distributed in a fashion that is consistent with empirical birth data from the National Center for Health Statistics (NCHS).

Obviously, you can do a similar analysis for the "initial-matching problem." Specifically, you can use the actual distribution of initials at SAS to investigate the question, "What is the chance that two people in a room of 20 randomly chosen SAS employees share initials?" Come back next Wednesday to find out the answer!

January 6, 2011
 
A student in my multivariate class last month asked a question about prior probability specifications in discriminant function analysis:
What if I don't know what the probabilities are in my population? Is it best to just use the default in PROC DISCRIM?


First, a quick refresher of priors in discriminant analysis. Consider a problem of classifying 150 cases (let's say, irises) into three categories (let's say, variety). I have four different measurements taken from each of the flowers.

If I walk through the bog and pick another flower and measure its 4 characteristics, how well can I expect to perform in classifying it as the right variety? One way to derive a classification algorithm is to use linear discriminant analysis.

A linear discriminant function to predict group membership is based on the squared Mahalanobis distance from each observation to the centroid of the group plus a function of the prior probability of membership in that group.
This generalized squared distance is then converted into a score of similarity to each group, and the case is classified into the group it is most similar to.

The prior probability is the probability of an observation coming from a particular group in a simple random sample with replacement.

If the prior probabilities are the same for all three of the groups (also known as equal priors), then the function is only based on the squared Mahalanobis distance.
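To make this concrete, here is a sketch of the generalized squared distance in the linear (pooled-covariance) case, where S is the pooled covariance matrix, \bar{x}_j is the mean of group j, and q_j is the prior probability of group j (the PROC DISCRIM documentation gives the general form):

D_j^2(x) = (x - \bar{x}_j)^T S^{-1} (x - \bar{x}_j) - 2 \ln(q_j)

When the priors are equal, the -2 ln(q_j) term is the same constant for every group, so it does not affect the classification.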




If the prior for group A is larger than for groups B and C, then the function makes it more likely that an observation will be classified as group A, all else being equal.

The default in PROC DISCRIM is equal priors. This default makes sense in the context of developing computational software: the function with equal priors is the simplest, and therefore the most computationally efficient.
PRIORS equal;

Alternatives are proportional priors (using priors that are the proportion of observations from each group in the same input data set) and user-specified priors (just what it sounds like: specify them yourself).
PRIORS proportional;
PRIORS 'A' = .5 'B' = .25 'C' = .25;


Of course this kind of problem is far more interesting when you consider something like people making choices, such as kids choosing an action figure of Tinkerbell, Rosetta, or Vidia. Those certainly don't have equal priors, and if your daughter's anything like mine, she doesn't want to be classified into the wrong group.

So back to the original question:
What if I don't know what the probabilities are in my population? Is it best to just use the default in PROC DISCRIM?

In this case, using the default would probably not be a great idea, as it would assign the dolls with equal probability, all else being equal.

So if not the default, then what should you use? This depends on what you're going to be scoring. Your priors should reflect the probabilities in the population that you will be scoring in the future. Some strategies for getting a decent estimate:
1. Go to historical data to see what the probabilities have been in the past.
2. If your input data set is a simple random sample, use proportional priors.
3. Take a simple random sample from the population and count up the number from each group. This can determine the priors.
4. Combine the probabilities you think are correct with the cost of different types of misclassification.

For example, suppose that, among 4-year-olds, the probabilities of wanting the Tinkerbell, Rosetta, and Vidia action figures are really 0.50, 0.35, and 0.15, respectively. After all, not many kids want to be the villain.
PRIORS 'Tink' = .5 'Rosetta' = .35 'Vidia' = .15

What is the cost of giving a girl the Rosetta doll when she wanted Tinkerbell? What's the cost of giving a girl Vidia when she wanted Rosetta? And so on. A table is shown below (based on a very small sample of three interviews of 4-year-old girls):



Clearly the cost of an error is not the same for all errors. It is far worse to assign Vidia to a girl who doesn't want Vidia than for any other error to occur. Also notice the small detail that Vidia fans would prefer to get Rosetta over Tinkerbell. For birthday party favors, I'd massage those priors to err on the side of giving out Rosetta over Vidia.
PRIORS 'Tink' = .5 'Rosetta' = .4 'Vidia' = .1
Of course, depending on your tolerance for crying, you might just give everyone Rosetta and be done with it. But then, really, isn't variety the spice of life?

I hope this has helped at least one or two of you out there who were having trouble with priors. The same concepts apply in logistic regression with offset variables, by the way. But that's a party favor for another day.
January 6, 2011
 
The Junk Charts blog discusses problems with a chart that (poorly) presents statistics on the prevalence of shark attacks by different species.

Here is the same data presented by overlaying two bar charts by using the SGPLOT procedure. I think this approach works well because the number of deaths is always fewer than the number of attacks, and the visual juxtaposition of deaths and attacks gives the viewer an impression of the relative risk of each shark killing someone during an attack.


The graph was created by using the following SAS program.

data Sharks;
input Species $14. Attacks Deaths;
label Attacks="Attacks" Deaths="Deaths";
datalines;
White         394 61
Tiger         140 28
Bull           98 21
Sandtiger      74  2
Hammerhead     41  0
Blacktip       33  0
Blue           36  4
Blacktip reef  16  0
Shortfin mako  43  2
Gray reef       9  0
;
run;
proc sort data=Sharks; by descending Attacks; run;
data Others;
Species="Others"; Attacks=276; Deaths=9; output;
run;

data Sharks; set Sharks Others; run;

title 'Shark Attacks and Deaths by Species';
footnote justify=left 'Source: Shark Foundation';
footnote2 justify=left 'http://www.shark.ch/Information/Accidents/index.html';
proc sgplot data=Sharks;
   hbar Species / response=Attacks datalabel;
   hbar Species / response=Deaths datalabel;
   xaxis label=" " grid;
   yaxis label=" " discreteorder=data;
run;
January 4, 2011
 
Happy New Year!! This is a good time to think about what was going on here in SAS Education one year ago, and to introduce you to a big project that I'm really excited to "take public."

In January 2010 (as well as throughout 2009), we kept getting cries for help like this one: "Where do we find MBAs with quantitative skills? They need to know predictive modeling, large-scale time series forecasting, cluster analysis, design and analysis of experiments, honest assessment, and model management. Our problems cannot be solved with t-tests and most universities are not teaching the skills business analysts really need in the modern workplace."

Companies want to hire professionals who have a good head for business, an understanding of data analysis, and facility with advanced software for fitting sophisticated models. They have told us that these people are getting harder and harder to find.

Of course, SAS has had training on statistical analysis and software for decades. But analytical directors and business analysts frequently find themselves in a position of needing to know more about how to apply analytics in the business context: how do you make the most effective use of your (massive, messy, opportunistic) data? How do you "think analytically"?

About half of my time in 2010 (with help from some brilliant colleagues, as well) was spent developing a course to address this need: Advanced Analytics for the Modern Business Analyst. The first version was made exclusively available to university professors last August, and many of them are already teaching it in their graduate programs. Degree-seeking students are getting better prepared for handling large-scale data analysis in the so-called "real-world." We were all very excited about launching the course with the universities and worked individually with over 30 professors worldwide to train them and provide the software needed to teach this course.

Starting in 2011 the course will also be available to professionals.

In March, we are offering a nine-week class that combines statistical and software training with the business context: think of it as a "how to solve problems with data analysis" class. It is like having that extra semester of graduate training that you wish had been offered.

This course is also a chance to try out a better way of learning. Instead of just a weekly live web class, this course models the strategies many universities are using to offer their distance learning curricula, making it easier for the student to retain learning from week to week and using multiple learning modes to reinforce understanding (reading, writing, listening, watching).

Each week, students will participate in:
...a 3.5-hour live web lecture with your instructor and classmates.
...practice activities using software on our servers.
...assigned reading of book chapters and articles to support the class topics, as well as discussion forums to exchange ideas about the readings.

The total time commitment is about 5-6 hours a week, roughly the same as a semester-long course.

You can read a little more about it here in the SAS Training Newsletter.

You can see the course outline here. There are only a few seats left, so if you're interested, you might want to jump on it. Since every teaching experience is also a learning experience, I hope that teaching this class will generate lots of new ideas and blog posts in the coming months. I'll let you know how it's coming along.

Does your organization have people who can make sense of masses of data, forecast future behavior, and find the next best strategy to gain a competitive edge?

See you in class!
December 17, 2010
 
When I finished writing my book, Statistical Programming with SAS/IML Software, I was elated. However, one small task still remained.

I had to write the index.

How Long Should an Index Be?

My editor told me that SAS Press would send the manuscript to a professional editor who would index the book for me. They did, and I got back about 30 pages of terms and page numbers! The index shrank a little when I incorporated all those terms into LaTeX, which displays an index in two columns. However, it still occupies 11 pages and includes almost 1200 entries!

I was a little annoyed. The entire book is only 440 pages—why is the index so long? I asked a colleague of mine who is an editor, and she said it didn't seem long to her. She referred me to Technical Writing 101: A Real-World Guide to Planning and Writing Technical Documentation (p. 148):

In technical documentation,... the rule of thumb is that the index should contain approximately one page of two-column index entries for every 20 pages of documentation.
That's crazy, I thought. There's no way my book should have a 22-page index! That might be a valid rule of thumb for documentation, but not for a statistical book.

But if I reject that rule of thumb, how can I determine whether the length of my index is appropriate compared to the length of my book? I decided to do what any sensible statistician would do in this situation: collect data.

Data for Index Length versus Book Length, by Publisher

A colleague and I marched down to the SAS library. We sampled books from three different publishers: SAS Press, Springer, and Wiley. For each book we recorded the number of pages (not including the index) and the number of pages for the index. You can download the data and the SAS/IML Studio program that analyzes the data.


The previous image summarizes the data. It shows the size of each book's index versus the size of the book. The books are colored according to the publisher. Because the data for each publisher appear to be linear except for outliers (for example, the 447-page Springer book with no index, which is a set of conference proceedings), I fit and added robust regression lines computed by the LTS algorithm in the ROBUSTREG procedure. I used a no-intercept model so that I could more easily interpret the slope of the regression line as a ratio that reveals the expected number of index pages, given the number of pages in the book. (The regression lines look similar for a model that includes an intercept term.)

The coefficients of the regression model indicate that SAS Press books average about one page of index for each 26 pages of text. In contrast, the other publishers average about one page of index for each 65 pages of text. In other words, for this sample the professionally compiled SAS Press indexes contain, on average, more than twice as many pages as indexes of comparable books printed by other publishers.

The analysis reveals that the index of my book (marked by an "X") is not too long, at least by the SAS Press standards. In fact, the length of my index is low compared with other SAS Press books, although it is on the high end of Springer and Wiley books of similar length.


I contacted a colleague who wrote one of the Springer books in the sample. I asked whether he had used a professional indexer, or had compiled the index himself. He said that he created the index himself:

We assumed it would cost us [to use a professional indexer], and we were concerned that the indexer wouldn't understand the topic. Doing it myself enabled me to ensure that spelling was consistent. Initially I intended to get students to test the index for me, but at the end of the process I had rather lost my enthusiasm!

I cannot comment on the practices of other publishers, but SAS Press does not charge their authors for this professional service. Still, I share his sense of fatigue: writing an index is hard and tiring work.

If I had compiled the index myself, would I have had the stamina to include 1200 entries? Definitely not. But am I glad that my index is thorough, useful, and professionally compiled? Absolutely.

The cost to me? Three weeks of drudgery. The benefit to my readers? Priceless. Thanks, SAS Press!

December 8, 2010
 
Sample covariance matrices and correlation matrices are used frequently in multivariate statistics. This post shows how to compute these matrices in SAS and use them in a SAS/IML program. There are two ways to compute these matrices:
  1. Compute the covariance and correlation with PROC CORR and read the results into PROC IML
  2. Compute the matrices entirely with PROC IML

Computing a Covariance and Correlation Matrix with PROC CORR

You can use PROC CORR to compute the correlation matrix (or, more correctly, the "Pearson product-moment correlation matrix," since there are other measures of correlation which you can also compute with PROC CORR). The following statements compute the covariance matrix and the correlation matrix for the three numerical variables in the SASHELP.CLASS data set.

ods select Cov PearsonCorr;
proc corr data=sashelp.class noprob outp=OutCorr /** store results **/
          nomiss /** listwise deletion of missing values **/
          cov;   /**  include covariances **/
var Height Weight Age;
run;

The OutCorr data set contains various statistics about the data, as shown by running PROC PRINT:

proc print data=OutCorr; run;

Obs    _TYPE_    _NAME_     Height     Weight        Age
  1    COV       Height     26.287    102.493     6.2099
  2    COV       Weight    102.493    518.652    25.1857
  3    COV       Age         6.210     25.186     2.2281
  4    MEAN                 62.337    100.026    13.3158
  5    STD                   5.127     22.774     1.4927
  6    N                    19.000     19.000    19.0000
  7    CORR      Height      1.000      0.878     0.8114
  8    CORR      Weight      0.878      1.000     0.7409
  9    CORR      Age         0.811      0.741     1.0000

If you want to use the covariance or correlation matrix in PROC IML, you can read the appropriate values into a SAS/IML matrix by using a WHERE clause on the USE statement:


proc iml;
use OutCorr where(_TYPE_="COV");
read all var _NUM_ into cov[colname=varNames];

use OutCorr where(_TYPE_="CORR");
read all var _NUM_ into corr[colname=varNames];
close OutCorr;

Notice that this SAS/IML code is independent of the number of variables in the data set.

Computation of the Covariance and Correlation Matrix in PROC IML

If the data are in SAS/IML vectors, you can compute the covariance and correlation matrices by using matrix multiplication to form the matrix that contains the corrected sum of squares of cross products (CSSCP).

Suppose you are given p SAS/IML vectors x1, x2, ..., xp. To form the covariance matrix for these data:

  1. Use the horizontal concatenation operator to concatenate the vectors into a matrix whose columns are the vectors.
  2. Center each vector by subtracting the sample mean.
  3. Form the CSSCP matrix (also called the "X-prime-X matrix") by multiplying the matrix transpose and the matrix.
  4. Divide by n-1 where n is the number of observations in the vectors.
This process assumes that there are no missing values in the data. Otherwise, it needs to be slightly amended. Formulas for various matrix quantities are given in the SAS/STAT User's Guide.

The following SAS/IML statements define a SAS/IML module that computes the sample covariance matrix of a data matrix. For this example the data are read from a SAS data set, so Step 1 (horizontal concatenation of vectors) is skipped.

proc iml;
start Cov(A);             /** define module to compute a covariance matrix **/
   n = nrow(A);           /** assume no missing values **/
   C = A - A[:,];         /** subtract mean to center the data **/
   return( (C` * C) / (n-1) );
finish;

/** read or enter data matrix into X **/
varNames = {"Height" "Weight" "Age"};
use sashelp.class; read all var varNames into X; close sashelp.class;

cov = Cov(X);
print cov[c=varNames r=varNames];

cov           Height       Weight          Age
Height     26.286901    102.49342    6.2099415
Weight     102.49342    518.65205    25.185673
Age        6.2099415    25.185673    2.2280702

Computing the Pearson correlation matrix requires the same steps, but also that the columns of the centered data matrix be scaled to have unit standard deviation. SAS/IML software already has a built-in CORR function, so it is not necessary to define a Corr module, but it is nevertheless instructive to see how such a module might be written:

start MyCorr(A);
   n = nrow(A);                   /** assume no missing values     **/
   C = A - A[:,];                 /** center the data              **/
   stdCol = sqrt(C[##,] / (n-1)); /** std deviation of columns     **/
   stdC = C / stdCol;             /** assume data are not constant **/
   return( (stdC` * stdC) / (n-1) );
finish;

corr = MyCorr(X);
print corr[c=varNames r=varNames];

corr          Height       Weight          Age
Height             1    0.8777852    0.8114343
Weight     0.8777852            1    0.7408855
Age        0.8114343    0.7408855            1

You should use the built-in CORR function instead of the previous module, because the built-in function handles the case of constant data.

Computation of the Covariance and Correlation Matrix in PROC IML (post-9.2)

In November 2010, SAS released the 9.22 (pronounced "nine point twenty-two") release of SAS/IML software. This release includes the following:

  • a built-in COV function which handles missing values in either of two ways
  • new features for the built-in CORR function including
    • handling missing values in either of two ways
    • support for different measures of correlation, including rank-based correlations
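As a quick illustration, assuming the X matrix and varNames vector from the earlier examples are still defined in the PROC IML session, a minimal sketch of the built-in functions is:

cov2  = cov(X);      /** built-in COV function (SAS/IML 9.22) **/
corr2 = corr(X);     /** built-in CORR function; Pearson by default **/
print cov2[c=varNames r=varNames], corr2[c=varNames r=varNames];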