
March 25, 2019
 

An important concept in multivariate statistical analysis is the Mahalanobis distance. The Mahalanobis distance provides a way to measure how far away an observation is from the center of a sample while accounting for correlations in the data. The Mahalanobis distance is a good way to detect outliers in multivariate normal data. It is better than looking at the univariate z-scores of each coordinate because a multivariate outlier does not necessarily have extreme coordinate values.

The geometry of multivariate outliers

In classical statistics, a univariate outlier is an observation that is far from the sample mean. (Modern statistics use robust statistics to determine outliers; the mean is not a robust statistic.) You might assume that an observation that is extreme in every coordinate is also a multivariate outlier, and that is often true. However, the converse is not true: when variables are correlated, you can have a multivariate outlier that is not extreme in any coordinate!

The following schematic diagram gives the geometry of multivariate normal data. The middle of the diagram represents the center of a bivariate sample.

  • The orange elliptical region indicates a region that contains most of the observations. Because the variables are correlated, the ellipse is tilted relative to the coordinate axes.
  • For observations inside the ellipse, their Mahalanobis distance to the sample mean is smaller than some cutoff value. For observations outside the ellipse, their Mahalanobis distance to the sample mean is larger than the cutoff.
  • The green rectangles at the left and right indicate regions where the X1 coordinate is far from the X1 mean.
  • The blue rectangles at the top and bottom indicate regions where the X2 coordinate is far from the X2 mean.
Geometry of multivariate outliers showing the relationship between Mahalanobis distance and univariate outliers. The point 'A' has large univariate z scores but a small Mahalanobis distance. The point 'B' has a large Mahalanobis distance. Only 'B' is a multivariate outlier.

The diagram displays two observations, labeled "A" and "B":

  • The observation "A" is inside the ellipse. Therefore, the Mahalanobis distance from "A" to the center is "small." Accordingly, "A" is not identified as a multivariate outlier. However, notice that "A" is a univariate outlier for both the X1 and X2 coordinates!
  • The observation "B" is outside the ellipse. Therefore, the Mahalanobis distance from "B" to the center is relatively large. The observation is classified as a multivariate outlier. However, notice that "B" is not a univariate outlier for either the X1 or X2 coordinates; neither coordinate is far from its univariate mean.

The main point is this: An observation can be a multivariate outlier even though none of its coordinate values are extreme. It is the combination of values which makes an outlier unusual. In terms of Mahalanobis distance, the diagram illustrates that an observation "A" can have high univariate z scores but not have an extremely high Mahalanobis distance. Similarly, an observation "B" can have a higher Mahalanobis distance than "A" even though its z scores are relatively small.

Applications to real data

This article was motivated by a question from a SAS customer. In his data, one observation had a large Mahalanobis distance score but relatively small univariate z scores. Another observation had large z scores but a smaller Mahalanobis distance. He wondered how that was possible. His data contained four variables, but the following two-variable example illustrates his situation:

Geometry of multivariate outliers. The point 'A' has large univariate z scores but the Mahalanobis distance is only about 2.5. The point 'B' has a Mahalanobis distance of 5 and is a multivariate outlier.

The blue markers were simulated from a bivariate normal distribution with μ = (0, 0) and covariance matrix Σ = {16 32.4, 32.4 81}. The red markers were added manually. The observation marked 'B' is a multivariate outlier. The Mahalanobis distance (MD) from 'B' to the center of the sample is about 5 units. (The center is approximately at (0,0).) In contrast, the observation marked 'A' is not a multivariate outlier even though it has extreme values for both coordinates. In fact, the MD from 'A' to the center of the sample is about 2.5, or approximately half the MD of 'B'. The coordinates (x1, x2), standardized coordinates (z1, z2), and MD for both points are shown below:
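The numbers in the preceding paragraph are easy to reproduce. The following SAS/IML statements are a small sketch of the computation. The coordinates of 'A' and 'B' are hypothetical values that I chose for illustration (the exact coordinates are not listed in this article), and for simplicity the computation uses the population mean and covariance from the simulation rather than the sample estimates:

proc iml;
mu    = {0 0};                        /* center used in the simulation */
Sigma = {16 32.4, 32.4 81};           /* covariance used in the simulation */
pts   = {9.6  22,                     /* 'A': extreme in both coordinates (hypothetical) */
         -6.2 6.2};                   /* 'B': modest coordinates (hypothetical) */
SigmaInv = inv(Sigma);
MD = j(nrow(pts), 1, .);
do i = 1 to nrow(pts);
   v = pts[i,] - mu;                  /* deviation from the center */
   MD[i] = sqrt( v * SigmaInv * v` ); /* Mahalanobis distance */
end;
z = pts / repeat({4 9}, nrow(pts));   /* univariate z-scores; the std devs are 4 and 9 */
print pts[c={"x1" "x2"} r={"A" "B"}] z[c={"z1" "z2"}] MD;
quit;

For these hypothetical coordinates, 'A' has z-scores near 2.4 but an MD of about 2.5, whereas 'B' has z-scores smaller than 1.6 in magnitude but an MD of about 5.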

You can connect the Mahalanobis distance to the probability of a multivariate random normal variable. The squared MD for multivariate normal data is distributed according to a chi-square distribution. For bivariate normal data, the probability that an observation is within t MD units of the center is 1 - exp(-t²/2). Observations like 'A' are not highly unusual. Observations that have MD ≥ 2.5 occur in exp(-2.5²/2) ≈ 4.4% of random variates from the bivariate normal distribution. In contrast, observations like 'B' are extremely rare. Observations that have MD ≥ 5 occur with probability exp(-5²/2) ≈ 3.73E-6. Yes, if you measure in Euclidean distance, 'A' is farther from the center than 'B' is, but the correlation between the variables makes 'A' much more probable. The Mahalanobis distance incorporates the correlation into the calculation of "distance."
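As a check of that arithmetic, the following DATA step (a small sketch, not part of the original analysis) evaluates the tail probability in two ways: directly from the formula and from the chi-square CDF with two degrees of freedom:

data MDProb;
   do t = 2.5, 5;                           /* Mahalanobis distances of interest */
      pFormula = exp(-t**2/2);              /* P(MD >= t) for bivariate normal data */
      pChiSq   = 1 - cdf("ChiSq", t**2, 2); /* same probability because MD**2 ~ chi-square(2) */
      output;
   end;
run;

proc print data=MDProb noobs; run;

Both columns agree: about 0.044 for t=2.5 and about 3.73E-6 for t=5.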

Summary and further reading

In summary, things are not always as they appear. For univariate data, an outlier is an extreme observation. It is far from the center of the data. In higher dimensions, we need to account for correlations among variables when we measure distance. The Mahalanobis distance does that, and the examples in this post show that an observation can be "far from the center" (as measured by the Mahalanobis distance) even if none of its individual coordinates are extreme.

You can download the SAS program that generates the examples and images in this article.

The post The geometry of multivariate versus univariate outliers appeared first on The DO Loop.

March 20, 2019
 

An analyst was using SAS to analyze some data from an experiment. He noticed that the response variable is always positive (such as volume, size, or weight), but his statistical model predicts some negative responses. He posted the data and asked if it is possible to modify the graph so that only positive responses are displayed.

This article shows how you can truncate a surface or a contour plot so that negative values are not displayed. You could do something similar to truncate unreasonably high values in a surface plot.

Why does the model predict negative values?

Before showing how to truncate the surface plot, let's figure out why the model predicts negative values when all the observed responses are positive. The following DATA step is a simplified version of the real data. The RSREG procedure uses least squares regression to fit a quadratic response surface. If you use the PLOTS=SURFACE option, the procedure automatically displays a contour plot and surface plot for the predicted response:

data Sample;
input X Y Z @@;
datalines;
10 90 22  22 76 13  22 75  7 
24 78 14  24 76 10  25 63  5
26 62 10  26 94 20  26 63 15
27 94 16  27 95 14  29 66  7
30 69  8  30 74  8
;
 
ods graphics / width=400px height=400px ANTIALIASMAX=10000;
proc rsreg data=Sample plots=surface(fill=pred overlaypairs);
   model Z = Y X;
run;
 
proc rsreg data=Sample plots=surface(3d fill=Pred gridsize=80);
   model Z = Y X;
   ods select Surface;
   ods output Surface=Surface; /* use ODS OUTPUT to save surface data to a data set */
run;

The contour plot overlays a scatter plot of the data. You can see that the data are observed only in the upper-right portion of the plot (the red regions) and that no data are in the lower-left portion of the plot. The RSREG procedure fits a quadratic model to the data. The predicted values near the observed data are all positive. Some of the predicted values that are far from the observed data are negative.

I previously wrote about this phenomenon and showed how to compute the convex hull for these bivariate data. When you evaluate the model inside the convex hull, you are interpolating. When you evaluate the model outside the convex hull, you are extrapolating. It is well known that polynomial regression models can give nonsensical results if you extrapolate far from the data.

Truncating a response surface

The RSREG procedure is not aware that the response variable should be positive. A quadratic surface will eventually get arbitrarily big in the positive and/or negative directions. You can see this on the contour and surface plots, which show the predictions of the model on a regular grid of (X, Y) values.

If you want to display only the positive portion of the prediction surface, you can replace each negative predicted value with a missing value. The first step is to obtain the predicted values on a regular grid. You can use the "missing value trick" to score the quadratic model on a grid, or you can use the ODS OUTPUT statement to obtain the gridded values that are used in the surface plot. I chose the latter option. In the previous section, I used the ODS OUTPUT statement to write the gridded predicted values for the surface plot to a SAS data set named Surface.
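For completeness, the following statements are a rough sketch of the first option, the "missing value trick." The grid increments are arbitrary choices; this is not the code that was used for the graphs in this article. Observations that have a missing response do not affect the fit, but PROC RSREG scores them and writes the predictions to the OUT= data set:

data Grid;
   do X = 10 to 30 by 0.5;
      do Y = 62 to 95 by 0.5;
         Z = .;                     /* missing response ==> scored, not fitted */
         output;
      end;
   end;
run;

data Combined;
   set Sample Grid;                 /* original data followed by the scoring grid */
run;

proc rsreg data=Combined out=RSOut;
   model Z = Y X / predict;         /* write predicted values to the RSOut data set */
run;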

As Warren Kuhfeld points out in his article about processing ODS OUTPUT data sets, the names in an ODS data object can be "long and hard to type." Therefore, I rename the variables. I also combine the gridded values with the original data so that I can optionally overlay the data and the predicted values.

/* rename vars and set negative responses to missing */
data Surf2;
set Surface(rename=(
       Predicted0_1_0_0 = Pred  /* rename the long ODS names */
       Factor1_0_1_0_0  = GY    /* 'G' for 'gridded' */
       Factor2_0_1_0_0  = GX))  
    Sample(in=theData);         /* combine with original data */
if theData then Type = "Data   ";
else            Type = "Gridded";
if Pred < 0 then Pred = .;      /* replace negative predictions with missing values */
label GX = 'X'  GY = 'Y';
run;

You can use the Graph Template Language (GTL) to generate graphs that are similar to those produced by PROC RSREG. You can then use PROC SGRENDER to create the graph. Because the negative response values were set to missing, the contour plot displays a missing value color (black, for this ODS style) in the lower-left and upper-right portions of the plot. Similarly, the missing values cause the surface plot to be truncated. By using the GRIDSIZE= option, you can make the jagged edges small.

Notice that the colors in the graphs are now based on the range [0, 50], whereas previously the colors were based on the range [-60, 50]. I've added a continuous legend to the plots so that the range of the response variable is obvious.
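If you have not used GTL before, the following template is a minimal sketch of the approach; the template name and options are my own choices, not the ones that were used to create the graphs in this article:

proc template;
define statgraph TruncatedContour;
begingraph;
   entrytitle "Truncated Response Surface";
   layout overlay;
      /* gridded predictions; missing values of Pred are displayed in the missing-value color */
      contourplotparm x=GX y=GY z=Pred / contourtype=fill name="c";
      scatterplot x=X y=Y;                    /* optionally overlay the original data */
      continuouslegend "c" / title="Pred";
   endlayout;
endgraph;
end;
run;

proc sgrender data=Surf2 template=TruncatedContour;
run;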

I'd like to stress that sometimes "nonsensical values" indicate an inappropriate model. If you notice nonsensical values, you should always ask yourself why the model is predicting those values. You shouldn't modify the prediction surface without a good reason. But if you do have a good reason, the techniques in this article should help you.

You can download the complete SAS program that analyzes the data and generates the truncated graphs.

The post Truncate response surfaces appeared first on The DO Loop.

March 18, 2019
 

Statisticians often emphasize the dangers of extrapolating from a univariate regression model. A common exercise in introductory statistics is to ask students to compute a model of population growth and predict the population far in the future. The students learn that extrapolating from a model can result in a nonsensical prediction, such as trillions of people or a negative number of people! The lesson is that you should be careful when you evaluate a model far beyond the range of the training data.

The same dangers exist for multivariate regression models, but they are emphasized less often. Perhaps the reason is that it is much harder to know when you are extrapolating a multivariate model. Interpolation occurs when you evaluate the model inside the convex hull of the training data. Anything else is an extrapolation. In particular, you might be extrapolating even if you score the model at a point inside the bounding box of the training data. This differs from the univariate case in which the convex hull equals the bounding box (range) of the data. In general, the convex hull of a set of points is smaller than the bounding box.

You can use a bivariate example to illustrate the difference between the convex hull of the data and the bounding box for the data, which is the rectangle [Xmin, Xmax] x [Ymin, Ymax]. The following SAS DATA step defines two explanatory variables (X and Y) and one response variable (Z). The SGPLOT procedure shows the distribution of the (X, Y) variables and colors each marker according to the response value:

data Sample;
input X Y Z @@;
datalines;
10 90 22  22 76 13  22 75  7 
24 78 14  24 76 10  25 63  5
26 62 10  26 94 20  26 63 15
27 94 16  27 95 14  29 66  7
30 69  8  30 74  8
;
 
title "Response Values for Bivariate Data";
proc sgplot data=Sample;
scatter x=x y=y / markerattrs=(size=12 symbol=CircleFilled)
        colorresponse=Z colormodel=AltThreeColorRamp;
xaxis grid; yaxis grid;
run;
Bivariate data. The data are not uniformly distributed in the bounding box of the data.

The data are observed in a region that is approximately triangular. No observations are near the lower-left corner of the plot. If you fit a response surface to this data, it is likely that you would visualize the model by using a contour plot or a surface plot on the rectangular domain [10, 30] x [62, 95]. For such a model, predicted values near the lower-left corner are not very reliable because the corner is far from the data.

In general, you should expect less accuracy when you predict the model "outside" the data (for example, (10, 60)) as opposed to points that are "inside" the data (for example, (25, 70)). This concept is sometimes discussed in courses about the design of experiments. For a nice exposition, see the course notes of Professor Rafi Haftka (2012, p. 49–59) at the University of Florida.

The convex hull of bivariate points

You can use SAS to visualize the convex hull of the bivariate observations. The convex hull is the smallest convex set that contains the observations. The SAS/IML language supports the CVEXHULL function, which computes the convex hull for a set of planar points.

You can represent the points by using an N x 2 matrix, where each row is a 2-D point. When you call the CVEXHULL function, you obtain a vector of N integers. The first few integers are positive and represent the rows of the matrix that comprise the convex hull. The absolute values of the negative integers represent the rows that are interior to the convex hull. This is illustrated for the sample data:

proc iml;
use Sample;  read all var {x y} into points;  close;
 
/* get indices of points in the convex hull in counter-clockwise order */
indices = cvexhull( points );
print (indices`)[L="indices"];  /* positive indices are boundary; negative indices are inside */

The output shows that the observation numbers (indices) that form the convex hull are {1, 6, 7, 12, 13, 14, 11}. The other observations are in the interior. You can visualize the interior and boundary points by forming a binary indicator vector that has the value 1 for points on the boundary and 0 for points in the interior. To get the indicator vector in the order of the data, you need to use the SORTNDX subroutine to compute the anti-rank of the indices, as follows:

b = (indices > 0);               /* binary indicator variable for sorted vertices */
call sortndx(ndx, abs(indices)); /* get anti-rank, which is sort index that "reverses" the order */
onBoundary = b[ndx];             /* binary indicator data in original order */
 
title "Convex Hull of Bivariate Data";
call scatter(points[,1], points[,2]) group=onBoundary 
             option="markerattrs=(size=12 symbol=CircleFilled)";
Convex hull: Blue points are on the convex hull and red points are in the interior

The blue points are the boundary of the convex hull whereas the red points are in the interior.

Visualize the convex hull as a polygon

You can visualize the convex hull by forming the polygon that connects the first, sixth, seventh, ..., eleventh observations. You can do this manually by using the POLYGON statement in PROC SGPLOT, which I show in the Appendix section. However, there is an easier way to visualize the convex hull. I previously wrote about SAS/IML packages and showed how to install the polygon package. The polygon package contains a module called PolyDraw, which enables you to draw polygons and overlay a scatter plot.

The following SAS/IML statements extract the positive indices and use them to get the points on the boundary of the convex hull. If the polygon package is installed, you can load the polygon package and visualize the convex hull and data:

hullNdx = indices[loc(b)];         /* get positive indices */
convexHull = points[hullNdx, ];    /* extract the convex hull, in CC order */
 
/* In SAS/IML 14.1, you can use the polygon package to visualize the convex hull:
   https://blogs.sas.com/content/iml/2016/04/27/packages-share-sas-iml-programs-html */
package load polygon;    /* assumes package is installed */
run PolyDraw(convexHull, points||onBoundary) grid={x y}
    markerattrs="size=12 symbol=CircleFilled";

The graph shows the convex hull of the data. You can see that it primarily occupies the upper-right portion of the rectangle. The convex hull shows the interpolation region for regression models. If you evaluate a model outside the convex hull, you are extrapolating. In particular, even though points in the lower left corner of the plot are within the bounding box of the data, they are far from the data.

Of course, if you have 5, 10 or 100 explanatory variables, you will not be able to visualize the convex hull of the data. Nevertheless, the same lesson applies. Namely, when you evaluate the model inside the bounding box of the data, you might be extrapolating rather than interpolating. Just as in the univariate case, the model might predict nonsensical data when you extrapolate far from the data.

Appendix

Packages are supported in SAS/IML 14.1. If you are running an earlier version of SAS, you can create the same graph by writing the polygon data and the binary indicator variable to a SAS data set, as follows:

hullNdx = indices[loc(b)];         /* get positive indices */
convexHull = points[hullNdx, ];    /* extract the convex hull, in CC order */
 
/* Write the data and polygon to SAS data sets. Use the POLYGON statement in PROC SGPLOT. */
p = points || onBoundary;       
poly = j(nrow(convexHull), 1, 1) || convexHull;
create TheData from p[colname={x y "onBoundary"}]; append from p;    close;
create Hull from poly[colname={ID cX cY}];         append from poly; close;
quit;
 
data All; set TheData Hull; run;    /* combine the data and convex hull polygon */
 
proc sgplot data=All noautolegend;
   polygon x=cX y=cY ID=id / fill;
   scatter x=x y=y / group=onBoundary markerattrs=(size=12 symbol=CircleFilled); 
   xaxis grid; yaxis grid;
run;

The resulting graph is similar to the one produced by the PolyDraw modules and is not shown.

The post Interpolation vs extrapolation: the convex hull of multivariate data appeared first on The DO Loop.

March 6, 2019
 

A previous article shows how to use a scatter plot to visualize the average SAT scores for all high schools in North Carolina. The schools are grouped by school districts and ranked according to the median value of the schools in the district. For the school districts that have many schools, the markers might overlap, which makes it difficult to visualize the distribution of scores. This is a general problem with using dot plots. An alternative visualization is to plot a box plot for each school district, which is described in today's article.

Box plots are not for everyone

Box plots (also called box-and-whisker plots) are used by statisticians to provide a schematic visualization of the distribution of some quantity. The previous article was written for non-statisticians, so I did not include any box plots. To understand a box plot, the reader needs to know how to interpret the box and whiskers:

  • The lower and upper edges of the box correspond to the 25th and 75th percentiles of the data, respectively.
  • The line in the middle of the box indicates the median score.
  • A diamond marker indicates the mean score.
  • The standard box plot extends the whiskers to observations that are within a distance D from the box, where D is 1.5 times the interquartile range. Other kinds of box plots are sometimes used. The SAS documentation explains the parts of a box plot.

Box plots require a certain level of comfort with statistical ideas. Nevertheless, for a statistical audience, box plots provide a compact way to compare dozens or hundreds of distributions.

The BOXPLOT procedure: An alternative way to display many box plots

I almost always use the SGPLOT procedure to create box plots, but today I'm going to demonstrate the BOXPLOT procedure. The BOXPLOT procedure is from the days before ODS graphics, but it has several nice features, including the following:

  1. When grouping box plots into categories, you can add headers, category dividers, and colors to help distinguish the categories. I demonstrated these so-called "nested box plots" in a previous blog post.
  2. When you want to display statistics about each box plot as a table inside the graph, the BOXPLOT procedure provides a simple syntax. You can use PROC SGPLOT to add a statistical table to a box plot, but you need to pre-compute the statistics and merge the statistics and the data.
  3. When you are plotting dozens or hundreds of box plots, the BOXPLOT procedure automatically splits the graph into a series of panels.

The second and third features are both useful for visualizing the SAT data for public high schools in NC.

Add a statistical table to a graph that contains many box plots

You can use the INSETGROUP statement in PROC BOXPLOT to specify statistics that you want to display under each box plot. For example, the following syntax displays the number of high schools in each district and the median of the schools' SAT scores. The WHERE clause filters the data so that the graph shows only the largest school districts (those with seven or more high schools).

ods graphics / width=700px height=480px;
title "Average SAT Scores for Large NC School Districts";
proc boxplot data=SATSortMerge;
   where _FREQ_ >= 7;      /* restrict to large school districts */
   plot Total*DistrictAbbr / grid odstitle=title nohlabel boxstyle=schematicID vaxis=800 to 1450 by 50;
   insetgroup Q2 N;
run;

The graph shows schematic box plots for 18 large school districts. The districts are sorted according to the median value of the schools' SAT scores. The INSETGROUP statement creates a table inside the graph. The table shows the number of schools in each district and gives the median score for the district. The INSETGROUP statement can display many other statistics such as the mean, standard deviation, minimum value, and maximum value for each district.

Automatically create panels of box plots

One of the coolest features of PROC BOXPLOT is that it will automatically create a panel of box plots. It is difficult to visualize all 115 NC school districts in a single graph. The graph would be very wide (or tall) and the labels for the school districts would potentially collide. However, PROC BOXPLOT will split the display into a panel, which is extremely convenient if you plan to print the graphs on a piece of paper.

For example, the following call to PROC BOXPLOT results in box plots for 115 school districts. The procedure splits these box plots across a panel that contains five graphs and plots 23 box plots in each graph. Notice that I do not have to specify the number of graphs: the procedure uses the data to make an intelligent decision. To save space in this blog post, I omit three of the graphs and only show the first and last graphs:

ods graphics / width=640px height=400px;
title "Average SAT Scores for NC School Districts";
proc boxplot data=SATSortMerge;
   plot Total*DistrictAbbr / grid odstitle=title nohlabel boxstyle=schematicID vaxis=800 to 1450 by 50;
run;

Because the districts are ordered by the median SAT score, the first plot shows the school districts with high SAT scores and the last plot shows districts with lower SAT scores. Districts that have only one school are shown as a diamond (the mean value) with a line through it (the median value). Districts that have two or three schools are shown as a box without whiskers. For larger school districts, the box plots show a schematic representation of the distribution of the schools' SAT scores.

Summary

In summary, PROC BOXPLOT has several useful features for plotting many box plots. This article shows that you can use the INSETGROUP statement to easily add a table of descriptive statistics to the graph. The procedure also automatically creates a panel of graphs so that you can more easily look at dozens or hundreds of box plots.

You can download the SAS program (NCSATBoxplots.sas) that creates the data and the graphs.

The post Use PROC BOXPLOT to display hundreds of box plots appeared first on The DO Loop.

March 4, 2019
 

Standardized tests like the SAT and ACT can cause stress for both high school students and their parents, but according to a Wall Street Journal article, the SAT and ACT "provide an invaluable measure of how students are likely to perform in college and beyond." Naturally, students wonder how their individual scores compare with others at their high school, their school district, or their state. I addressed some of those questions in a previous article that visualizes national ACT scores among college-bound students.

The average test scores at a school (or in a school system) are also scrutinized. Administrators and state legislators look at standardized scores to help assess the effectiveness of school districts, principals, and teachers. Parents might use scores to help them decide whether to send their child to the local public school, a charter school, or a private school.

A recent article in the Raleigh News and Observer links to a website of the NC Department of Public Instruction that contains the full set of SAT scores for all 496 public schools in NC. The table is great if you want to find the scores for a particular school, but is not very illuminating if you want an overview of how SAT scores vary at schools across the state. Thus, I decided to visualize the data for all NC schools by using SAS.

The data are only for public schools. You can download the data and the SAS program that I used to create the graphs in this article.

The distribution of total SAT scores

The SAT score contains two components, a score on a math test and a score on a reading and writing test (sometimes referred to as the "English" score). Colleges look at the component scores and the sum of the scores (the "total" score). The following histogram shows the distribution of the average total SAT score for schools in North Carolina:
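A histogram like this requires only a few statements in PROC SGPLOT. The following sketch assumes a data set named SAT that contains one observation per school and a variable named Total; these names are assumptions and might differ from the names in the downloadable program:

title "Average Total SAT Scores for NC High Schools";
proc sgplot data=SAT;
   histogram Total / binwidth=25;     /* the bin width is an arbitrary choice */
   density Total / type=kernel;       /* optional: overlay a kernel density estimate */
   xaxis label="Average Total SAT Score (per school)";
run;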

From this graph, you can determine several facts about the data:

  1. For most NC schools, the average SAT score is about 1100.
  2. About 73% of NC schools have an average SAT score between 1000 and 1200.
  3. There are a few schools that have much higher scores than the others. Those schools are Early College At Guilford (Total=1442), Raleigh Charter High School (Total=1356), and East Chapel Hill High (Total=1290).

Visualize the SAT math and English scores

If you want to compare the distributions of the math and English SAT scores, you can create a comparative histogram, as shown below:

In the comparative histogram, the average English scores for schools are shown in the top row; the math scores are in the bottom row. The English scores are generally higher, with a median value of 543. The median math score is about 15 points lower, with a value of 528.

Of course, the measurements in these two histograms are not independent. In reality, the math and English scores are paired, since every student takes both the math and English tests. Therefore, it makes sense to plot the joint distribution of the scores in a scatter plot, as follows:

In this plot, each school is represented by a marker. The math and English scores determine the placement of the marker. You can see that the math and English scores are highly correlated and linearly related. Schools with high (respectively, low) English scores tend to have high (respectively, low) math scores.

I used SAS to add tool tips to the graph. When I hover the cursor over a marker, the graph displays information about the school and the average test scores. That technique makes it easy to identify the outliers. For example, the image above shows the tool-tip information for the Raleigh Charter High School, which has exceptionally high SAT scores.
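To add tool tips in SAS, enable the HTML image map and use the TIP= option on the SCATTER statement. The following sketch uses assumed data set and variable names (SAT, School, District, Charter, and so forth), which might differ from those in the downloadable program:

ods graphics / imagemap=on;           /* enable tool tips for HTML output */
title "Math vs English SAT Scores for NC High Schools";
proc sgplot data=SAT;
   scatter x=English y=Math / group=Charter
           markerattrs=(symbol=CircleFilled size=9)
           tip=(School District Total English Math);  /* columns displayed on hover */
   xaxis grid; yaxis grid;
run;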

And speaking of charter schools, legislators and school boards sometimes discuss the merits and disadvantages of using public tax dollars to establish and run charter schools. I have used red triangles to indicate the 39 charter schools in NC, which represent about 8% of the total schools. Overall, the distribution of SAT scores among the charter schools appears to be similar to the non-charter schools. I also created a comparative histogram of the charter/non-charter schools (not shown), which shows that the overall distribution is similar.

SAT scores for all NC public high schools

The previous graphs show the distribution of SAT scores for all public NC high schools, but do not enable you to easily compare different school districts. One way to compare school districts is to rank them according to the median scores for the schools within the district. You can create a graph that shows the school scores for each district. Because there are 115 school districts in NC, this graph will be very tall or wide. Nevertheless, it can be useful to rank the school districts and simultaneously see the distribution of scores for each school. The following plot displays the top 75 school districts, which collectively contain 385 high schools:

The top school district is Chapel-Hill/Carrboro, which contains three high-performing high schools. The next few school districts are also small, having three or fewer high schools. Wake County, the second-largest school district in the state, is ninth in terms of the median SAT scores, but you can see considerable variation among the schools in the district. Other large school districts (Charter schools, Guilford, and Charlotte/Mecklenburg) show similar variation in scores.

Summary

By using data visualization, you can better understand the distribution of SAT scores across public high schools in NC. The graphs in this article enable you to see that there are several high-performing schools, lots of schools in the middle, and a few low-performing schools. Similarly, you can compare the test scores by school districts. The school-district graph enables you to see the variation of scores of schools within a district. Lastly, when you create these graphs in SAS, you can add tool tips that help you identify the individual schools within each school district.

In my next article, I will describe an alternative visualization that displays a box plot for each school district.

Download the data and the SAS program (NCSAT.sas) for this article.

The post Visualize SAT scores in North Carolina appeared first on The DO Loop.

February 20, 2019
 

Last year I published a series of blog posts about how to create a calibration plot in SAS. A calibration plot is a way to assess the goodness of fit for a logistic model. It is a diagnostic graph that enables you to qualitatively compare a model's predicted probability of an event to the empirical probability. I am happy to report that in SAS/STAT 15.1 (SAS 9.4M6), you can create a calibration plot automatically by using the PLOTS=CALIBRATION option on the PROC LOGISTIC statement.

Calibration plots for a model of a binary response

To demonstrate how to create a calibration plot by using PROC LOGISTIC, consider the simulated data that I analyzed in "Calibration plots in SAS." The data contain a binary response variable, Y, which depends quadratically on a uniformly distributed explanatory variable, X.
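If you do not have the data from that article, the following DATA step is a hedged sketch of how similar data can be simulated. The coefficients and sample size are arbitrary choices for illustration, not the values that were used to create the original LogiSim data:

data LogiSim;
   call streaminit(54321);
   do i = 1 to 500;
      x = -3 + 6*rand("Uniform");         /* X ~ U(-3, 3) */
      eta = -1 + 0.5*x + 0.8*x*x;         /* quadratic linear predictor (assumed coefficients) */
      p = logistic(eta);                  /* p = 1 / (1 + exp(-eta)) */
      y = rand("Bernoulli", p);           /* binary response */
      output;
   end;
   drop i eta p;
run;

The following call to PROC LOGISTIC fits a quadratic model to the data. The new GOF option requests an extensive set of goodness-of-fit statistics, and the PLOTS=CALIBRATION option requests a calibration plot: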

/* NEW in SAS/STAT 15.1 (SAS 9.4M6): PLOTS=CALIBRATION option in PROC LOGISTIC */
title "Calibration Plot for a Quadratic Model";
title2 "Created by PROC LOGISTIC";
proc logistic data=LogiSim plots=calibration(CLM ShowObs);
   model y(Event='1') = x x*x / GOF;      /* New in 15.1: More goodness-of-fit statistics */
run;
Calibration plot for a quadratic logistic model, created by PROC LOGISTIC in SAS

The calibration plot is shown. The plot contains a gray diagonal line, which represents perfect calibration. If most of the predicted responses agree with the observed responses, then the blue curve should be close to the diagonal line. That is the case in this example. The light blue band is a 95% confidence region for the loess fit and is created by using the CLM option.

Because I used the SHOWOBS option, the calibration plot displays tiny histograms along the top and bottom of the plot. The histograms indicate the distribution of the Y=0 and Y=1 responses. The article "Use a fringe plot to visualize binary data in logistic models" explains more about how fringe plots can add insight to graphs that involve a binary response variable.

The lower right corner of the calibration plot contains one of the many goodness-of-fit statistics that are computed when you use the GOF option on the MODEL statement. A small p-value would indicate a lack of fit. In this case, there is no reason to suspect a lack of fit. The following table shows other goodness-of-fit tests. None of the p-values are small, so none of the tests indicate lack of fit.

Goodness-of-fit statistics for a quadratic logistic model, created by PROC LOGISTIC in SAS

Calibration plots for a polytomous response

An exciting feature of the calibration plots in PROC LOGISTIC is that you can use them for a polytomous response model. Derr (2013) fits a proportional odds model that predicts the probability of the severity of black-lung disease from the length of exposure to coal dust in 371 coal miners. The response variable, Severity, has the levels 'Severe', 'Moderate', and 'Normal'. The following statements create the data, fit the model, and request calibration plots for the model.

/* Data, from McCullagh and Nelder (1989, p. 179), used in Derr (2013, p. 8-10).
   The severity of pneumoconiosis (black lung disease) in coal miners
   and the number of years of exposure.
*/
data Coal; 
input Severity $ @@; 
do i=1 to 8; 
   input Exposure freq @@; 
   log10Exposure=log10(Exposure); 
   output; 
end; 
datalines; 
Normal   5.8 98 15 51 21.5 34 27.5 35 33.5 32 39.5 23 46 12 51.5 4 
Moderate 5.8  0 15  2 21.5  6 27.5  5 33.5 10 39.5  7 46  6 51.5 2 
Severe   5.8  0 15  1 21.5  3 27.5  8 33.5  9 39.5  8 46 10 51.5 5 
;
 
title 'Severity of Black Lung vs Log10(Years Exposure)';
proc logistic data=Coal rorder=data plots=Calibration(CLM);
   freq freq; 
   model Severity(descending) = log10Exposure; 
   effectplot / noobs individual;
run;
Panel of calibration plots for a polytomous proportional-odd model, created by PROC LOGISTIC in SAS

Derr (2013) discusses the results of the analysis, which are not shown here. I've displayed only the calibration plot for the model. Notice that PROC LOGISTIC creates a panel of three calibration plots, one for each response level. The calibration curves all lie close to the diagonal, so the diagnostic plots do not indicate a lack of calibration for any part of the model.

Summary

In summary, the PLOTS=CALIBRATION option in SAS/STAT 15.1 enables you to automatically create a calibration plot. The calibration plot is a diagnostic plot that qualitatively compares a model's predicted and empirical probabilities. You can use the PLOTS=CALIBRATION option on the PROC LOGISTIC statement to create a calibration plot. The CALIBRATION option supports several suboptions, which you can read about in the documentation for the PROC LOGISTIC statement.

You can download the SAS code used in this article, which includes SAS code that demonstrates how to create a calibration plot manually.

The post An easier way to create a calibration plot in SAS appeared first on The DO Loop.

February 18, 2019
 
Maybe if we think and wish and hope and pray
It might come true.
Oh, wouldn't it be nice?

The Beach Boys

Months ago, I wrote about how to use the EFFECT statement in SAS to perform regression with restricted cubic splines. This is the modern way to use splines in a regression analysis in SAS, and it replaces the need to use older macros such as Frank Harrell's %RCSPLINE macro. I shared my blog post with a colleague at SAS and mentioned that the process could be simplified. In order to specify the placement of the knots as suggested by Harrell (Regression Modeling Strategies, 2010 and 2015), I had to use PROC UNIVARIATE to get the percentiles of the explanatory variable. "Wouldn't it be nice," I said, "if the EFFECT statement could perform that computation automatically?"

I am happy to report that the 15.1 release of SAS/STAT (SAS 9.4M6) includes a new option that makes it easy to place internal knots at percentiles of the data. You can now use the KNOTMETHOD=PERCENTILELIST option on the EFFECT statement to place knots. For example, the following statement places five internal knots at percentiles that are recommended in Harrell's book:
EFFECT spl = spline(x / knotmethod=percentilelist(5 27.5 50 72.5 95));

An example of using restricted cubic in regression in SAS

Restricted cubic splines are also called "natural cubic splines." This section shows how to perform a regression fit by using restricted cubic splines in SAS.

For the example, I use the same Sashelp.Cars data that I used in the previous article. For clarity, the following SAS DATA step renames the Weight and MPG_City variables to X and Y, respectively. If you want to graph the regression curve, you can sort the data by the X variable, but this step is not required to perform the regression.

/* create (X,Y) data from the Sashelp.Cars data. Sort by X for easy graphing. */
data Have;
   set sashelp.cars;
   rename mpg_city = Y  weight = X  model = ID;
run;
 
proc sort data=Have;  by X;  run;

The following call to PROC GLMSELECT includes an EFFECT statement that generates a natural cubic spline basis using internal knots placed at specified percentiles of the data. The MODEL statement fits the regression model and the OUTPUT statement writes an output data set that contains the predicted values. The SGPLOT procedure displays a graph of the regression curve overlaid on the data:

/* fit data by using restricted cubic splines using SAS/STAT 15.1 (SAS 9.4M6) */
ods select ANOVA ParameterEstimates SplineKnots;
proc glmselect data=Have;
   effect spl = spline(X / details naturalcubic basis=tpf(noint)
             knotmethod=percentilelist(5 27.5 50 72.5 95));  /* new in SAS/STAT 15.1 (SAS 9.4M6) */
   model Y = spl / selection=none;       /* fit model by using spline effects */
   output out=SplineOut predicted=Fit;   /* output predicted values */
quit;
 
title "Restricted Cubic Spline Regression";
title2 "Five Knots Placed at Percentiles";
proc sgplot data=SplineOut noautolegend;
   scatter x=X y=Y;
   series x=X y=Fit / lineattrs=(thickness=3 color=red);
run;

In summary, the new KNOTMETHOD=PERCENTILELIST option on the EFFECT statement simplifies the process of using percentiles of a variable to place internal knots for a spline basis. The example shows knots placed at the 5th, 27.5th, 50th, 72.5th, and 95th percentiles of an explanatory variable. These heuristic values are recommended in Harrell's book. For more details about the EFFECT statement and how the location of knots affects the regression fit, see my previous article "Regression with restricted cubic splines in SAS."

You can download the complete SAS program that generates this example, which requires SAS/STAT 15.1 (SAS 9.4M6). If you have an earlier release of SAS, the program also shows how to perform the same computations by calling PROC UNIVARIATE to obtain the location of the knots.
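For reference, here is a rough sketch of that pre-15.1 approach (a sketch, not the exact code in the downloadable program): use the PCTLPTS= option in PROC UNIVARIATE to compute the percentiles of X, then copy the values into the KNOTMETHOD=LIST option:

proc univariate data=Have noprint;
   var X;
   output out=Pctls pctlpts=5 27.5 50 72.5 95 pctlpre=P_;  /* creates P_5, P_27_5, ..., P_95 */
run;

proc print data=Pctls noobs; run;

/* then use the printed values, for example:
   effect spl = spline(X / naturalcubic basis=tpf(noint)
                knotmethod=list(p5 p27 p50 p72 p95));
   where p5, ..., p95 are the numeric percentile values from the Pctls data set */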

The post An easier way to perform regression with restricted cubic splines in SAS appeared first on The DO Loop.

February 11, 2019
 

Have you ever run a regression model in SAS but later realized that you forgot to specify an important option or run some statistical test? Or maybe you intended to generate a graph that visualizes the model, but you forgot? Years ago, your only option was to modify your program and rerun it. Current versions of SAS support a less painful alternative: you can use the STORE statement in many SAS/STAT procedures to save the model to an item store. You can then use the PLM procedure to perform many post-modeling analyses, including performing hypothesis tests, showing additional statistics, visualizing the model, and scoring the model on new data. This article shows four ways to use PROC PLM to obtain results from your regression model.

What is PROC PLM?

PROC PLM enables you to analyze a generalized linear model (or a generalized linear mixed model) long after you quit the SAS/STAT procedure that fits the model. PROC PLM was released with SAS 9.22 in 2010. This article emphasizes four features of PROC PLM:

  • You can use the SCORE statement to score the model on new data.
  • You can use the EFFECTPLOT statement to visualize the model.
  • You can use the ESTIMATE, LSMEANS, SLICE, and TEST statements to estimate parameters and perform hypothesis tests.
  • You can use the SHOW statement to display statistical tables such as parameter estimates and fit statistics.

For an introduction to PROC PLM, see "Introducing PROC PLM and Postfitting Analysis for Very General Linear Models" (Tobias and Cai, 2010). The documentation for the PLM procedure includes more information and examples.

To use PROC PLM you must first use the STORE statement in a regression procedure to create an item store that summarizes the model. The following procedures support the STORE statement: GEE, GENMOD, GLIMMIX, GLM, GLMSELECT, LIFEREG, LOGISTIC, MIXED, ORTHOREG, PHREG, PROBIT, SURVEYLOGISTIC, SURVEYPHREG, and SURVEYREG.

The example in this article uses PROC LOGISTIC to analyze data about pain management in elderly patients who have neuralgia. In the example in the PROC LOGISTIC documentation, PROC LOGISTIC fits the model and performs all the post-fitting analyses and visualization. In the following program, PROC LOGISTIC fits the model and stores it to an item store named PainModel. In practice, you might want to store the model to a permanent libref (rather than WORK) so that you can access the model days or weeks later.

Data Neuralgia;
   input Treatment $ Sex $ Age Duration Pain $ @@;
   datalines;
P F 68  1 No  B M 74 16 No  P F 67 30 No  P M 66 26 Yes B F 67 28 No  B F 77 16 No
A F 71 12 No  B F 72 50 No  B F 76  9 Yes A M 71 17 Yes A F 63 27 No  A F 69 18 Yes
B F 66 12 No  A M 62 42 No  P F 64  1 Yes A F 64 17 No  P M 74  4 No  A F 72 25 No
P M 70  1 Yes B M 66 19 No  B M 59 29 No  A F 64 30 No  A M 70 28 No  A M 69  1 No
B F 78  1 No  P M 83  1 Yes B F 69 42 No  B M 75 30 Yes P M 77 29 Yes P F 79 20 Yes
A M 70 12 No  A F 69 12 No  B F 65 14 No  B M 70  1 No  B M 67 23 No  A M 76 25 Yes
P M 78 12 Yes B M 77  1 Yes B F 69 24 No  P M 66  4 Yes P F 65 29 No  P M 60 26 Yes
A M 78 15 Yes B M 75 21 Yes A F 67 11 No  P F 72 27 No  P F 70 13 Yes A M 75  6 Yes
B F 65  7 No  P F 68 27 Yes P M 68 11 Yes P M 67 17 Yes B M 70 22 No  A M 65 15 No
P F 67  1 Yes A M 67 10 No  P F 72 11 Yes A F 74  1 No  B M 80 21 Yes A F 69  3 No
;
 
title 'Logistic Model on Neuralgia';
proc logistic data=Neuralgia;
   class Sex Treatment;
   model Pain(Event='Yes')= Sex Age Duration Treatment;
   store PainModel / label='Neuralgia Study';  /* or use mylib.PainModel for permanent storage */
run;

The LOGISTIC procedure models the presence of pain based on a patient's medication (Drug A, Drug B, or placebo), gender, age, and duration of pain. After you fit the model and store it, you can use PROC PLM to perform all sorts of additional analyses, as shown in the subsequent sections.

Use PROC PLM to score new data

An important application of regression models is to predict the response variable for new data. The following DATA step defines three new patients. The first two are females who are taking Drug B. The third is a male who is taking Drug A:

/* 1.Use PLM to score future obs */
data NewPatients;
   input Treatment $ Sex $ Age Duration;
   datalines;
B F 63  5 
B F 79 16 
A M 74 12 
;
 
proc plm restore=PainModel;
   score data=NewPatients out=NewScore predicted LCLM UCLM / ilink; /* ILINK gives probabilities */
run;
 
proc print data=NewScore;
run;

The output shows the predicted pain level for the three patients. The younger woman is predicted to have a low probability (0.01) of pain. The model predicts a moderate probability of pain (0.38) for the older woman. The model predicts a 64% chance that the man will experience pain.

Notice that the PROC PLM statement does not use the original data. In fact, the procedure does not support a DATA= option but instead uses the RESTORE= option to read the item store. The PLM procedure cannot create plots or perform calculations that require the data because the data are not part of the item store.

Use PROC PLM to visualize the model

I've previously written about how to use the EFFECTPLOT statement to visualize regression models. The EFFECTPLOT statement has many options. However, because PROC PLM does not have access to the original data, the EFFECTPLOT statement in PROC PLM cannot add observations to the graphs.

Although the EFFECTPLOT statement is supported natively in the LOGISTIC and GENMOD procedures, it is not directly supported in other procedures such as GLM, MIXED, GLIMMIX, PHREG, or the SURVEY procedures. Nevertheless, because these procedures support the STORE statement, you can use the EFFECTPLOT statement in PROC PLM to visualize the models for these procedures. The following statement uses the EFFECTPLOT statement to visualize the probability of pain for female and male patients that are taking each drug treatment:

/* 2. Use PROC PLM to create an effect plot */
proc plm restore=PainModel;
   effectplot slicefit(x=Age sliceby=Treatment plotby=Sex);
run;

The graphs summarize the model. For both men and women, the probability of pain increases with age. At a given age, the probability of pain is lower for the non-placebo treatments, and the probability is slightly lower for the patients who use Drug B as compared to Drug A. These plots are shown at the mean value of the Duration variable.

Use PROC PLM to compute contrasts and other estimates

One of the main purposes of PROC PLM is to perform postfit estimates and hypothesis tests. The simplest is a pairwise comparison that estimates the difference between two levels of a classification variable. For example, in the previous graph the probability curves for the Drug A and Drug B patients are close to each other. Is there a significant difference between the two effects? The following ESTIMATE statement estimates the (B vs A) effect. The EXP option exponentiates the estimate so that you can interpret the 'Exponentiated' column as the odds ratio between the drug treatments. The CL option adds confidence limits for the estimate of the odds ratio. The confidence interval for the odds ratio contains 1, so you cannot conclude that Drug B is significantly more effective than Drug A at reducing pain.

/* 3. Use PROC PLM to create contrasts and estimates */
proc plm restore=PainModel;
   /* 'Exponentiated' column is odds ratio between treatments */
   estimate 'Pairwise B vs A' Treatment 1 -1 / exp CL;
run;

Use PROC PLM to display statistics from the analysis

One of the more useful features of PROC PLM is that you can use the SHOW statement to display tables of statistics from the original analysis. If you want to see the ParameterEstimates table again, you can do that (SHOW PARAMETERS). You can even display statistics that you did not compute originally, such as an estimate of the covariance of the parameters (SHOW COVB). Lastly, if you have the item store but have forgotten what program you used to generate the model, you can display the program (SHOW PROGRAM). The following statements demonstrate the SHOW statement. The results are not shown.

/* 4. Use PROC PLM to show statistics or the original program */
proc plm restore=PainModel;
   show Parameters COVB Program;
run;

Summary

In summary, the STORE statement in many SAS/STAT procedures enables you to store various regression models into an item store. You can use PROC PLM to perform additional postfit analyses on the model, including scoring new data, visualizing the model, hypothesis testing, and (re)displaying additional statistics. This technique is especially useful for long-running models, but it is also useful for confidential data because the data are not needed for the postfit analyses.

The post 4 reasons to use PROC PLM for linear regression models in SAS appeared first on The DO Loop.

January 21, 2019
 

In machine learning and other model building techniques, it is common to partition a large data set into three segments: training, validation, and testing.

  • Training data is used to fit each model.
  • Validation data is a random sample that is used for model selection. These data are used to choose among candidate models by balancing the tradeoff between model complexity (complex models fit the training data well) and generality (overly complex models might not fit the validation data well). These data are potentially used several times during the model-building process.
  • Test data is a hold-out sample that is used to assess the final model and estimate its prediction error. It is used only at the end of the model-building process.

I've seen many questions about how to use SAS to split data into training, validation, and testing data. (A common variation uses only training and validation.) There are basically two approaches to partitioning data:

  • Specify the proportion of observations that you want in each role. For each observation, randomly assign it to one of the three roles. The number of observations assigned to each role is then a multinomial random variable with expected value N p_k, where N is the number of observations and p_k (k = 1, 2, 3) is the probability of assigning an observation to the k_th role. For this method, if you change the random number seed you will usually get a different number of observations in each role because the group sizes are random variables.
  • Specify the number of observations that you want in each role and randomly allocate that many observations.

This article uses the SAS DATA step to accomplish the first task and uses PROC SURVEYSELECT to accomplish the second. I also discuss how to split data into only two roles: training and validation.

It is worth mentioning that many model-selection routines in SAS enable you to split data by using the PARTITION statement. Examples include the "SELECT" procedures (GLMSELECT, QUANTSELECT, HPGENSELECT...) and the ADAPTIVEREG procedure. However, be aware that the procedures might ignore observations that have missing values for the variables in the model.
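
For example, here is a minimal sketch (not part of the original post) of the PARTITION statement in PROC GLMSELECT. The response and explanatory variables from Sashelp.Heart are chosen only for illustration:

/* Sketch: let PROC GLMSELECT partition the data internally.
   The FRACTION option reserves 30% for validation and 10% for testing;
   the remaining 60% of observations are used for training. */
proc glmselect data=Sashelp.Heart seed=123 plots=none;
   model Cholesterol = Weight Height Systolic Diastolic / selection=stepwise;
   partition fraction(validate=0.3 test=0.1);
run;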

Random partition into training, validation, and testing data

When you partition data into various roles, you can choose to add an indicator variable, or you can physically create three separate data sets. The following DATA step creates an indicator variable with values "Train", "Validate", and "Test". The specified proportions are 60% training, 30% validation, and 10% testing. You can change the values of the SAS macro variables to use your own proportions. The RAND("Table") function is an efficient way to generate the indicator variable.

data Have;             /* the data to partition  */
   set Sashelp.Heart;  /* for example, use Heart data */
run;
 
/* If propTrain + propValid = 1, then no observation is assigned to testing */
%let propTrain = 0.6;         /* proportion of training data */
%let propValid = 0.3;         /* proportion of validation data */
%let propTest = %sysevalf(1 - &propTrain - &propValid); /* remaining are used for testing */
 
/* Randomly assign each observation to a role; _ROLE_ is indicator variable */
data RandOut;
   array p[2] _temporary_ (&propTrain, &propValid);
   array labels[3] $ _temporary_ ("Train", "Validate", "Test");
   set Have;
   call streaminit(123);         /* set random number seed */
   /* RAND("table") returns 1, 2, or 3 with specified probabilities */
   _k = rand("Table", of p[*]); 
   _ROLE_ = labels[_k];          /* use _ROLE_ = _k if you prefer numerical categories */
   drop _k;
run;
 
proc freq data=RandOut order=freq;
   tables _ROLE_ / nocum;
run;

As shown by the output of PROC FREQ, the proportion of observations in each role is approximately the same as the specified proportions. For this random number seed, the proportions are 59.1%, 30.4%, and 10.6%. If you change the random number seed, you will get a different assignment of observations to roles and also a different proportion of data in each role.

The observant reader will notice that the array of probabilities (p) that is used in the RAND("Table") call contains only two elements. This is intentional. The documentation for the RAND("Table") function states that if the sum of the specified probabilities is less than 1, then the quantity 1 – Σ p_i is used as the probability of the last event. Consequently, the same program works whether you partition the data into three pieces (training, validation, and testing) or into two pieces (training and validation), which happens when the first two proportions sum to 1.
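
As an illustration of that behavior, here is a minimal sketch (not in the original program) that uses the same pattern to split the data into only two roles. With a one-element probability array, RAND("Table") returns 1 with probability 0.7 and 2 otherwise:

/* Sketch: two-way split (70% training, 30% validation) using the same pattern */
data TwoWay;
   array p[1] _temporary_ (0.7);                         /* probability of the training role */
   array labels[2] $ _temporary_ ("Train", "Validate");
   set Have;
   call streaminit(123);          /* set random number seed */
   _k = rand("Table", of p[*]);   /* returns 1 or 2 */
   _ROLE_ = labels[_k];
   drop _k;
run;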

Create random training, validation, and testing data sets

Some practitioners choose to create three separate data sets instead of adding an indicator variable to the existing data. The computation is exactly the same, but you can use the OUTPUT statement to direct each observation to one of three output data sets, as follows:

/* create a separate data set for each role */
data Train Validate Test;
   array p[2] _temporary_ (&propTrain, &propValid);
   set Have;
   call streaminit(123);         /* set random number seed */
   /* RAND("table") returns 1, 2, or 3 with specified probabilities */
   _k = rand("Table", of p[*]);
   if      _k = 1 then output Train;
   else if _k = 2 then output Validate;
   else                output Test;
   drop _k;
run;
NOTE: The data set WORK.TRAIN has 3078 observations and 17 variables.
NOTE: The data set WORK.VALIDATE has 1581 observations and 17 variables.
NOTE: The data set WORK.TEST has 550 observations and 17 variables.

This example uses the same random number seed as the previous example. Consequently, the three output data sets have the same observations as are indicated by the partition variable (_ROLE_) in the previous example.

Specify the number of observations in each role

Instead of specifying a proportion, you might want to specify the exact number of observations that are randomly assigned to each role. The advantage of this technique is that changing the random number seed does not change the number of observations in each role (although it does change which observations are assigned to each role). The SURVEYSELECT procedure supports the GROUPS= option, which you can use to specify the number of observations.

The GROUPS= option requires that you specify integer values. For this example, the original data contains 5209 observations but 60% of 5209 is not an integer. Therefore, the following DATA step computes the number of observations as ROUND(N p) for the training and validation sets. These integer values are put into macro variables and used in the GROUPS= option on the PROC SURVEYSELECT statement. You can, of course, skip the DATA step and specify your own values such as groups=(3200, 1600, 409).

/* Specify the sizes of the train/validation/test data from proportions */
data _null_;
   if 0 then set sashelp.heart nobs=N;  /* N = total number of obs */
   nTrain = round(N * &propTrain);      /* size of training data */
   nValid = round(N * &propValid);      /* size of validation data */
   call symputx("nTrain", nTrain);      /* put integer into macro variable */
   call symputx("nValid", nValid);
   call symputx("nTest", N - nTrain - nValid);
run;
 
/* randomly assign observations to three groups */
proc surveyselect data=Have seed=12345 out=SSOut
     groups=(&nTrain, &nValid, &nTest); /* if no Test data, use  GROUPS=(&nTrain, &nValid) */
run;
 
proc freq data=SSOut order=freq;
   tables GroupID / nocum;           /* GroupID is name of indicator variable */
run;

The training, validation, and testing groups contain 3125, 1563, and 521 observations, respectively. These numbers are the closest integer approximations to 60%, 30% and 10% of the 5209 observations. Notice that the output from the SURVEYSELECT procedure uses the values 1, 2, and 3 for the GroupID indicator variable. You can use PROC FORMAT to associate those numbers with labels such as "Train", "Validate", and "Test".
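
For example, here is a minimal sketch (not part of the original post) that defines such a format and applies it when displaying the frequencies. The format name and labels are chosen for illustration:

/* Sketch: map the GroupID values 1, 2, and 3 to descriptive labels.
   Assumption: groups are numbered in the order given on the GROUPS= option. */
proc format;
   value RoleFmt 1="Train"  2="Validate"  3="Test";
run;
 
proc freq data=SSOut order=freq;
   format GroupID RoleFmt.;
   tables GroupID / nocum;
run;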

In summary, there are two basic programming techniques for randomly partitioning data into training, validation, and testing roles. One way uses the SAS DATA step to randomly assign each observation to a role according to proportions that you specify. If you use this technique, the size of each group is random. The other way is to use PROC SURVEYSELECT to randomly assign observations to roles. If you use this technique, you must specify the number of observations in each group.

The post Create training, validation, and test data sets in SAS appeared first on The DO Loop.