13.1

5月 302017
 

By default, when you use the SERIES statement in PROC SGPLOT to create a line plot, the observations are connected (in order) by straight line segments. However, SAS 9.4m1 introduced the SMOOTHCONNECT option which, as the name implies, uses a smooth curve to connect the observations. In Sanjay Matange's blog, he shows an example where the SMOOTHCONNECT option interpolates points in a time series by using a smooth curve. In the example, the smooth curve does a good job of smoothly connecting the evenly spaced data points.

However, in a second post, Sanjay shows how the SMOOTHCONNECT option can result in curves that are less ideal. Here is a SAS program that reproduces the gist of Sanjay's example:

data Ex1;
input segment $ x y;
datalines;
A 0  0 
A 1  0
A 2 -3
A 3 -1
B 0  0 
B 1  0
B 2  2
B 3  1
;
 
proc sgplot data=Ex1;
   series x=x y=y / group=segment markers smoothconnect;
run;

As you can see, groups A and B have identical values for the first two observations. However, the remaining Y values for group A are negative, whereas the remaining Y values for group B are positive. Notice that the interpolating curve for group A rises up before it dives down. Similarly, the curve for group B dips down before it rises up.

This visualization could give the wrong impression to the viewer of the graph, especially if the markers are not displayed. We only have data for four values of X. However, the SMOOTHCONNECT option makes it appear that the graph for group A was higher than group B on the interval (0, 1). That may or may not be true. The curves also give the impression that the curve for group A began to decrease prior to x=1, whereas in fact we have no idea when the curve reached a maximum and began to decrease.


5 things to remember when using the SMOOTHCONNECT option in SGPLOT #DataViz
Click To Tweet


The SMOOTHCONNECT option can be useful for visualizing time series data, but as with any tool, it is important to know how to use it responsibly. In thinking about these issues, I compiled a list of five ways that the SMOOTHCONNECT option might give a misleading view of the data. By understanding these issues, you can use the SMOOTHCONNECT option wisely. Here are some potential pitfalls to connecting points by using a smooth curve.

  1. Before the curve goes down, it often goes up.
  2. The curve gives the impression that we know the location of peaks and valleys.
  3. The high and low points on the curve might exceed the range of the data.
  4. The curve can display quick (possibly unrealistic) changes in direction.
  5. The curve can bend backward, or even create a loop.

Sanjay's example demonstrates the first three issues. Notice that the maximum Y value for group A in the data is Y=0, but the interpolating spline exceeds that value.

To demonstrate the fourth and fifth items requires an artificial example and data that is not evenly spaced along the X variable:

data Ex2;
input x y;
datalines;
0    0 
0.95 0.95
1    1
2   -2
;
proc sgplot data=Ex2;
   series x=x y=y / markers smoothconnect;
   yaxis max=1.5;
run;

There are only four points in the data set, but the SMOOTHCONNECT curve for these data exhibits a sharp turn (in fact, a loop). I intentionally created data that would create this behavior by making the data linear for X < 2. I also made the second data point very close to the third data point so that the interpolating curve would have to gyrate widely to pass through the points before sloping down to pass through the fourth point.

Implications for data visualization

It is well known among applied mathematicians that interpolation can lead to issues like these. These issues are not caused by a bug in SAS, but by the fact that you are trying to force a smooth curve to pass through every data point. I point them out so that SAS users who are using SGPLOT can be aware of them.

If your data are evenly spaced in the horizontal direction, the first three issue are usually small and innocuous. The last two issue will not occur. Therefore, it is safe to use the SMOOTHCONNECT option for evenly spaced data.

If your data are not evenly spaced, here are a few guidelines to help you avoid wild interpolating curves:

In summary, if you use the SMOOTHCONNECT option in the SERIES statement, SAS will pass a smooth curve through the points. When the points are unevenly spaced in X, the curve might need to quickly change direction to pass through all the data. The resulting curve might not be a good representation of the data. In this case, you might need to relax the requirements of a smooth interpolating curve, either by using a piecewise linear interpolation or by using a smooth curve that is not constrained to pass through the data.

The post On the SMOOTHCONNECT option in the SERIES statement appeared first on The DO Loop.

1月 182017
 

This article shows how to solve mixed-integer linear programming (MILP) problems in SAS. In a mixed-integer problem, some of the variables in the problem are integer-valued whereas others are continuous. The objective function is a linear function of the variables and the variables can be subject to linear constraints.

Last month I discussed how to solve linear programming (LP) problems in SAS by using PROC OPTMODEL in SAS/OR or by using SAS/IML software. You can use these same tools to solve mixed-integer linear programming problems. The OPTMODEL procedure has the simpler syntax. The MILPSOLVE function in SAS/IML software provides similar functionality in a n interactive matrix language.

A mixed-integer linear programming problem example

The previous article solved a two-variable LP problem. If you constrain one of the variables to be an integer, you get a MILP problem, as follows:

  • Let x = {x1, x2} be the vector of unknown positive variables. The x1 variable is integer-valued; the x2 variable is continuous.
  • The objective function to maximize is 3*x1 + 5*x2. In vector form, the objective function is cTx where c = {3, 5}.
  • There are three linear constraints, which you can represent in matrix form. Define the 3 x 2 matrix of constraint coefficients A = {3  -2, 5  10, 4  2} and the right-hand-side vector b = {10, 56, 7}. Let LEG = {≤, ≤, ≥} be a "vector of symbols." ("LEG" is a mnemonic for "Less than," "Equal to," or "Greater than.") With this notation, the vector of constraints is written as Ax LEG b.
Feasible region for mixed-integer linear program problem  in SAS. The optimal solution is marked.

The graph shows the feasible region. The interior of the polygon satisfies the constraints. Because x1 must be an integer, the vertical stripes indicate the feasible values of the variables. The color of the stripes indicate the value of the objective function. The green star indicates the optimal solution, which is x = {5, 3.1}.

Mixed-integer linear programming by using the OPTMODEL procedure

The following statements show one way to formulate and solve the MILP problem by using the OPTMODEL procedure in SAS/OR software:

proc optmodel;
var x1 integer >= 0;           /* information about the variables */
var x2 >= 0;
max z = 3*x1 +  5*x2;         /* define the objective function */
con c1: 3*x1 + -2*x2 <= 10;   /* specify linear constraints */
con c2: 5*x1 + 10*x2 <= 56;
con c3: 4*x1 +  2*x2 >=  7;
solve with milp;              /* solve the MILP problem */
print x1 x2;
quit;
PROC O PTMODEL solution to a mixed-integer linear programming problem

The OPTMODEL procedure prints two tables. The first (not shown) describes the optimization algorithm and tells you that the optimal objective value is 30.5. The second table is the solution vector, which is x = {5, 3.1}.

Mixed-integer linear programming by using the MILPSOLVE subroutine in SAS/IML

The MILPSOLVE subroutine in the SAS/IML language uses matrices and vectors to specify the problem. The syntax for the MILPSOLVE subroutine is almost identical to the syntax for the LPSOLVE subroutine, so see the previous article for an explanation of each argument. The following SAS/IML program defines and solves the MILP problem:

proc iml;
/* information about variables (row of column, doesn't matter) */
colType = {I, C};  /*  C, B, or I for cont, binary, int */
LowerB =  {0, 0};  /* lower bound constraints on x */
UpperB = {10,10};  /* upper bound constraints on x */
 
/* objective function */
c = {3 5};        /* vector for objective function c*x */
 
/* linear constraints */
A = {3 -2,        /* matrix of constraint coefficients */
     5 10,
     4  2};
b = {10,          /* RHS of constraint eqns (column vector) */
     56,
      7};         
 
/* specify symbols for constraints:
   'L' for less than or equal
   'E' for equal
   'G' for greater than or equal */
LEG = {L, L, G}; 
 
/* control vector for optimization */
ctrl = {-1,       /* maximize objective */
         1};      /* print level */
 
CALL MILPSOLVE(rc, objVal, result, relgap, /* output variables */
               c, A, b,      /* objective and linear constraints */
               ctrl,         /* control vector */
               coltype, LEG, /*range*/, LowerB, UpperB); 
print rc objVal, result[rowname={x1 x2}];
Solution to linear programming problem by using the LPSOLVE function in SAS/IML software

In the call to MILPSOLVE, the first four arguments are output arguments. The return code (rc) is 0, which indicates that an optimal value was found. The value of the objective function at the optimal value is returned in objVal. The optimal values of the variables are returned in the result argument.

The MILPSOLVE subroutine was introduced in SAS/IML 13.1, which was shipped with SAS 9.4m1. For additional details about the MILPSOLVE subroutine, see the documentation.

tags: 13.1, Optimization

The post Solve mixed-integer linear programming problems in SAS appeared first on The DO Loop.

6月 152016
 

I have previously shown how to overlay basic plots on box plots when all plots share a common discrete X axis. It is interesting to note that box plots can also be overlaid on a continuous (interval) axis. You often need to bin the data before you create the plot.

A typical situation when you plot a time series. You might want to overlay box plots to display a summary of the response distribution within certain time intervals. For example, last week I visualized World Bank data for the average life expectancy in more than 200 countries during the years 1960–2014. Suppose that for each decade you want to draw a box plot that summarizes the response variable for all countries. You can use the DATA step (or create a data view) to create the Decade variable, as follows:

data Decade / view=decade;
set LE;
if      1960 <= Year <= 1969 then Decade=1965;
else if 1970 <= Year <= 1979 then Decade=1975;
else if 1980 <= Year <= 1989 then Decade=1985;
else if 1990 <= Year <= 1999 then Decade=1995;
else if 2000 <= Year <= 2009 then Decade=2005;
else if 2010 <= Year         then Decade=2015;
else Decade = .;   /* handle bad data */
run;

You can use the following call to PROC SGPLOT to overlay the box plot and the scatter plot of the data:

/* overlay data by year and box plots to summarize decades */
title "Distribution of Life Expectancy by Decade";
title2 "Raw Data for 207 Countries by Year";
proc sgplot data=Decade noautolegend;
   vbox Expected / category=Decade;              /* box plot */
   scatter x=year y=Expected / transparency=0.9; /* semitransparent scatter */
   xaxis type=linear display=(nolabel)           /* linear axis */
         values=(1965 to 2015 by 10)             /* optional: set ticks labels */
         valuesdisplay=('60s' '70s' '80s' '90s' '2000s' '2010s');
run;
Box plots by decade overlaid with life expectancy data by year

There are a couple of tricks that make this overlay work:

  • The discrete Decade variable uses the same scale as the continuous Year variable. In fact, the Decade value is in the middle of the continuous interval for each decade.
  • The scatter plot markers are highly transparent so that they show the individual measurements without overwhelming the display.
  • The TYPE=LINEAR option in the XAXIS statement enables you to overlay the scatter plot which has an interval X axis) and the box plot (which has discrete X values).
  • Optionally, you can specify the location of the major tick marks by using the VALUES= option in the XAXIS statement. In this example, the tick marks are set to be 1965, 1975, and so forth, which are the values of the Decade variable. To emphasize the decades, rather than particular years, you can use the VALUESDIPLAY= option to manually set the values that are displayed for each tick mark.

Combine box plots with other plots in #SAS. #DataViz
Click To Tweet


Overlay regression lines on box plots

In my previous article I showed how you can use the CONNECT= option to connect quantiles of adjacent box plots. Some people use the CONNECT= option as a poor man's version of quantile regression. However, quantile regression is more than merely connecting the sample quantiles of binned data. Quantile regression shows trends for various quantiles of the response variable.

Box plots by decade; overlay quantile regression of life expectancy

For brevity, I will not discuss how to use PROC QUANTREG to perform quantile regression on these data. However, you can download the program that performs the analysis and creates the graph.

The graph overlays quantile regression lines on the previous graph. Notice that the 90th and 75th quantiles of life expectancy are increasing at a rate of about 2 years per decade. The median life expectancy is increasing at a faster rate of 3 years per decade. The lower quantiles are increasing even faster: the 25th quantile is increasing at about 4 years per decade and the 10th quantile is increasing at about 3.6 years per decade. Notice that the 25th, 50th, and 75th quantile curves are close to (but not equal to) the corresponding features of the aggregated box plots.

I think that this graph provides an excellent visualization of trends in life expectancy around the world, especially if combined with spaghetti plots or lasagna plots. The countries at the top of the plot have good sanitation, health care, and nutrition. Consequently, their life expectancy increases at a slower rate than countries that have poorer sanitation and nutrition. Small improvements in the poorer countries can make a big difference in the life expectancy of their population.

Summary

In summary, you can overlay continuous plots and box plots in SAS 9.4m1. You need to ensure that the categories of the box plots are on the same scale as the continuous X variable. You need to set the TYPE=LINEAR option on the XAXIS statement. Lastly, you might want to use partial transparency to ensure that the graph doesn't become too crowded.

tags: 13.1, Statistical Graphics

The post Overlay plots on a box plot in SAS: Continuous X axis appeared first on The DO Loop.

6月 132016
 

Box plots summarize the distribution of a continuous variable. You can display multiple box plots in a single graph by specifying a categorical variable. The resulting graph shows the distribution of subpopulations, such as different experimental groups. In the SGPLOT procedure, you can use the CATEGORY= option on the VBOX statement to generate box plots for each level of a categorical variable.

Sometimes you need to overlay additional points or lines on box plots. SAS 9.4M1 and beyond supports overlaying "basic plots" and box plots. There are many different basic plot types, including the scatter plot and the series (line) plot.

This article shows examples when the X axis is discrete. In a subsequent article I will discuss the case of a continuous X axis.


How to overlay box plots and other plots that share a common discrete X variable. #SAStip #DataViz
Click To Tweet


Connect features of adjacent box plots

The simplest way to overlay features on a set of box plots is to use the CONNECT= option on the VBOX statement. The CONNECT= option (which also requires SAS 9.4M1) enables you to overlay line segments that connect the means or selected quantiles (max, min, or quartiles) of adjacent box plots. The article by Sanjay Matange provides details about how to use the CONNECT= option. The CONNECT= option is useful when you want to visually emphasize how the mean (or quartiles) change between levels of a classification variable.

You can use a trick to add multiple CONNECT= options. For example, if you want to connect the first, second, and third quartile values, you can repeat the VBOX statement multiple times, but use the NOFILL and NOOUTLIERS options on all but the first statement. The following statement provides an example. Notice that the NOCYCLEATTRS option prevents the plot from using different colors for the different VBOX statements:

title "Fuel Efficiency by Origin";
proc sgplot data=sashelp.cars noautolegend nocycleattrs;      /* suppress cycling attributes */
vbox mpg_city / category=origin connect=median;               /* draw box and connect medians */
vbox mpg_city / category=origin nofill nooutliers connect=Q1; /* connect Q1 */
vbox mpg_city / category=origin nofill nooutliers connect=Q3; /* connect Q3 */
run;
Box plots with lines connecting the Q1, median, and Q3 quartiles

Overlay arbitrary line segments and points on box plots

In the same article, Sanjay shows how to overlay line segments that connect any precomputed quantities. For example, you can use PROC MEANS to compute statistics for each category and then use one or more SERIES statements to display line segments that connect the statistics.

In a similar way you can overlay markers that represent special observations or reference values. For example, suppose that the Sashelp.Class data set contains data about the heights and weights of a teacher's current students. When two new foreign exchange students (Ivan and Anna) are assigned to her class, she notices that these students are exceptionally tall compared to her current students. She decides to overlay the new student's heights on box plots (one for each gender) that show the height distributions of her current students. One way to do this is to create a box plot of the original data and then overlay a scatter plot of the new observations.

The following SAS program

  1. Creates a data set with the new data.
  2. Concatenates the original and the new data. To overlay the plots they should have a common X axis. In this case, the new data can share the Name, Sex, and Age variables of the original data, but you should create NEW variables that describe the heights and weights of the new students. In this way only the original observations are used to construct the box plots and only the new observations are used to overlay the scatter plot.
  3. Uses the SGPLOT procedure to overlay the box plots and the scatter plot. Use the VBOX statement with CATEGORY=Sex to create the box plot and use the SCATTER statement with X=Sex to overlay the scatter plot.
data NewStudents;
length Name $8 Sex $1;
input Name Sex Age Height Weight;
datalines;
Ivan    M 14 71.5 110 
Anna    F 13 68   105
;
 
data All;          /* concatenate: use same X variable but different Y variables */
set sashelp.class
    NewStudents(rename=(Height=NewHeight Weight=NewWeight));
run;
 
proc sgplot data=All noautolegend;
vbox height / category=sex;
scatter x=sex y=NewHeight / datalabel=Name
        markerattrs=(color=red symbol=CircleFilled size=12);
run;
Box plots with markers that indicated special values

Overlay multiple plots on box plots

It is just as easy to overlay multiple plots. Recall that plots are rendered in the order that they appear in the SGPLOT procedure. This means that plots that change the background (such as a BLOCK plot) should be specified first, and plots that are use markers should be specified last. Another option is to use the TRANSPARENCY= option judiciously so that features in earlier plots are visible even though other plots are overlaid on top.

The following example demonstrates overlaying three plots. The plot shows the distribution of the fuel economy (measured as miles per gallon) for different kinds of vehicles. A block plot in the background emphasizes that hybrid vehicles have the best fuel economy. The box plot then shows the distribution of the fuel economy for six types of vehicles. The BOXWIDTH= option is used to control the width of the box plots. Lastly, the graph displays a scatter plot of the response variable for each model of vehicle. The JITTER option is used to reduce overplotting.

data cars / view=cars;
set Sashelp.Cars;
if type="Hybrid" then FuelType = "Hybrid      ";
else FuelType = "Conventional";
run;
 
proc sgplot data=cars noautolegend;
block x=type block=FuelType / filltype=alternate 
         fillattrs=(color=LightGray)altfillattrs=(color=white);
vbox mpg_city / category=type boxwidth=0.8 nooutliers;
scatter  x=type y=mpg_city / jitter transparency=0.6 
         markerattrs=(color=red symbol=CircleFilled);
yaxis offsetmax=0.1;
run;
boxdiscrete3

The resulting graph looks complicated, but is created by using only a few statements. Again, the key is that each of the three plot types share a common, discrete, X variable.

This article provides several examples of overlaying a box plot with other plots that share a discrete X axis. In my next blog post I will discuss how you can use box plots to summarize conditional distributions when the X axis is continuous.

tags: 13.1, Statistical Graphics

The post Overlay plots on a box plot in SAS: Discrete X axis appeared first on The DO Loop.

4月 042016
 

In SAS procedures, the WHERE clause is a useful way to filter observations so that the procedure receives only a subset of the data to analyze. The IML procedure supports the WHERE clause in two separate statements.

  • On the USE statement, the WHERE clause acts as a global filter. The where clause applies to all subsequent READ statements that read from the open data set.
  • On the READ statement, the WHERE clause further filters the data. Because you can use multiple READ statements, you can easily create matrices that contain disjoint or overlapping subsets of the data. However, be aware multiple READ statements might result in SAS reading the data multiple times, depending on the syntax that you specify.

An interesting fact about the WHERE clause in SAS/IML is that you can specify run-time expressions for the WHERE clause, which makes it a very powerful tool for data analysis.


The WHERE clause in SAS/IML: filter data with run-time expressions #SAStip
Click To Tweet


The WHERE clause in the USE statement

The USE statement opens a SAS data set for read access. When you need only one filter, specify a WHERE clause on the USE statement. For example, suppose that you want a matrix that contains the age, weight, and height of all females in the Sashelp.Class data. The following program reads the female observations into the matrix X and prints the average age, weight, and height:

proc iml;
varNames = {"Age" "Weight" "Height"};
use Sashelp.Class where(sex='F');       /* restrict to females */
   read all var varNames into X;
close Sashelp.Class;
 
avg = mean(X);
print avg[L="Mean Values for Males" colname=varNames format=5.2];
whereiml1

The WHERE clause in the READ statement

You can also put the WHERE clause in the READ statement. This technique is useful if you intend to read the data several times. For example, the following program reads data for females into the X matrix and data for males into the Y matrix:

use Sashelp.Class;
   read all var varNames into X where(sex='F');
   read all var varNames into Y where(sex='M');
close Sashelp.Class;

If you use a WHERE clause on both the USE and READ statements, the SAS log will include the NOTE
NOTE: WHERE clause has been augmented
to inform you that the data filter combines both WHERE clauses by using a "logical AND" operator.

Expressions in the WHERE clause

Beginning with SAS/IML 13.1 (released with SAS 9.4m1), you can use expressions in WHERE clauses. This means that you can call the READ statement in loop. During each iteration, you can read and analyze various subsets of the data during each iteration.

For example, suppose that you have several grouping variables and you want to conduct a BY-group analysis. You can use the UNIQUEBY technique to conduct a BY-group analysis with several variables. However, the UNIQUEBY technique requires that the data be sorted and fit in RAM. It also requires a bit of "bookkeeping" because you need to keep track of indices. If you don't mind the inefficiency of reading the data multiple times, a WHERE clause approach is conceptually easier to program.

As an example, suppose that you want to analyze the MPG_City variable in the Sashelp.Cars data set for each combinations of the Origin and Type variables. To keep it simple, suppose that you want to compute the mean value of MPG_City for all pairwise combinations of the Origin and Type variables, excluding the observations for American-made vehicles. This analysis is simple by using PROC MEANS. (The output for PROC MEANS is not shown.)

proc means data=Sashelp.Cars Mean;
where Origin ^= "USA";
   class Origin Type;
   var MPG_City;
run;

In PROC IML, this computation requires looping over the valid combinations of Origin and Type. To make the analysis simpler, the following call to PROC FREQ writes the valid combinations to a SAS data set:

proc freq data=Sashelp.Cars noprint;
where Origin ^= "USA";
tables Origin*Type / nocum norow nocol nopercent
                     out=FreqOut;  /* unique combinations of Origin and Type */
run;

In PROC IML, you can read the FreqOut data to obtain the unique combinations of the Origin and Type variables. You can iterate over these combinations, reading the Sashelp.Cars data multiple times. During each iteration, you can analyze one of the BY groups, as follows:

proc iml;
use FreqOut;
   read all var {Origin Type};           /* read unique levels of BY groups */
close FreqOut;
NumGroups = nrow(Origin);
 
use Sashelp.Cars where(Origin ^= "USA"); /* open data set for reading */
Stats = j(NumGroups, 2);                 /* allocate vector for results */
do i = 1 to NumGroups;                   /* for each BY group... */
   /* read unsorted data to obtain the i_th BY group */
   /* Notice the EXPRESSIONS in the WHERE clause! */
   read all var {MPG_City} where(origin=(Origin[i]) & type=(Type[i])); 
   Stats[i, 1] = countn(MPG_City);       /* number of nonmissing obs */
   Stats[i, 2] = mean(MPG_City);         /* analyze this BY group */
end;
close Sashelp.Cars;
 
print Origin Type Stats[colname={"N" "Mean"}];
whereiml2

The result of the analysis is similar to the output from PROC MEANS. Notice the use of expressions in the WHERE clause in the READ statement. The expression origin=(Origin[i]) is interpreted as follows:

  • The left side of the equal sign (origin) specifies the name of a variable in the open data set.
  • The right side of the equal sign must be enclosed in parentheses unless it is a literal constant.
  • The expression inside the parentheses can be any matrix computation that results in a scalar value, including calls to built-in or user-defined functions.

The example program reads the data set 10 times, once for each unique combination of Origin and Type. Although re-reading data is inefficient, there are three advantages: the data set does not need to be sorted, only one BY group at a time is ever in RAM, and the program statements are easy to write. By using this method, you do not have to keep track of sorting, indexing, or extracting the data. The WHERE clause in SAS/IML does the work for you.

tags: 13.1, Data Analysis

The post The WHERE clause in SAS/IML appeared first on The DO Loop.

11月 242014
 

SAS software contains a lot of features, and each release adds more.To make sure that you do not miss new features that appear in the SAS/IML language, the word cloud on the right sidebar of my blog contains numbers that relate to SAS or SAS/IML releases. For example, you can click on "9.3" to read about features that first appeared in SAS 9.3. I have also written summaries of recent SAS/IML releases:

Over the past year I've blogged about features that were new to SAS/IML 13.1, which was released in December, 2013, as part of the first maintenance release of SAS 9.4 (SAS 9.4m1). This article collects all those blog posts together for easy reference.

New functions and subroutines in SAS/IML 13.1

The following blog posts discuss new functions and subroutines for data analysis in SAS/IML 13.1:

  • The CV function computes the sample coeficient of variation.
  • The SKEWNESS function computes the sample skewness.
  • The KURTOSIS function computes the sample kurtosis.
  • The LOGABSDET function computes the natural logarithm of the absolute value of the determinant of a matrix. (Say that three times fast!)
  • The PARENTNAME function enables a module to learn the name of a SAS/IML matrix that was passed in as an argument.

In addition, the SAS/IML 13.1 User's Guide documents two new functions for solving linear programming problems:

  • The LPSOLVE subroutine solve linear programming problems. LPSOLVE replaces the older LP call, which has been deprecated.
  • The MILPSOLVE subroutine is a new subroutine for solving mixed integer linear programming problems. It implements effective techniques for finding optimal solutions for linear objective functions that satisfy certain constraints.

New support for heat maps

There are also new routines for creating heat maps that visualize matrices. I produced a video about heat maps, as well as the following articles:

  • The HEATMAPCONT subroutine creates a heat map that uses a continuous color ramp to visualize a matrix.
  • The HEATMAPDISC subroutine creates a heat map that uses a discrete color ramp to visualize a matrix that contains a small number of distinct values.
  • The PALETTE function enables you to choose color palettes that reflect sound design principles.

Enhancements to functionality

There were several enhancements and language improvements in SAS/IML 13.1. For example, the ABORT and STOP statements now optionally print a user-defined message. Another change is that the order of resolution has changed for user-defined modules, so that it is easier to override built-in functions. I direct you to the "What's New" chapter of the documentation for additional new features.

It can be hard to keep up with enhancements to SAS software. Hopefully this reference page will be a useful to SAS/IML users who are upgrading their version of SAS. Have you used any of these new features? Leave a comment and tell me which is your favorite.

tags: 13.1, Getting Started
11月 192014
 

I sometimes wonder whether some functions and options in SAS software ever get used. Last week I was reviewing new features that were added to SAS/IML 13.1. One of the new functions is the CV function, which computes the sample coefficient of variation for data.

Maybe it is just me, but when I compute descriptive statistics for univariate data, the coefficient of variation is not a statistic that I look at. I don't think my undergraduate statistics course even mentioned the coefficient of variation (CV). I first encountered the idea many years later when learning about distribution theory.

The CV is a simple idea. For a distribution, the coefficient of variation is the ratio of the standard deviation to the mean: CV = σ/μ. You can estimate the coefficient of variation from a sample by using the ratio of the sample standard deviation and the sample mean, usually multiplied by 100 so that it is on the percent scale. This ratio is also known as the relative standard deviation when the data are positive.

What does the coefficient of variation mean?

The coefficient of variation is a dimensionless quantity. As such, it provides a measure of the variability of a sample without reference to the scale of the data.

Suppose I tell two people to measure the heights of some plants. The first person reports that the average height is 1.2 meters, with a standard deviation of 0.275 meters. The second person measures the same plants in centimeters. She reports that the average height is 120 centimeters, with a standard deviation of 27.5 centimeters. Obviously, these are the same answers, but one person reports a standard deviation of 0.275 (which sounds small) whereas the other person reports a standard deviation of 27.2 (which sounds big). The coefficient of variation comes to the rescue: for both sets of measurements the coefficient of variation is 22.9.

The CV can also help you compare two completely different measurements. How does variation in height compare to variation in weight? Or age? Or income? These variables are measured on different scales and use different units, but the CV (which is dimensionless) enables you to compare the variation of these variables.

How to compute the coefficient of variation in SAS

The coefficient of variation is computed by several SAS procedures: MEANS, UNIVARIATE, IML, TABULATE, and so forth. The following example shows data for the plant measurement example in the previous paragraph. The MEANS and IML procedure compute the CV for measurements on the meter and centimeter scales:

data Plants;
input height @@;
cm = height * 100;
datalines;
1.6 1.5 .8 1.0 1.2 .9 1.2 1.8 1.2 1.3 1.3 .9 1.2 1.0 1.1
;
proc means data=Plants N mean std cv;
run;
 
proc iml;
use Plants; read all var _NUM_ into X[c=varNames]; close;
cv = cv(X);
print cv[c=varNames];
cv

Theoretical uses of the coefficient of variation

The coefficient of variation has some interesting uses as a theoretical tool. It enables you to compare the variation between different probability distributions. As I mentioned in my article on fat-tailed and long-tailed distributions, the exponential distribution is an important reference distribution in the theory of distributions. Because the standard deviation and the mean of an exponential distribution are equal, the exponential distribution has a CV equal to 1. Distributions with CV < 1 are considered low-variance distributions. Distributions with CV > 1 are high-variance distributions.

Obviously the coefficient of variation is undefined for symmetric distributions such as the normal and t distributions, which is perhaps why the CV is not widely used. The sample CV is undefined for centered data and is highly variable when the population mean is close to zero.

Do you use the coefficient of variation?

Have you ever used the coefficient of variation in a real data analysis problem? Is the CV a useful but underutilized statistic for practical data analysis? Or is it primarily a theoretical tool for comparing the variability of distributions? Leave a comment.

tags: 13.1, Statistical Thinking
11月 172014
 

Have you ever noticed that some SAS/IML programmers use the CALL statement to call a subroutine, whereas others use the RUN statement? Have you ever wondered why the SAS/IML language has two statements that do the same thing?

It turns out that the CALL statement and the RUN statement do not do the same thing! Read on to discover how they differ.

By the way, the RUN statement in PROC IML has only one purpose: to execute a subroutine. Many other SAS procedures (such as REG, GLM, and DATASETS) use the RUN statement to tell SAS to run the procedure. PROC IML is different. Never put the RUN statement at the end of a PROC IML program.

Built-in versus user-defined subroutines

A main feature of the SAS/IML language is that it enables you to define your own modules. A module is a user-defined function or subroutine that you can call from an IML program. A module that returns a value is called a function module; a module that does not return a value is called a subroutine module.

Although I do not recommend that you do so, it is possible to define a subroutine module that has the same name as a built-in subroutine. These two routines co-exist. If you use the RUN statement, SAS/IML will call the subroutine that you wrote. If you use the CALL statement, SAS/IML will call the built-in subroutine.

So that's the difference between the RUN and CALL statements: The RUN statement looks first for a user-defined module with the specified name. If it finds it, then it calls that module. Otherwise, it calls the built-in subroutine that has the specified name. The CALL statement looks for a built-in subroutine and immediately calls that routine if it is found.

An example: Overriding the EIGEN subroutine

As I said, I don't recommend that you routinely write user-defined modules that have the same name as a SAS/IML built-in, but let's examine a situation in which it might be convenient to do so. Let's assume that you often read data sets that represent symmetric matrices, and that these data sets are stored in lower triangular form with missing value in the upper triangular portion of the matrix. For example, this is the default storage method for distance matrices that are created by using the DISTANCE procedure in SAS/STAT software, as shown below:

proc distance data=Sashelp.Cars out=Dist method=Euclid;
   where Type="Truck";
   var interval(Horsepower--Length / std=Std);
   id Model;
run;
 
proc iml;
use Dist;   read all var _NUM_ into D[r=Model];   close Dist;
call heatmapcont(D) title="Distance Matrix";
runcall

The program creates a 24 x 24 symmetric matrix of distances between 24 observations in a six-dimensional space of variables. The program reads the distance matrix into a SAS/IML matrix. A heat map shows the matrix values. The dark gray color shows that the upper triangular portion of the matrix is missing. The white diagonal elements show that the diagonal of the matrix is exactly zero. The remaining cells indicate the distance between observations. Nearly white cells indicate that two observations are close together. Dark cells indicate that observations are relatively far apart.

Suppose that you want to compute the eigenvalues and eigenvectors of this matrix. The built-in EIGEN subroutine in SAS/IML can compute these quantities, but it expects a matrix that does not have any missing values. Therefore to compute the eigenvalues you should first extract the lower triangular elements of the matrix and copy them into the upper triangular portion of the matrix.

You could write a separate subroutine that only copies the lower triangular elements, but for this example I will write a subroutine that has the same name as the built-in EIGEN subroutine. The following statements define a module that inspects the upper triangular elements of a matrix. If the upper triangular elements are all missing, it replaces those missing values with the lower triangular elements. It then calls the built-in EIGEN function to compute the eigenvalues and eigenvectors:

start StrictLowerTriangular(X);    /* return lower triangular elements */
   return( remove(vech(X), cusum(1 || (ncol(X):2))) );
finish;
 
/* define EIGEN module as a custom override of the built-in subroutine */
start Eigen(eval, evec, A);
   r = row(A); c = col(A);
   UpperIdx = loc(c>r);
   /* if upper triangular elements are missing, copy from lower */
   if all(A[UpperIdx]=.) then
      A[UpperIdx] = StrictLowerTriangular(A);
   call eigen(eval, evec, A);          /* CALL the built-in EIGEN subroutine */
finish;
 
run eigen(eval, evec, D);              /* RUN the user-defined EIGEN subroutine */

In summary, the CALL and RUN statements enable you to choose between a built-in subroutine and a custom subroutine that have the same name.

The situation is a little different for user-defined functions. Beginning with SAS/IML 13.1, you can define a function that has the same name as a built-in function, and the order of resolution for calling functions ensures that a user-defined function is found before a built-in function that has the same name. This means that if you choose to override a SAS/IML function, you lose the ability to call the built-in function until you quit PROC IML.

Have you ever had the need to override a built-in SAS/IML subroutine? Let me know the details by leaving a comment.

tags: 13.1, Getting Started
10月 272014
 

One of my presentations at SAS Global Forum 2014 was about the new heat map functions in SAS/IML 13.1. Over the summer I created a short video of my presentation, which gives an overview of visualizing matrices with heat maps, and describes how to choose colors for heat maps:



If your browser does not support embedded video, you can go directly to the video on YouTube.

If you prefer to read articles about heat maps so that you can study the concepts and cut and paste examples, here are a few recent blog posts that are based on my SAS Global Forum presentation:

For a fun application of discrete heat maps, see my article about implementing a one-dimensional cellular automata in SAS.

tags: 13.1, Statistical Graphics, Video
10月 222014
 

What is kurtosis? What does negative or positive kurtosis mean, and why should you care? How do you compute kurtosis in SAS software?

It is not clear from the definition of kurtosis what (if anything) kurtosis tells us about the shape of a distribution, or why kurtosis is relevant to the practicing data analyst. Mathematically, the kurtosis of a distribution is defined in terms of the standardized fourth central moment. The estimate of kurtosis from a sample also involves a summation that accumulates the fourth powers of a standardized variable. In SAS software, the formula for the sample kurtosis is given in the "Simple Statistics" section of the documentation.

This article presents an intuitive explanation of kurtosis that is often true in practice.

What does kurtosis mean?

Let's compute the sample kurtosis for two variables in the Sashelp.Heart data set and see whether the kurtosis values tell us anything about the shape of the data distributions. The variables are AgeAtStart (the age of a patient when he or she entered the study) and Systolic (his or her systolic blood pressure). The following statements call the MEANS procedure to compute estimates for the mean, median, standard deviation, skewness, and kurtosis:

proc means data=sashelp.heart mean median std skew kurtosis;
var AgeAtStart Systolic;
run;
t_kurt1

The last column shows that the kurtosis of the AgeAtStart data is negative, whereas the kurtosis of the Systolic data is positive. What does that mean? Well, first recall that whereas the mean, median, and standard deviation are expressed in the same units as the data, kurtosis (like skewness) is a dimensionless quantity. The zero value corresponds to a standard reference situation. For kurtosis, the reference distribution is the normal distribution, which is defined to have a kurtosis of zero. (Technically, I am describing the excess kurtosis, since this is the value returned by statistical software packages. Researchers sometimes consider the full kurtosis, which is 3 more than the excess kurtosis.)

The interpretation of kurtosis in terms of the shape of the distribution can be tricky, but for many unimodal distributions the kurtosis compares the peak and tails of the distribution to the normal distribution:

  • A data distribution with negative kurtosis is broader, flatter, and has thinner tails than the normal distribution.
  • A data distribution with positive kurtosis is narrower at its peak and has fatter tails than the normal distribution.

Negative kurtosis for a sample indicates that the sample contains many observations that are a moderate distance from the center, but few outliers that are far from the center. The peak—if there is one—looks more like a butte or a rounded hilltop than a jutting spire. The canonical distribution that has negative kurtosis is the continuous uniform distribution, which has a kurtosis of –1.2.

In contrast, positive kurtosis indicates that many values will be near the center, but that you should also expect a certain proportion of values that are far from the center. A heavy-tailed distribution has large kurtosis. The canonical distribution that has a large positive kurtosis is the t distribution with a small number of degrees of freedom. Some heavy-tailed distributions have infinite kurtosis.

Graphically testing the kurtosis interpretation

You can run a little experiment to see how the shape of the AgeAtStart and Systolic distributions compares with the normal distribution. What would you expect to see if you overlay a normal curve on a histogram of those variables? For the AgeAtStart variable, which has negative kurtosis, you should observe a distribution that does not have a very pronounced peak, that is "broad in the shoulders," and that has thin tails. For the Systolic variable, you should observe a distribution that has a pronounced peak, drops off faster than a normal distribution, and has at least one heavy tail. The following statements use PROC SGPLOT to overlay a histogram with a normal curve centered at the median:

ods graphics / width=330px height=250px;
proc sgplot data=sashelp.heart noautolegend;
histogram AgeAtStart / binwidth=5 binstart=30 showbins;
density AgeAtStart / type=normal(mu=43); /* center at median */
run;
 
proc sgplot data=sashelp.heart noautolegend;
histogram Systolic / binwidth=10 binstart=80 showbins;
density Systolic / type=normal(mu=132); /* center at median */
run;
kurt1

The histogram for the AgeAtStart variable shows that the distribution is broad and flat. It does not have a pronounced central peak, and there is substantial mass along most of the range of the data. The distribution appears to be bounded: Apparently the heart study did not accept any patients who were younger than 28 or older than 62. This implies that there are no tails for the population distribution. Overall, the AgeAtStart variable provides a prototypical example of a distribution that has negative kurtosis.


kurt2

The histogram of the Systolic variable has a narrow peak and a long right tail. There are many patients in the study with systolic blood pressure in the range 120–140. (The American Heart Association calls that range prehypertensive.) Compared to the normal distribution, there are relatively few observations in the ranges (80, 120) and (140, 170). However, the distribution has more extreme observations than would be expected for a normally distributed variable, so the distribution has a fat tail.

Computing kurtosis in SAS software

There are three SAS procedures that compute the kurtosis of data distributions. PROC MEANS was used earlier in this article. Use PROC UNIVARIATE when you want to fit a univariate model to the data. PROC IML includes the KURTOSIS function in SAS/IML 13.1, as shown in the following example:

proc iml;
varNames = {"AgeAtStart" "Systolic"};
use Sashelp.Heart;  read all var varNames into X;  close Sashelp.Heart;
 
kurt = kurtosis(X);
print kurt[colname=varNames];
t_kurt2

When the intuitive interpretation of kurtosis can fail

The relationship between kurtosis and the shape of the distribution are not mathematical rules, they are merely intuitive ideas that often hold true for unimodal distributions. If your distribution is multimodal or is a mixture of distributions, these intuitive rules might not apply.

There have been many articles about kurtosis and its interpretation. My favorite is an article by Balanda and MacGillivray (1988, TAS). They give examples of three different densities that each have the same kurtosis value but that look very different. They conclude that "because of the averaging process involved in its definition, a given value of [kurtosis] can correspond to several different distributional shapes" (p. 114).

They recommend that kurtosis be defined as "the location- and scale-free movement of probability mass from the shoulders of a distribution into its center and tails. In particular, this definition implies that peakedness and tail weight are best viewed as components [emphasis mine] of kurtosis.... This definition is necessarily vague because the movement can be formalized in many ways" (p. 116). In other words, the peaks and tails of a distribution contribute to the value of the kurtosis, but so do other features.

Although Balanda and MacGillivray do not mention it, it is also worth remembering that kurtosis is a non-robust statistic that can be severely influenced by the value of a single outlier. For example, if you choose 999 observations from a normal distribution, the sample kurtosis will be close to 0. However, if you add a single observation that has the value 100, the sample kurtosis jumps to more than 800!

If you like pathological distributions and mathematically contrived examples of all the ways that kurtosis can fail to indicate the shape of a distribution, see Balanda and MacGillivray's article. However, the intuitive notions in this article hold true for many unimodal data distributions that arise in practice.

tags: 13.1, Data Analysis, Statistical Thinking