Statistical Programming

April 12, 2017
 

At SAS Global Forum last week, I saw a poster that used SAS/IML to optimize a quadratic objective function that arises in financial portfolio management (Xia, Eberhardt, and Kastin, 2017). The authors used the Newton-Raphson optimizer (NLPNRA routine) in SAS/IML to optimize a hypothetical portfolio of assets.

The Newton-Raphson algorithm is one of my favorite optimizers. However, for a quadratic objective function there is a simpler choice. You can use the NLPQUA function in SAS/IML to optimize a quadratic objective function.

[Figure: Optimal value of a quadratic function of two variables]

Optimize an unconstrained quadratic objective function in SAS/IML

Let's see how this works by specifying a quadratic polynomial in two variables. The function is
f(x,y) = (9*x##2 + x#y + 4*y##2) - 12*x - 4*y + 6
       = 0.5*x` H x + L x + 6
where H = {18  1, 1  8} is a 2x2 symmetric matrix, L = {-12  -4} is a row vector, and x = {x, y} is a column vector.

To use the NLPQUA subroutine, you need to write down the Hessian matrix, H, which is the symmetric matrix of second derivatives. Of course, for a quadratic function the second derivatives are constants. In the following SAS/IML program, the matrix H contains the second derivatives and the vector lin contains the coefficients for the linear and constant terms of f:

/* objective function is
   f(x,y) = (3*x-2)##2 + (2*y-1)##2 + x#y + 1
          = (9*x##2  + x#y + 4*y##2)  + -12*x  - 4*y + 6
   grad(f) = g(x,y) = [ 18*x + y - 12,  x + 8*y - 4 ]
   hess(f)  = [ dg1/dx  dg2/dx ]  =  [ 18  1 ]
              [ dg1/dy  dg2/dy ]     [  1  8 ]
*/
proc iml;
H = {18 1,             /* matrix of second derivatives */
      1 8};
lin = { -12 -4 6 };    /* vector of linear terms and the constant term */

The NLPQUA subroutine enables you to specify the Hessian matrix and the linear coefficients. From those values, the routine uses an efficient solver to find the global minimum of the quadratic polynomial, which occurs at x=92/143, y=60/143:

x0 = j(1, 2, 1);       /* initial guess */
opt = {0 1};           /* minimize, print final parameters */
call nlpqua(rc, xOpt, H, x0, opt) lin=lin;
print xOpt; 
print xOpt[format=Fract.];
[Figure: Optimal values of a quadratic function]

Solve a constrained quadratic program in SAS/IML

You can also use the NLPQUA subroutine to solve a constrained problem. Suppose that you are only interested in values of (x,y) in the unit square and you require that the solution satisfies the linear constraint x + y = 1. In that case you can define a matrix that specifies the linear constraints as follows:

blc = {0 0 . .,   /* lower bound: 0 <= x_i        */
       1 1 . .,   /* upper bound:      x_i <= 1   */
       1 1 0 1};  /* linear constraint x +  y = 1 */
call nlpqua(rc, conOpt, H, x0, opt, blc) lin=lin;
print conOpt[format=Fract.];

Notice how the linear constraints are specified for this (and every other) NLP function. Let p be the number of parameters in the problem. The first p elements of the first row specify the lower bounds for the parameters; use a missing value for variables that are not bounded below. The last two columns of the first row are not used, so you should always set those elements to missing values. Similarly, the first p elements of the second row specify the upper bound for the variables. The last two columns of the second row are not used, so set those elements to missing values.

Additional rows specify coefficients for the linear constraints. An equality constraint of the form a1*x1 + a2*x2 + ... + an*xn = c is encoded as a row of coefficients {a1 a2 ... an 0 c}, where the 0 in the penultimate location indicates equality. A "less than" inequality of the form a1*x1 + a2*x2 + ... + an*xn ≤ c is encoded as the row {a1 a2 ... an  -1  c}, where the -1 indicates the "less than or equal" symbol. Similarly, a value of +1 in the penultimate location indicates the "greater than or equal" symbol (≥). See the documentation for the NLP routines for additional details about specifying constraints.
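For example, the following hypothetical constraint matrix (my illustration, not part of the original problem) encodes the bounds 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1, the inequality x + 2y ≤ 3, and the equality x + y = 1 for a two-parameter problem:

blc = {0 0  .  .,   /* lower bounds: 0 <= x, 0 <= y                */
       1 1  .  .,   /* upper bounds: x <= 1, y <= 1                */
       1 2 -1  3,   /* inequality: 1*x + 2*y <= 3  (-1 means "<=") */
       1 1  0  1};  /* equality:   1*x + 1*y  = 1  ( 0 means  "=") */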

Solve quadratic programs in PROC OPTMODEL

If you have access to SAS/OR software, PROC OPTMODEL provides a natural language for specifying and solving both simple and complex optimization problems. For this problem, PROC OPTMODEL detects that the objective function is quadratic and automatically calls a quadratic programming (QP) solver:

/* variable names x and y */
proc optmodel;
   /* unconstrained quadratic optimization */
   var x, y;
   min f = 9*x^2 + x*y + 4*y^2 - 12*x - 4*y + 6;
   solve;                    /* call QP solver */
   print x Fract. y Fract.;
 
   /* constrained quadratic optimization */
   x.lb = 0; x.ub = 1;        /* bounds on variables */
   y.lb = 0; y.ub = 1;
   con SumToOne:              /* linear constraint */
      x + y = 1;
   solve;                     /* call QP solver */
   print x Fract. y Fract.;
quit;

In summary, if you want to optimize a quadratic objective function (with or without constraints), use the NLPQUA subroutine in SAS/IML, which is a fast and efficient way to maximize or minimize a quadratic function of many variables. If you have access to SAS/OR software, PROC OPTMODEL provides an even simpler syntax.

The post Quadratic optimization in SAS appeared first on The DO Loop.

April 10, 2017
 

Many intervals in statistics have the form p ± δ, where p is a point estimate and δ is the radius (or half-width) of the interval. (For example, many two-sided confidence intervals have this form, where δ is proportional to the standard error.) Many years ago I wrote an article that mentioned that you can construct these intervals in the SAS/IML language by using a concatenation operator (|| or //). The concatenation creates a two-element vector, like this:

proc iml;
mu = 50; 
delta = 1.5;
CI = mu - delta  || mu + delta;   /* horizontal concatenation ==> 1x2 vector */

Last week it occurred to me that there is a simple trick that is even easier: use the fact that SAS/IML is a matrix-vector language to encode the "±" sign as a vector {-1, 1}. When SAS/IML sees a scalar multiplied by a vector, the result will be a vector:

CI = mu + {-1  1}*delta;         /* vector operation ==> 1x2 vector */
print CI;

You can extend this example to compute many intervals by using a single statement. For example, in elementary statistics we learn the "68-95-99.7 rule" for the normal distribution. The rule says that in a random sample drawn from a normal population, about 68% of the observations will be within 1 standard deviation of the mean, about 95% will be within 2 standard deviations, and about 99.7% will be within 3 standard deviations of the mean. You can construct those intervals by using a "multiplier matrix" whose first row is {-1, +1}, whose second row is {-2, +2}, and whose third row is {-3, +3}. The following SAS/IML statements construct the three intervals for the 68-95-99.7 rule for a normal population with mean 50 and standard deviation 8:

mu = 50;  sigma = 8;
m = {-1 1,
     -2 2,
     -3 3};
 
Intervals = mu + m*sigma;
ApproxPct = {"68%", "95%", "99.7%"};
print Intervals[rowname=ApproxPct];

Just for fun, let's simulate a large sample from the normal population and empirically confirm the 68-95-99.7 rule. You can use the RANDFUN function to generate a random sample and use the BIN function to detect which observations are in each interval:

call randseed(12345);
n = 10000;                                /* sample size */
x = randfun(n, "Normal", mu, sigma);      /* simulate normal sample */
ObservedPct = j(3,1,0);
do i = 1 to 3;
   b = bin(x, Intervals[i,]);             /* b[i]=1 if x[i] in interval */
   ObservedPct[i] = sum(b) / n;           /* percentage of x in interval */
end;
 
results = Intervals || {0.68, 0.95, 0.997} || ObservedPct;
print results[colname={"Lower" "Upper" "Predicted" "Observed"}
              label="Probability of Normal Variate in Intervals: X ~ N(50, 8)"];

The simulation confirms the 68-95-99.7 rule. Remember that the rule is a mnemonic device. You can compute the exact probabilities by using the CDF function. In SAS/IML, the exact computation is p = cdf("Normal", m[,2]) - cdf("Normal", m[,1]);
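For example, the following statements (a continuation of the program above) compute and print the exact probabilities:

p = cdf("Normal", m[,2]) - cdf("Normal", m[,1]);  /* standard normal CDF */
print p[format=7.4 rowname=ApproxPct];            /* 0.6827, 0.9545, 0.9973 */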

In summary, the SAS/IML language provides an easy syntax to construct intervals that are symmetric about a central value. You can use a vector such as {-1, 1} to construct an interval of the form p ± δ, or you can use a k x 2 matrix to construct k symmetric intervals.

The post A simple trick to construct symmetric intervals appeared first on The DO Loop.

April 5, 2017
 

Most regression models try to model a response variable by using a smooth function of the explanatory variables. However, if the data are generated from some nonsmooth process, then it makes sense to use a regression function that is not smooth. A simple way to model a discontinuous process in SAS is to use spline effects and specify repeated values for the knots.

Discontinuous processes: More common than you might think

The classical ANOVA is one way to analyze data that are collected before and after a known event. For example, you might record gas mileage for a car before and after a tune-up. You might collect patient data before and after they undergo a medical or surgical treatment. You might have data about real estate prices before and after some natural disaster. In all these cases, you might suspect that the response changes abruptly because of the event.

To give a simple example, suppose that a driver records the fuel economy (in miles per gallon) for a car for 12 weeks. Because the car engine is coughing and knocking, the owner brings the car to a mechanic for maintenance. After the maintenance, the car seems to run better, and he records the fuel economy for another six weeks. The hypothetical data are below:

data MPG;
input mpg @@;
week = _N_;
period = ifc(week > 12, "After ", "Before");
label mpg="Miles per Gallon";
datalines;
30.5 28.1 27.1 31.2 25.2 31.1 27.7 28.2 29.6 30.6 28.9 25.9 
30.6 33.0 31.2 29.7 32.7 31.1 
;

Notice that the data contains a binary indicator variable (period) that records whether the data are from before or after the tune-up. You can use PROC GLM to perform a simple ANOVA analysis to determine whether there was a significant change in the mean fuel economy after the maintenance. The following call to PROC GLM runs an ANOVA and indicates that the mean fuel economy is about 2.7 mpg better after the tune-up.

proc glm data=MPG plots=fitplot;
   class period / ref=first;
   model mpg = period /solution;
   output out=out predicted=Pred;
quit;

Graphically, this regression analysis is usually visualized by using two box plots. (PROC GLM creates the box plots automatically when ODS graphics are enabled.) However, because the independent variable is time, you could also use a series plot to show the observed data and the mean response before and after the maintenance. By using the GROUP= option on the SERIES statement, you can get two lines for the "before" and "after" time periods.

title "Piecewise Constant Regression with Jump Discontinuity";
proc sgplot data=Out;
   block x=week block=period / transparency=0.8;
   scatter x=week y=mpg / markerattrs=(symbol=CircleFilled color=black);
   series x=week y=pred / group=period lineattrs=(thickness=3) ;
run;
[Figure: Piecewise constant regression function with jump discontinuity]

The graph shows that the model has a jump discontinuity at the time point at which the maintenance intervention occurred. If you include the WEEK variable in the analysis, you can model the response as a linear function of time, with a jump at the time of the tune-up.

All this is probably very familiar. However, did you know that you can use splines to model the data as a continuous function that has a kink or "corner" at the time of the maintenance event? You can use this feature when the model is continuous, but the slope changes at a known time.

Splines for nonsmooth models

Several SAS procedures support the EFFECT statement, which enables you to build spline effects. The paper "Rediscovering SAS/IML Software" (Wicklin 2010, p. 4) has an example where splines are used to construct a highly nonlinear curve for a scatter plot.

A spline effect is determined by the placement of certain points called "knots." Often knots are evenly spaced within the range of the explanatory variable, but the EFFECT statement supports many other ways to position the knots. In fact, the documentation for the EFFECT statement says: "If you remove the restriction that the knots of a spline must be distinct and allow repeated knots, then you can obtain functions with less smoothness and even discontinuities at the repeated knot location. For a spline of degree d and a repeated knot with multiplicity m ≤ d, the piecewise polynomials that join such a knot are required to have only d – m matching derivatives."

The degree of a linear regression is d=1, so if you specify a knot position once you obtain a piecewise linear function that contains a "kink" at the knot. The following call to PROC GLIMMIX demonstrates this technique. (I use GLIMMIX because neither PROC GLM nor PROC GENMOD supports the EFFECT statement.) You can manually specify the position of knots by using the KNOTMETHOD=LIST(list) option on the EFFECT statement.

proc glimmix data=MPG;
   effect spl = spline(week / degree=1 knotmethod=list(1 13 18));     /* p-w linear */
   *effect spl = spline(week / degree=2 knotmethod=list(1 13 13 18)); /* p-w quadratic */
   model mpg = spl / solution;  
   output out=out predicted=Pred;
quit;
 
title "Piecewise Linear Regression with Kink";
proc sgplot data=Out noautolegend;
   block x=week block=period / transparency=0.8;
   scatter x=week y=mpg / markerattrs=(symbol=CircleFilled color=black);
   series x=week y=pred / lineattrs=(thickness=3) ;
run;
[Figure: Piecewise linear regression with a kink]

The graph shows that the model is piecewise linear, but the slope of the model changes at week=13. In contrast, the second EFFECT statement in the PROC GLIMMIX code (which is commented out) specifies piecewise quadratic polynomials (d=2) and repeats the knot at week=13. That results in two quadratic models that give the same predicted value at week=13, but the model is not smooth at that location. Try it out!

If you are using a SAS procedure that does not support the EFFECT statement, you can use the GLMSELECT procedure to output the dummy variables that are associated with the spline effects. A nice paper by David Pasta (2003) describes how to use dummy variables in a variety of models. The paper was written before the EFFECT statement existed; many of the ideas in the paper are easier to implement by using the EFFECT statement.
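For example, the following sketch (my addition; it is not from the Pasta paper) uses the OUTDESIGN= option in PROC GLMSELECT to write the spline basis columns to a data set. The SELECTION=NONE option suppresses variable selection, so the procedure merely constructs the design matrix:

proc glmselect data=MPG outdesign(addinputvars)=SplineBasis;
   effect spl = spline(week / degree=1 knotmethod=list(1 13 18));
   model mpg = spl / selection=none;   /* build the design matrix only */
run;
 
proc reg data=SplineBasis;             /* use the dummy variables in a */
   model mpg = spl:;                   /* procedure without the EFFECT statement */
quit;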

Lastly, the TRANSREG procedure in SAS supports spline effects but has its own syntax. See the TRANSREG documentation, which includes an example of repeating knots to build a regression model for discontinuous data.

Have you ever needed to construct a nonsmooth regression model? Tell your story by leaving a comment.

The post Nonsmooth models and spline effects appeared first on The DO Loop.

April 3, 2017
 

One of the advantages of the new mixed-type tables in SAS/IML 14.2 is the greatly enhanced printing functionality. You can control which rows and columns are printed, specify formats for individual columns, and even use templates to completely customize how tables are printed. The following example reads data into a table by using the TableCreateFromDataSet function and then uses the TablePrint subroutine to print a customized portion of the table:

data class;
set sashelp.class;
Birthday = '03APR2017'd - age*365 - floor(365*uniform(1)); /* create birthday */
format Birthday DATE9.;
run;
 
proc iml;
tClass = TableCreateFromDataSet("Class");    /* read data into table */
run TablePrint(tClass) firstobs=3 numobs=5 
                       var={"Age" "Birthday"} 
                       ID="Name"
                       label="Subset of Class";
[Figure: Basic printing of SAS/IML tables]

Notice that the table contains both numeric and character columns. Furthermore, the numeric columns have different formats. The TablePrint subroutine has some distinct advantages over the traditional PRINT statement in SAS/IML:

  • The TablePrint subroutine supports an easy way to display a range of observations. When you use the PRINT statement for multiple vectors, you have to use row subscripts in each vector, such as PRINT (X[3:8,]) (Y[3:8,]);
  • The TablePrint subroutine supports printing any columns in any order. When you use the PRINT statement on a matrix, you have to use column subscripts to change the order of the matrix columns: PRINT (X[, {2 3 1}]);
  • The PRINT statement supports the ROWNAME= option (for specifying row headers), the COLNAME= option (for specifying column headers), and the LABEL= option. Those options are easy to work with when you print a single matrix. However, you can't store mixed-type data in a matrix and those options are less convenient when you print a set of vectors.

Advanced printing of SAS/IML tables

[Figure: Traffic lighting: color cells in SAS/IML tables by cell contents]

The SAS/IML documentation has several sections devoted to advanced printing options, including how to color cells according to their values ("traffic lighting") and how to customize tables by using ODS templates.

For statistical programmers, the ability to use ODS templates means that output from PROC IML can look the same as output from other SAS procedures. For example, suppose that you have a table that contains parameter estimates for a linear regression. The following example prints that table by using the same ODS template that PROC REG uses, which is the Stat.Reg.ParameterEstimates template:

proc iml;
vars = {"Intercept", X, Z};
stats = {32.19   5.08   21.42   42.97,
          0.138  0.0348  0.0644  0.2117, 
          1.227  0.5302  0.1027  2.3506 }; 
tbl = TableCreate("Variable", vars);
call TableAddVar(tbl, {"Estimate" "StdErr" "LowerCL" "UpperCL"},  stats);
 
Confidence=95;
call TablePrint(tbl) template="Stat.Reg.ParameterEstimates"
                     dynamic={Confidence};
[Figure: Print SAS/IML tables by using existing ODS templates]

This example works because the column names in the SAS/IML table match the names that are expected by the Stat.Reg.ParameterEstimates template. The DYNAMIC= option specifies a dynamic variable (Confidence) that the template requires. See the documentation for further details.

Summary

In summary, the TablePrint subroutine in SAS/IML gives programmers control over many options for printing tables of data and statistics. For complex layouts, you can use an existing ODS template or create your own template to customize virtually every aspect of your tabular displays.

The post Print tables in SAS/IML appeared first on The DO Loop.

March 29, 2017
 

Lists are collections of objects. SAS/IML 14.2 supports lists as a way to store matrices, data tables, and other lists in a single object that you can pass to functions. SAS/IML lists automatically grow if you add new items to them and shrink if you remove items. You can also modify existing items. You can use indices to refer to items of a list, or you can assign names to the items to create a named list.

Create a list

You can create a list by using the ListCreate function and then use the ListSetItem subroutine to assign a value to each item:

proc iml;
L = ListCreate(2);                  /* L is two-item list */
call ListSetItem(L, 1, 1:3);        /* 1st item is 1x3 vector */
call ListSetItem(L, 2, {4 3, 2 1}); /* 2nd item is 2x2 matrix */

The arguments for the ListSetItem subroutine are list, index, and value, which is the same order that you use when you assign a value to a matrix element, such as A[2] = 5.

If you do not know how many items the list will contain, or if you later want to add additional items, you can use the ListAddItem subroutine, which appends an item to the end of the list:

X = {3 1 4, 2 5 3};                 /* create a 2x3 matrix */
call ListAddItem(L, X);             /* add 3rd item; list grows */

Notice that the syntax for the ListAddItem subroutine only requires the list and the value because the new item is always added to the end of the list. At this point, the list contains three items, as shown below:

[Figure: Items in a SAS/IML list can have different shapes and sizes]

Insert, modify, and delete list items

A convenient feature of SAS/IML lists is that they automatically resize. You can insert new items into any position in the list. You can also replace an existing item or delete an item. The following statements demonstrate modifying an existing list.

call ListInsertItem(L, 2, -1:1);    /* insert new item before item 2; list grows */
call ListSetItem(L, 1, {A B, C D}); /* replace 1st item with 2x2 character matrix */
call ListDeleteItem(L, 3);          /* delete 3rd item; list shrinks */

When you delete the item at position k, existing higher-indexed items are renumbered. (Conceptually, items "move to the left.") After the previous sequence of operations, the list contains the following items:

[Figure: You can insert, modify, and delete items in a SAS/IML list]

Create named lists

In the previous section, the list acted like a dynamic array: it stored items that you could access by using indices such as 1, 2, and 3. For some applications, it makes more sense to name the list items and refer to the items by their names rather than their positions. This is similar to "structs" in some languages, where you can refer to members of a struct by name.

For example, suppose a teacher wants to store information about students in her classes. For each student, she might want to store the student's name, class, and test scores. The following statements create a SAS/IML list that has three items named "Name", "Class", and "Scores". The items are then assigned values:

Student = ListCreate({"Name" "Class" "Scores"});  /* create named list with 3 items */
call ListSetItem(Student, "Name", "Ron");         /* set "name" value for student */
call ListSetItem(Student, "Class", "Statistics"); /* set "class" value  for student */
Tests = {100 97 94 100 100};                      /* test scores */
call ListSetItem(Student, "Scores", Tests);       /* set "scores" value for student */

Although you can still use positional indices to access list items ("Ron" is the first item, "Statistics" is the second item, ...), you can also use the names of items. For example, to extract the test scores into a SAS/IML matrix or vector, you can use the ListGetItem function:

s = ListGetItem(Student, "Scores");               /* get test scores from list */

Lists are for storing items, not for computing. You cannot add, subtract, or multiply lists. However, the example shows that you can extract an item into a matrix and subsequently perform algebraic operations on the matrix.
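For example, the following statements (a continuation of the previous example) compute the mean test score by applying a subscript reduction operator to the extracted vector:

avgScore = s[:];          /* mean of all elements of the extracted vector */
print avgScore;           /* 98.2 for Ron's five scores */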

Lists of lists

Lists can contain sublists. In this way you can represent nested or hierarchical data. For example, the teacher might want to store information about several students. If information about each student is stored in a list, then she can use a list of lists to store information about multiple students.

The following statements create a list called All. The first student ("Ron") is added to the list, which means that the item is a copy of the Student list. Then information about a second student ("Karl") is stored in the Student list and copied into the All list as a second item. This process could continue until all students are stored in the list.

All = ListCreate();                  /* empty list */
call ListAddItem(All, Student);      /* add "Ron" to list */
 
call ListSetItem(Student, "Name", "Karl"); 
call ListSetItem(Student, "Class", "Calculus");
call ListSetItem(Student, "Scores", {90 92 84 70 80});  
call ListAddItem(All, Student);      /* add "Karl" to list */

At this point, the All list contains two items. Each item is a list that contains three items.
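To retrieve the information later, you can get a sublist by its index and then get items from the sublist. The following statements sketch the idea:

Ron = ListGetItem(All, 1);           /* the first item is itself a list */
name = ListGetItem(Ron, "Name");     /* drill down into the sublist     */
scores = ListGetItem(Ron, "Scores");
print name, scores;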

Summary

SAS/IML 14.2 supports lists, which are containers that store other objects. Lists dynamically grow or shrink as necessary. You can index items by their position in the list (using indices) or you can create a named list and reference items by their names. You can use built-in functions to extract items from a list or add new items to a list. You can insert, delete, and modify list items. You can represent hierarchical data by using lists of lists. For more information about SAS/IML lists, see the chapter on lists in the SAS/IML documentation. The documentation shows how lists can be used to emulate other data structures, such as stacks, queues, and trees.

The post Lists: Nonmatrix data structures in SAS/IML appeared first on The DO Loop.

March 22, 2017
 

Prior to SAS/IML 14.2, every variable in the Interactive Matrix Language (IML) represented a matrix. That changed when SAS/IML 14.2 introduced two new data structures: data tables and lists. This article gives an overview of data tables. I will blog about lists in a separate article.

A matrix is a rectangular array that contains all numerical or all character values. Numerical matrices are incredibly useful for computations because linear algebra provides a powerful set of tools for implementing analytical algorithms. However, a matrix is somewhat limiting as a data structure. Matrices are two-dimensional, rectangular, and cannot contain mixed-type data (numeric AND character). Consequently, you can't use one single matrix to pass numeric and character data to a function.

Data tables in SAS/IML are in-memory versions of a data set. They contain columns that can be numeric or character, as well as column attributes such as names, formats, and labels. The data table is associated with a single symbol and can be passed to modules or returned from a module. You can create a data table from a SAS data set by using the TableCreateFromDataSet function, as shown:

proc iml;
tClass = TableCreateFromDataSet("Sashelp", "Class"); /* SAS/IML 14.2 */

The function reads the data from the Sashelp.Class data set and creates an in-memory copy. You can use the tClass symbol to access properties of the table. For example, if you want to obtain the names of the columns in the table, you can use the TableGetVarName function:

varNames = TableGetVarName(tClass);
print varNames;
[Figure: Column names for a data table in SAS/IML]

Extracting columns and adding new columns

Data tables are not matrices. You cannot add, subtract, or multiply with tables. When you want to compute something, you need to extract the data into matrices. For example, if you want to compute the body-mass index (BMI) of the students in Sashelp.Class, you can use the TableAddVar function to add the BMI as a new column in the table:

Y = TableGetVarData(tClass, {"Weight" "Height"});
wt = Y[,1]; ht = Y[,2];                /* get Height and Weight variables */
BMI = wt / ht##2 * 703;                /* BMI formula */
call TableAddVar(tClass, "BMI", BMI);  /* add new "BMI" column to table */

Passing data tables to modules

As indicated earlier, you can use data tables to pass mixed-type data into a user-defined function. For example, the following statements define a module whose argument is a data table. The module prints the mean value of the numeric columns in the table, and it prints the number of unique levels for character columns. To do so, it first extracts the numeric data into a matrix, then later extracts the character data into a matrix.

start QuickSummary(tbl);
   type = TableIsVarNumeric(tbl);      /* 0/1 vector   */
   /* for numeric columns, print mean */
   idx = loc(type=1);                  /* numeric cols */
   if ncol(idx)>0 then do;             /* there is a numeric col */
      varNames = TableGetVarName(tbl, idx);         /* get names */
      m = TableGetVarData(tbl, idx);   /* extract numeric data   */
      mean = mean(m);
      print mean[colname=varNames L="Mean of Numeric Variables"];
   end;
   /* for character columns, print number of levels */
   idx = loc(type=0);                  /* character cols */
   if ncol(idx)>0 then do;             /* there is a character col */
      varNames = TableGetVarName(tbl, idx);           /* get names */
      m = TableGetVarData(tbl, idx);   /* extract character data   */
      levels = countunique(m, "col");
      print levels[colname=varNames L="Levels of Character Variables"];
   end;
finish;
 
run QuickSummary(tClass);
[Figure: Pass data tables to SAS/IML functions and modules]

Summary

SAS/IML 14.2 supports data tables, which are rectangular arrays of mixed-type data. You can use built-in functions to extract columns from a table, add columns to a table, and query the table for attributes of columns. For more information about data tables, see the chapter on data tables in the SAS/IML documentation.

The post Data tables: Nonmatrix data structures in SAS/IML appeared first on The DO Loop.

February 15, 2017
 

A categorical response variable can take on k different values. If you have a random sample from a multinomial response, the sample proportions estimate the proportion of each category in the population. This article describes how to construct simultaneous confidence intervals for the proportions as described in the 1997 paper "A SAS macro for constructing simultaneous confidence intervals for multinomial proportions" by Warren May and William Johnson (Computer Methods and Programs in Biomedicine, p. 153–162).

Estimates of multinomial proportions

In their paper, May and Johnson present data for a random sample of 220 psychiatric patients who were categorized as either neurotic, depressed, schizophrenic or having a personality disorder. The observed counts for each diagnosis are as follows:

data Psych;
input Category $21. Count;
datalines;
Neurotic              91
Depressed             49
Schizophrenic         37
Personality disorder  43
;

If you divide each count by the total sample size, you obtain estimates for the proportion of patients in each category in the population. However, the researchers wanted to compute simultaneous confidence intervals (CIs) for the parameters. The next section shows several methods for computing the CIs.

Methods of computing simultaneous confidence intervals

May and Johnson discussed six different methods for computing simultaneous CIs. In the following, 1–α is the desired overall coverage probability for the confidence intervals, χ2(α, k-1) is the 1–α quantile of the χ2 distribution with k-1 degrees of freedom, pi is the sample proportion for the ith category, and π1, π2, ..., πk are the true population proportions. The methods and the references for the methods are:

  1. Quesenberry and Hurst (1964): Find the parameters πi that satisfy
    N (pi - πi)2 ≤ χ2(α, k-1) πi(1-πi).
  2. Goodman (1965): Use a Bonferroni adjustment and find the parameters that satisfy
    N (pi - πi)2 ≤ χ2(α/k, 1) πi(1-πi).
  3. Binomial estimate of variance: For a binomial variable, you can bound the variance by using πi(1-πi) ≤ 1/4. You can construct a conservative CI for the multinomial proportions by finding the parameters that satisfy
    N (pi - πi)2 ≤ χ2(α, 1) (1/4).
  4. Fitzpatrick and Scott (1987): You can ignore the magnitude of the proportion when bounding the variance to obtain confidence intervals that are all the same length, regardless of the number of categories (k) or the observed proportions. The formula is
    N (pi - πi)2 ≤ c2(1/4)
    where the value c2= χ2(α/2, 1) for α ≤ 0.016 and where c2= (8/9)χ2(α/3, 1) for 0.016 < α ≤ 0.15.
  5. Q and H with sample variance: You can replace the unknown population variances by the sample variances in the Quesenberry and Hurst formula to get
    N (pi - πi)2 ≤ χ2(α, k-1) pi(1-pi).
  6. Goodman with sample variance: You can replace the unknown population variances by the sample variances in the Goodman Bonferroni-adjusted formula to get
    N (pi - πi)2 ≤ χ2(α/k, 1) pi(1-pi).

In a separate paper, May and Johnson used simulations to test the coverage probability of each of these formulas. They conclude that the simple Bonferroni-adjusted formula of Goodman (second in the list) "performs well in most practical situations when the number of categories is greater than 2 and each cell count is greater than 5, provided the number of categories is not too large." In comparison, the methods that use the sample variance (fifth and sixth in the list) are "poor." The remaining methods "perform reasonably well with respect to coverage probability but are often too wide."

A nice feature of the Q&H and Goodman methods (first and second on the list) is that they produce asymmetric intervals that are always within the interval [0,1]. In contrast, the other intervals are symmetric and might not be a subset of [0,1] for extremely small or large sample proportions.
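To make the Goodman method concrete, the following sketch (my own computation, not May and Johnson's macro) solves the Goodman inequality for the psychiatric data. Treating the inequality as an equality gives a quadratic equation in πi whose two roots are the interval endpoints:

proc iml;
Count = {91, 49, 37, 43};                /* psychiatric data */
N = sum(Count);  k = nrow(Count);
p = Count / N;                           /* sample proportions */
alpha = 0.05;
q = quantile("ChiSq", 1 - alpha/k, 1);   /* Bonferroni-adjusted critical value */
A = N + q;                               /* quadratic: A*pi##2 - B*pi + C <= 0 */
B = q + 2*N*p;
C = N * p##2;
Lower = (B - sqrt(B##2 - 4*A#C)) / (2*A);
Upper = (B + sqrt(B##2 - 4*A#C)) / (2*A);
print p Lower Upper;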

Computing CIs for multinomial proportions

You can download SAS/IML functions that are based on May and Johnson's paper and macro. The original macro used SAS/IML version 6, so I have updated the program to use a more modern syntax. I wrote two "driver" functions:

  • CALL MultCIPrint(Category, Count, alpha, Method) prints a table that shows the point estimates and simultaneous CIs for the counts, where the arguments to the function are as follows:
    • Category is a vector of the k categories.
    • Count is a vector of the k counts.
    • alpha is the significance level for the (1-alpha) CIs. The default value is alpha=0.05.
    • Method is a number 1, 2, ..., 6 that indicates the method for computing confidence intervals. The previous list shows the method number for each of the six methods. The default value is Method=2, which is the Goodman (1965) Bonferroni-adjusted method.
  • MultCI(Count, alpha, Method) is a function that returns a three-column matrix that contains the point estimates, lower limit, and upper limit for the CIs. The arguments are the same as above, except that the function does not use the Category vector.

Let's demonstrate how to call these functions on the psychiatric data. The following program assumes that the function definitions are stored in a file called CONINT.sas; you might have to specify a complete path. The PROC IML step loads the definitions and the data, then calls the MultCIPrint routine and requests the Goodman method (method=2):

%include "conint.sas";   /* specify full path name to file */
proc iml;
load module=(MultCI MultCIPrint);
use Psych; read all var {"Category" "Count"}; close;
 
alpha = 0.05;
call MultCIPrint(Category, Count, alpha, 2); /* Goodman = 2 */
[Figure: Estimates and 95% simultaneous confidence intervals for multinomial proportions]

The table shows the point estimates and 95% simultaneous CIs for this sample of size 220. If the intervals are wider than you expect, remember the goal: for 95% of random samples of size 220 this method should produce a four-dimensional region that contains all four parameters.

You can visualize the width of these intervals by creating a graph. The easiest way is to write the results to a SAS data set. To get the results in a matrix, call the MultCI function, as follows:

CI = MultCI(Count, alpha, 2);  /*or simply CI = MultCI(Count) */
/* write estimates and CIs to data set */
Estimate = CI[,1];  Lower = CI[,2];  Upper = CI[,3];
create CIs var {"Category" "Estimate" "Lower" "Upper"};
append;
close;
quit;
 
ods graphics / width=600px height=240px;
title 'Simultaneous Confidence Intervals for Multinomial Proportions';
title2 'Method of Goodman (1965)';
proc sgplot data=CIs;
  scatter x=Estimate y=Category / xerrorlower=Lower xerrorupper=upper
          markerattrs=(Size=12  symbol=DiamondFilled)
          errorbarattrs=GraphOutlines(thickness=2);
  xaxis grid label="Proportion" values=(0.1 to 0.5 by 0.05);
  yaxis reverse display=(nolabel);
run;
[Figure: Estimates and simultaneous confidence intervals for multinomial proportions]

The graph shows intervals that are likely to enclose all four parameters simultaneously. The neurotic proportion in the population is probably in the range [0.33, 0.50] and at the same time the depressed proportion is in the range [0.16, 0.30] and so forth. Notice that the CIs are not symmetric about the point estimates; this is most noticeable for the smaller proportions such as the schizophrenic category.

Because the cell counts are all relatively large and because the number of categories is relatively small, Goodman's CIs should perform well.

I will mention that if you use the Fitzpatrick and Scott method (method=4), you will get different CIs from those reported in May and Johnson's paper. The original May and Johnson macro contained a bug that was corrected in a later version (personal communication with Warren May, 25FEB2016).

Conclusions

This article presents a set of SAS/IML functions that implement six methods for computing simultaneous confidence intervals for multinomial proportions. The functions are updated versions of the macro %CONINT, which was presented in May and Johnson (1997). You can use the MultCIPrint function to print a table of statistics and CIs, or you can use the MultCI function to retrieve that information into a SAS/IML matrix.

tags: Statistical Programming

The post Simultaneous confidence intervals for multinomial proportions appeared first on The DO Loop.

February 13, 2017
 

A common question on SAS discussion forums is how to repeat an analysis multiple times. Most programmers know that the most efficient way to analyze one model across many subsets of the data (perhaps each country or each state) is to sort the data and use a BY statement to repeat the analysis for each unique value of one or more categorical variables. But did you know that a BY-group analysis can sometimes be used to replace macro loops? This article shows how you can efficiently run hundreds or thousands of different regression models by restructuring the data.

One model: Many samples

As I've written before, BY-group analysis is also an efficient way to analyze simulated samples or bootstrapped samples. I like to tell people that you can choose "the slow way or the BY way" to analyze many samples.

In that phrase, "the slow way" refers to the act of writing a macro loop that calls a SAS procedure to analyze one sample. The statistics for all the samples are later aggregated, often by using PROC APPEND. As I (and others) have written, macro loops that call a procedure hundreds or thousands of times are relatively slow.

As a general rule, if you find yourself programming a macro loop that calls the same procedure many times, you should ask yourself whether the program can be restructured to take advantage of BY-group processing.


Many models: One sample

There is another application of BY-group processing, which can be incredibly useful when it is applicable. Suppose that you have wide data with many variables: Y, X1, X2, ..., X1000. Suppose further that you want to compute the 1000 single-variable regression models of the form Y=Xi, where i = 1 to 1000.

One way to run 1000 regressions would be to write a macro that contains a %DO loop that calls PROC REG 1000 times. The basic form of the macro would look like this:

%macro RunReg(DSName, NumVars);
...
%do i = 1 %to &NumVars;                    /* repeat for each x&i */
   proc reg data=&DSName noprint
            outest=PE(rename=(x&i=Value)); /* save parameter estimates */
   model Y = x&i;                          /* model Y = x_i */
   quit;
 
   /* ...then accumulate statistics... */
%end;
%mend;

The OUTEST= option saves the parameter estimates in a data set. You can aggregate the statistics by using PROC APPEND or the DATA step.
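For example, the accumulation step inside the macro loop might look like the following sketch, where AllPE is a hypothetical data set that collects the estimates across iterations (the FORCE option tolerates differences in the variable lists):

proc append base=AllPE data=PE force;   /* concatenate estimates from this iteration */
run;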

If you use a macro loop to do this computation, it will take a long time for all the reasons stated in the article "The slow way or the BY way." Fortunately, there is a more efficient alternative.

The BY way for many models

An alternative way to analyze those 1000 regression models is to transpose the data to long form and use a BY-group analysis. Whereas the macro loop might take a few minutes to run, the BY-group method might complete in less than a second. You can download a test program and compare the time required for each method by using the link at the end of this article.

To run a BY-group analysis:

  1. Transpose the data from wide to long form. As part of this process, you need to create a variable (the BY-group variable) that will be unique for each model.
  2. Sort the data by the BY-group variable.
  3. Run the SAS procedure, which uses the BY statement to specify each model.

1. Transpose the data

In the following code, the explanatory variables are read into an array X. The name of each variable is stored by using the VNAME function, which returns the name of the variable that is in the ith element of the array X. If the original data had N observations and p explanatory variables, the LONG data set contains Np observations.

/* 1. transpose from wide (Y, X1 ,...,X100) to long (varNum VarName Y Value) */
data Long;
set Wide;                       /* <== specify data set name HERE         */
array x [*] x1-x&nCont;         /* <== specify explanatory variables HERE */
do varNum = 1 to dim(x);
   VarName = vname(x[varNum]);  /* variable name in char var */
   Value = x[varNum];           /* value for each variable for each obs */
   output;
end;
drop x:;
run;

2. Sort the data

In order to perform a BY-group analysis in SAS, sort the data by the BY-group variable. You can use the VARNUM variable if you want to preserve the order of the variables in the wide data. Or you can sort by the name of the variable, as done in the following call to PROC SORT:

/* 2. Sort by BY-group variable */
proc sort data=Long;  by VarName;  run;

3. Run the analyses

You can now call a SAS procedure one time to compute all regression models:

/* 3. Call PROC REG and use BY statement to compute all regressions */
proc reg data=Long noprint outest=PE;
by VarName;
model Y = Value;
quit;
 
/* Look at the results */
proc print data=PE(obs=5);
var VarName Intercept Value;
run;

The PE data set contains the parameter estimates for every single-variable regression of Y onto Xi. The table shows the parameter estimates for the first few models. Notice that the models are presented in the order of the BY-group variable, which for this example is the alphabetical order of the name of the explanatory variables.

Conclusions

You can download the complete SAS program that generates example data and runs many regressions. The program computes the regression estimates two ways: by using a macro loop (the SLOW way) and by transforming the data to long form and using BY-group analysis (the BY way).

This technique is applicable when the models all have a similar form. In this example, the models were of the form Y=Xi, but a similar result would work for GLM models such as Y=A|Xi, where A is a fixed classification variable. Of course, you could also use generalized linear models such as logistic regression.
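For example, if a classification variable A is carried along when you create the long data (A is a hypothetical variable here), a sketch of the BY-group analysis for the models Y=A|Xi is:

proc glm data=Long noprint outstat=Stats;
   by VarName;
   class A;
   model Y = A|Value;    /* main effects and interaction for each X_i */
quit;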

Can you think of other ways to use this trick? Leave a comment.

tags: Data Analysis, Getting Started, Statistical Programming

The post An easy way to run thousands of regressions in SAS appeared first on The DO Loop.

February 8, 2017
 

On discussion forums, I often see questions that ask how to Winsorize variables in SAS. For example, here are some typical questions from the SAS Support Community:

  • I want an efficient way of replacing (upper) extreme values with (95th) percentile. I have a data set with around 600 variables and want to get rid of extreme values of all 600 variables with 95th percentile.
  • I have several (hundreds of) variables that I need to “Winsorize” at the 95% and 5%. I want all the observations with values greater 95th percentile to take the value of the 95th percentile, and all observations with values less than the 5th percentile to take the value of the 5th percentile.

It is clear from the questions that the programmer wants to modify the extreme values of dozens or hundreds of variables. As we will soon learn, neither of these requests satisfies the standard definition of Winsorization. What is Winsorization of data? What are the pitfalls, and what are alternative methods?


What is Winsorization?

The process of replacing a specified number of extreme values with a smaller data value has become known as Winsorization or as Winsorizing the data. Let's start by defining Winsorization.

Winsorization began as a way to "robustify" the sample mean, which is sensitive to extreme values. To obtain the Winsorized mean, you sort the data and replace the smallest k values by the (k+1)st smallest value. You do the same for the largest values, replacing the k largest values with the (k+1)st largest value. The mean of this new set of numbers is called the Winsorized mean. If the data are from a symmetric population, the Winsorized mean is a robust unbiased estimate of the population mean.
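The following SAS/IML statements sketch the computation for k=5. The sketch assumes a column vector x of nonmissing values, whereas a robust implementation must also handle missing values:

proc iml;
call randseed(1);
x = randfun(100, "Normal", 50, 8);   /* example data */
k = 5;
y = x;
call sort(y, 1);                     /* ascending order */
n = nrow(y);
y[1:k] = y[k+1];                     /* replace k smallest with (k+1)st smallest */
y[n-k+1:n] = y[n-k];                 /* replace k largest with (k+1)st largest   */
winMean = mean(y);                   /* the Winsorized mean */
print winMean;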

The graph to the right provides a visual comparison. The top graph shows the distribution of the original data set. The bottom graph shows the distribution of Winsorized data for which the five smallest and five largest values have been modified. The extreme values were not deleted but were replaced by the sixth smallest or largest data value.

I consulted the Encyclopedia of Statistical Sciences (Kotz et al. (Eds), 2nd Ed, 2006), which has an article "Trimming and Winsorization" by David Ruppert (Vol. 14, p. 8765). According to the article:

  • Winsorization is symmetric: Some people want to modify only the large data values. However, Winsorization is a symmetric process that replaces the k smallest and the k largest data values.
  • Winsorization is based on counts: Some people want to modify values based on quantiles, such as the 5th and 95th percentiles. However, using quantiles might not lead to a symmetric process. Let k1 be the number of values less than the 5th percentile and let k2 be the number of values greater than the 95th percentile. If the data contain repeated values, then k1 might not equal k2, which means that you are potentially changing more values in one tail than in the other.

As shown by the quotes at the top of this article, posts on discussion forums sometimes muddle the definition of Winsorization. If you modify the data in an unsymmetric fashion, you will produce biased statistics.

Winsorization: The good

Why do some people want to Winsorize their data? There are a few reasons:

  • Classical statistics such as the mean and standard deviation are sensitive to extreme values. The purpose of Winsorization is to "robustify" classical statistics by reducing the impact of extreme observations.
  • Winsorization is sometimes used in the automated processing of hundreds or thousands of variables when it is impossible for a human to inspect each and every variable.
  • If you compare a Winsorized statistic with its classical counterpart, you can identify variables that might contain contaminated data or are long-tailed and require special handling in models.

Winsorization: The bad

There is no built-in procedure in SAS that Winsorizes variables, but there are some user-defined SAS macros on the internet that claim to Winsorize variables. BE CAREFUL! Some of these macros do not correctly handle missing values. Others use percentiles to determine the extreme values that are modified. If you must Winsorize, I have written a SAS/IML function that Winsorizes data and correctly handles missing values.

As an alternative to Winsorizing your data, SAS software provides many modern robust statistical methods that have advantages over a simple technique like Winsorization, such as the robust regression methods in the ROBUSTREG procedure and the quantile regression methods in the QUANTREG procedure.

Winsorization: The ugly

If the data contain extreme values, then classical statistics are influenced by those values. However, modifying the data is a draconian measure. Recently I read an article by John Tukey, one of the early investigators of robust estimation. In the article "A survey of sampling from contaminated distributions" (1960), Tukey says (p. 457) that when statisticians encounter a few extreme values in data,

we are likely to think of them as 'strays' [or] 'wild shots' ... and to focus our attention on how normally distributed the rest of the distribution appears to be. One who does this commits two oversights, forgetting Winsor's principle that 'all distributions are normal in the middle,' and forgetting that the distribution relevant to statistical practice is that of the values actually provided and not of the values which ought to have been provided.

A little later in the essay (p. 458), he says

Sets of observations which have been de-tailed by over-vigorous use of a rule for rejecting outliers are inappropriate, since they are not samples.

I love this second quote. All of the nice statistical formulas that are used to make inferences (such as standard errors and confidence intervals) are based on the assumption that the data are a random sample that contains all of the observed values, even extreme values. The tails of a distribution are extremely important, and indiscriminately modifying large and small values invalidates many of the statistical analyses that we take for granted.

Summary

Should you Winsorize data? Tukey argues that indiscriminately modifying data is "inappropriate." In SAS, you can get the Winsorized mean directly from PROC UNIVARIATE. SAS also provides alternative robust methods such as the ones in the ROBUSTREG and QUANTREG procedures.
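For example, the following sketch computes the Winsorized mean that results from modifying five observations in each tail; Have and x are placeholder names for your data set and variable:

proc univariate data=Have winsorized=5;   /* k=5 observations in each tail */
   var x;
run;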

If you decide to use Winsorization to modify your data, remember that the standard definition calls for the symmetric replacement of the k smallest (largest) values of a variable with the (k+1)st smallest (largest). If you download a program from the internet, be aware that some programs use quantiles and others do not handle missing values correctly.

What are your thoughts about Winsorizing data? Share them in the comments.

tags: Statistical Programming, Statistical Thinking

The post Winsorization: The good, the bad, and the ugly appeared first on The DO Loop.

February 1, 2017
 

In a previous article, I showed how to simulate data for a linear regression model with an arbitrary number of continuous explanatory variables. To keep the discussion simple, I simulated a single sample with N observations and p variables. However, to use Monte Carlo methods to approximate the sampling distribution of statistics, you need to simulate many samples from the same regression model.

This article shows how to simulate many samples efficiently. Efficient simulation is the main emphasis of my book Simulating Data with SAS. For a detailed discussion about simulating data from regression models, see chapters 11 and 12.

The SAS DATA step in my previous post contains four steps. To simulate multiple samples, put a DO loop around the steps that generate the error term and the response variable for each observation in the model. The following program modifies the previous program and creates a single data set that contains NumSamples (=100) samples. Each sample is identified by an ordinal variable named SampleID.

/* Simulate many samples from a  linear regression model */
%let N = 50;            /* N = sample size               */
%let nCont = 10;        /* p = number of continuous variables */
%let NumSamples = 100;  /* number of samples                  */
data SimReg(keep= SampleID i Y x:);
call streaminit(54321);
array x[&nCont];        /* explanatory variables are named x1-x&nCont */
 
/* 1. Specify model coefficients. You can hard-code values such as
array beta[0:&nCont] _temporary_ (-4 2 -1.33 1 -0.8 0.67 -0.57 0.5 -0.44 0.4 -0.36);
      or you can use a formula such as the following */
array beta[0:&nCont] _temporary_;
do j = 0 to &nCont;
   beta[j] = 4 * (-1)**(j+1) / (j+1);       /* formula for beta[j] */
end;
 
do i = 1 to &N;              /* for each observation in the sample */
   do j = 1 to &nCont;
      x[j] = rand("Normal"); /* 2. Simulate explanatory variables  */
   end;
 
   eta = beta[0];                       /* model = intercept term  */
   do j = 1 to &nCont;
      eta = eta + beta[j] * x[j];       /*     + sum(beta[j]*x[j]) */
   end;
 
   /* 5. simulate response for each sample */
   do SampleID = 1 to &NumSamples;      /* <== LOOP OVER SAMPLES   */
      epsilon = rand("Normal", 0, 1.5); /* 3. Specify error distrib*/
      Y = eta + epsilon;                /* 4. Y = model + error    */
      output;
   end;
end;
run;

The efficient way to analyze simulated samples in SAS is to use BY-group processing. With BY-group processing you can analyze all samples with a single procedure call. The following statements sort the data by the SampleID variable and call PROC REG to analyze all samples. The NOPRINT option ensures that the procedure does not spew out thousands of tables and graphs. (For procedures that do not support the NOPRINT option, there are other ways to turn off ODS when analyzing simulated data.) The OUTEST= option saves the parameter estimates for all samples to a SAS data set.

proc sort data=SimReg;
   by SampleID i;
run;
 
proc reg data=SimReg outest=PE NOPRINT;
   by SampleID;
   model y = x:;
quit;

The PE data set contains NumSamples rows. Each row contains the p+1 parameter estimates (the intercept and the p regression coefficients) for the analysis of one simulated sample. The distribution of estimates is an approximation to the true (theoretical) sampling distribution of the statistics. The following image visualizes the joint distribution of the estimates of four regression coefficients. You can see that the distribution of the estimates appears to be multivariate normal and centered at the values of the population parameters.
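If you want to create a similar visualization yourself, a scatter plot matrix of a few columns of the PE data set is a quick sketch:

proc sgscatter data=PE;
   matrix Intercept x1 x2 x3 / diagonal=(histogram);  /* pairwise distributions of estimates */
run;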

You can download the SAS program that simulates the data, analyzes it, and produces the graph. The program is very efficient. For 10,000 random samples of size N=50 that contain p=10 variables, it takes about one second to run the Monte Carlo simulation and analyses.

tags: Simulation, Statistical Programming

The post Simulate many samples from a linear regression model appeared first on The DO Loop.