Statistical Programming

2月 212018

This article describes and implements a fast algorithm that estimates a median for very large samples. The traditional median estimate sorts a sample of size N and returns the middle value (when N is odd). The algorithm in this article uses Monte Carlo techniques to estimate the median much faster.

Of course, there's no such thing as a free lunch. The Monte Carlo estimate is an approximation. It is useful when a quick-and-dirty estimate is more useful than a more precise value that takes longer to compute. For example, if you want to find the median value of 10 billion credit card transactions, it might not matter whether the median $16.74 or $16.75. Either would be an adequate estimate for many business decisions.

Although the traditional sample median is commonly used, it is merely an estimate. All sample quantiles are estimates of a population quantity, and there are many ways to estimate quantiles in statistical software. To estimate these quantities faster, researchers have developed approximate methods such as the piecewise-parabolic algorithm in PROC MEANS or "probabilistic methods" such as the ones in this article.

Example data: A sample from a mixture distribution

Sample from mixture distribution showing sample median

Consider the following data set, which simulates 10 million observations from a mixture of an exponential and a normal distribution. You can compute the exact median for the distribution, which is 8.065. A histogram of the data and the population median are shown to the right.

Because the sample size is large, PROC MEANS takes a few seconds to compute the sample median by using the traditional algorithm which sorts the data (an O(N log N) operation) and then returns the middle value. The sample median is 8.065.

/* simulate from a mixture distribution. See */
%let N = 1e7;
data Have;
call streaminit(12345);
do i = 1 to &N;
   d = rand("Bernoulli", 0.6);
   if d = 0 then
      x = rand("Exponential", 0.5);
      x = rand("Normal", 10, 2);
keep x;
/* quantile estimate */
proc means data=Have N Median;
   var x;

A naive resampling algorithm

The basic idea of resampling methods is to extract a random subset of the data and compute statistics on the subset. Although the inferential statistics (standard errors and confidence interval) on the subset are not valid for the original data, the point estimates are typically close, especially when the original data and the subsample are large.

The simplest form of a resampling algorithm is to generate a random subsample of the data and compute the median of the subsample. The following example uses PROC SURVEYSELECT to resample (with replacement) from the data. The subsample is one tenth as large as the original sample. The subsequent call to PROC MEANS computes the median of the subsample, which is 8.063.

proc surveyselect data=Have seed=1234567 NOPRINT 
     method=urs samprate=0.1;                 /* 10% sampling with replacement */
title "Estimate from 10% of the data";
proc means data=SubSample N Median;
   freq NumberHits;
   var x;

It takes less time to extract a 10% subset and compute the median than it takes to compute the median of the full data. This naive resampling is simple to implement and often works well, as in this case. The estimate on the resampled data is close to the true population median, which is 8.065. (However, the standard error of the traditional estimate is smaller.)

You might worry, however, that by discarding 90% of the data you might discard too much information. In fact, there is a 90% chance that we discarded the middle data value! Thus it is wise to construct the subsample more carefully. The next section constructs a subsample that excludes data from the tails of the distribution and keeps data from the middle of the distribution.

A probabilistic algorithm for the median

This section presents a more sophisticated approximate algorithm for the median. My presentation of this Monte Carlo algorithm is based on the lecture notes of Prof. H-K Hon and a Web page for computer scientists. Unfortunately, neither source credits the original author of this algorithm. If you know the original reference, please post it in the comments.

Begin with a set S that contains N observations. The following algorithm returns an approximate median with high probability:

  1. Choose a random sample (with replacement) of size k = N3/4 from the data. Call the sample R.
  2. Choose lower and upper "sentinels," L and U. The sentinels are the lower and upper quantiles of R that correspond to the ranks k/2 ± sqrt(N). These sentinels define a symmetric interval about the median of R.
  3. (Optional) Verify that the interval [L, U] contains the median value of the original data.
  4. Let C be the subset of the original observations that are within the interval [L, U]. This is a small subset.
  5. Return the median of C.

The idea is to use the random sample only to find an interval that probably contains the median. The width of the interval depends on the size of the data. If N=10,000, the algorithm uses the 40th and 60th percentiles of the random subset R. For N=1E8, it uses the 49th and 51th percentiles.

The following SAS/IML program implements the Monte Carlo estimate for the median:

proc iml;
/* Choose a set R of k = N##(3/4) elements in x, chosen  at random with replacement. 
   Choose quantiles of R to form an interval [L,U].
   Return the median of data that are within [L,U]. */
start MedianRand(x);
   N = nrow(x);
   k = ceil(N##0.75);    /* smaller sample size */
   R = T(sample(x, k));  /* bootstrap sample of size k (with replacement) */
   call sort(R);         /* Sort this subset of data */
   L = R[ floor(k/2 - sqrt(N)) ]; /* lower sentinel */
   U = R[  ceil(k/2 + sqrt(N)) ]; /* upper sentinel */
   UTail = (x > L);       /* indicator for large values */
   LTail = (x < U);       /* indicator for small values */
   if sum(UTail) > N/2 & sum(LTail) > N/2 then do;
      C = x[loc(UTail & LTail)]; /* extract central portion of data */
      return ( median(C) );      
   else do;
      print "Median not between sentinels!";
      return (.);
/* run the algorithm on example data */
use Have; read all var {"x"}; close;
call randseed(456);
t1 = time();
   approxMed = MedianRand(x);  /* compute approximate median; time it */
tApproxMed = time()-t1;
t0 = time();
   med = median(x);            /* compute traditional estimate; time it */
tMedian = time()-t0;
results = (approxMed || med) // (tApproxMed|| tMedian);
print results[colname={"Approx" "Traditional"}
              rowname={"Estimate" "Time"} F=Best6.];

The output indicates that the Monte Carlo algorithm gives an estimate that is close to the traditional estimate, but produces it three times faster. If you repeat this experiment for 1E8 observations, the Monte Carlo algorithm computes an estimate in 2.4 seconds versus 11.4 seconds for the traditional estimate, which is almost five times faster. As the data size grows, the speed advantage of the Monte Carlo algorithm increases. See Prof. Hon's notes for more details.

In summary, this article shows how to implement a probabilistic algorithm for estimating the median for large data. The Monte Carlo algorithm uses a bootstrap subsample to estimate a symmetric interval that probably contains the true median, then uses the interval to strategically choose and analyze only part of the original data. For large data, this Monte Carlo estimate is much faster than the traditional estimate.

The post A Monte Carlo algorithm to estimate a median appeared first on The DO Loop.

2月 192018

Your statistical software probably provides a function that computes quantiles of common probability distributions such as the normal, exponential, and beta distributions. Because there are infinitely many probability distributions, you might encounter a distribution for which a built-in quantile function is not implemented. No problem! This article shows how to numerically compute the quantiles of any probability distribution from the definition of the cumulative distribution (CDF).

In SAS, the QUANTILE function computes the quantiles for about 25 distributions. This article shows how you can use numerical root-finding methods (and possibly numerical integration) in SAS/IML software to compute the quantile function for ANY continuous distribution. I have previously written about related topics and particular examples, such as the following:

The quantile is the root of an integral equation

Quantiles are the solutions to the equation CDF(x)-p=0, where p is a probability

Computing a quantile would make a good final exam question for an undergraduate class in numerical analysis. Although some distributions have an explicit CDF, many distributions are defined only by a probability density function (the PDF, f(x)) and numerical integration must be used to compute the cumulative distribution (the CDF, F(x)). A canonical example is the normal distribution. I've previously shown how to use numerical integration to compute a CDF from a PDF by using the definition F(x) = ∫ f(t) dt, where the lower limit of the integral is –∞ and the upper limit is x.

Whether the CDF is defined analytically or through numerical integration, the quantile for p is found implicitly as the solution to the equation F(x) = p, where p is a probability in the interval (0,1). This is illustrated by the graph at the right.

Equivalently, you can define G(x; p) = F(x) – p so that the quantile is the root of the equation G(x; p) = 0. For well-behaved densities that occur in practice, a numerical root is easily found because the CDF is monotonically increasing. (If you like pathological functions, see the Cantor staircase distribution.)

Example: Create a custom distribution

SAS/IML software provides the QUAD subroutine, which provides numerical integration, and the FROOT function, which solves for roots. Thus SAS/IML is an ideal computational environment for computing quantiles for custom distributions.

As an example, consider a distribution that is a mixture of an exponential and a normal distribution:
F(x) = 0.4 Fexp(x; 0.5) + 0.6 Φ(x; 10, 2),
where Fexp(x; 0.5) is the exponential distribution with scale parameter 0.5 and Φ(x; 10, 2) is the normal CDF with mean 20 and standard deviation 2. In this case, you do not need to use numerical integration to compute the CDF. You can compute the CDF as a linear combination of the exponential and normal CDFs, as shown in the following SAS/IML function:

/* program to numerically find quantiles for a custom distribution */
proc iml;
/* Define the cumulative distribution function here. */
start CustomCDF(x); 
   F = 0.4*cdf("Exponential", x, 0.5) +
       0.6*cdf("Normal", x, 10, 2);
   return F;

The previous section shows the graph of the CDF on the interval [0, 16]. The vertical and horizontal lines correspond to the first, second and third quartiles of the distribution. The quartiles are close to the values Q1 ≈ 0.5, Q2 ≈ 8, and Q3 ≈ 10.5. The next section shows how to compute the quantiles.

Compute quantiles for an arbitrary distribution

As long as you can define a function that evaluates the CDF, you can find quantiles. For unbounded distributions, it is usually helpful to plot the CDF so that you can visually estimate an interval that contains the quantile. (For bounded distributions, the support of the distribution contains all quantiles.) For the mixture distribution in the previous section, it is clear that the quantiles are in the interval [0, 16].

The following program finds arbitrary quantiles for whichever CDF is evaluated by the CustomCDF function. To find quantiles for a different function, you can modify the CustomCDF and change the interval on which to find the quantiles. You do not need to modify the RootFunc or CustomQuantile functions.

/* Express CDF(x)=p as the root for the function CDF(x)-p. */
start RootFunc(x) global(gProb);
   return CustomCDF(x) - gProb;  /* quantile for p is root of CDF(x)-p */
/* You need to provide an interval on which to search for the quantiles. */
start CustomQuantile(p, Interval) global(gProb);
   q = j(nrow(p), ncol(p), .);              /* allocate result matrix */
   do i = 1 to nrow(p)*ncol(p);             /* for each element of p... */
      gProb = p[i];                         /*    set global variable   */
      q[i] = froot("RootFunc", Interval);   /*    find root (quantile)  */ 
   return q;
/* Example: look for quartiles in interval [0, 16] */
probs = {0.25 0.5 0.75};         /* Q1, Q2, Q3 */
intvl = {0 16};                  /* interval on which to search for quantiles */
quartiles = CustomQuantile(probs, intvl);
print quartiles[colname={Q1 Q2 Q3}];
Quartiles for mixture of exponential and normal distributions

In summary, you can compute an arbitrary quantile of an arbitrary continuous distribution if you can (1) evaluate the CDF at any point and (2) numerically solve for the root of the equation CDF(x)-p for a probability value, p. Because the support of the distribution is arbitrary, the implementation requires that you provide an interval [a,b] that contains the quantile.

The computation should be robust and accurate for non-pathological distributions provided that the density is not tiny or zero at the value of the quantile. Although this example is illustrated in SAS, the same method will work in other software.

The post Compute the quantiles of any distribution appeared first on The DO Loop.

1月 242018

A popular way to use lists in the SAS/IML language is to pack together several related matrices into a single data structure that can be passed to a function. Imagine that you have written an algorithm that requires a dozen different parameters. Historically, you would have to pass those parameters to the function by explicitly listing each argument. If you use lists (introduced in SAS/IML 14.2, which is SAS 9.4M4), you can pack all parameters into one list and pass that list into the function module where the individual items can be extracted as needed.

Example: An algorithm that requires multiple parameters

To illustrate this list-passing technique, consider the algorithm for Newton's method, which is an iterative method for finding the roots of a nonlinear systems of equations. An implementation of Newton's method is included the SAS/IML documentation. Newton's method requires several arguments:

  • The name of a module that evaluates the function whose roots are sought.
  • The name of a module that evaluates the Jacobian matrix (the derivative of the function).
  • An initial guess for the solution.
  • The maximum number of iterations before the algorithm decides that the method is not converging.
  • A convergence criterion that determines the accuracy of the solution.

The example in the documentation (which was written in the 1980s) uses global variables and hard-codes the names of functions and the convergence criteria. The example is meant to demonstrate the basic ideas, but is not reusable. You can make the example more flexible and robust by passing parameters to the module as shown in the next section.

Create a list of parameters

Suppose that you define the modules F and DF to evaluate a multivariate function and the matrix of partial derivatives (the Jacobian matrix), respectively. The following modules evaluate the same mathematical function and derivatives that are used in the SAS/IML documentation:

proc iml;
/* Newton's method to solve for roots of the system of equations
  F(x1,x2) = [ x1+x2-x1*x2+2 ] = [ 0 ]
             [ x1*exp(-x2)-1 ]   [ 0 ]
/* 1. define a function F that evaluates a multivariate function */
start F(x);
   x1 = x[1];   x2 = x[2];                 /* extract the values */
   f = (x1+x2-x1*x2+2) // (x1*exp(-x2)-1); /* evaluate the function */
   return ( f );
finish F;
/* 2. define a function DF that evaluates the Jacobian of F */
start DF(x);
   x1 = x[1];   x2 = x[2];                 /* extract the values */
   J = {. ., . .};
   J[1,1] = 1-x2;       J[1,2] = 1-x1;
   J[2,1] = exp(-x2);   J[2,2] =  -x1*exp(-x2);
   return ( J );
finish DF;

As shown in a previous article, you can create a named list that contains the parameters for Newton's algorithm. In SAS/IML 14.3, you can create the list by specifying a sequence of name-value pairs:

/* 3. list of NAME  = VALUE pairs  (syntax uses SAS/IML 14.3) */
args = [#"Func"     = "F",         /* name of module that evaluates the function */
        #"Deriv"    = "DF",        /* name of module that evaluates the deivative */ 
        #"x0"       = {0.1, -2.0}, /* initial guess for solution */
        #"MaxIters" = 100,         /* maximum iterations */
        #"ConvCrit" = 1e-6 ];      /* convergence criterion */

In SAS/IML 14.3, you can use the list item operator ($) to specify an item in a list. You can also get the name of the function and use CALL EXECUTE to evaluate the function at the initial guess, as follows:

x = args$"x0";                           /* initial guess */ 
EvalFunc = "y=" + args$"Func" + "(x);";  /* string for "y=F(x);" */
call execute(EvalFunc);                  /* evaluates function at x */
print EvalFunc, y;

This is a useful trick for evaluating a function by name. The next section uses this trick inside a module that implements Newton's method.

Pass a list of parameters to and from a SAS/IML module

You can now implement Newton's method by defining a SAS/IML module that takes one argument, which is a list. Inside the module, the parameters are extracted from the list and used to evaluate the functions F and DF and to control the iteration and convergence of Newton's method, as follows:

/* 4. Newton's method for solving the roots for F(x)=0. The argument to the module is a list. */
start Newton( L );
   x = L$"x0";                              /* initial guess */
   EvalFunc = "y=" + L$"Func" + "(x);";     /* string for "y=F(x);" */
   EvalDeriv = "D=" + L$"Deriv" + "(x);";   /* string for "D=DF(x);" */
   call execute(EvalFunc);          /* evaluate F at starting values */
   converged = (max(abs(y)) < L$"ConvCrit");              /* 0 or 1 */
   do iter = 1 to L$"Maxiters"      /* iterate until max iterations */
             while( ^converged );   /*      or until convergence */
      call execute(EvalDeriv);      /* evaluate derivatives D=DF(x) */
      delta = -solve(D,y);          /* solve for correction vector */
      x = x+delta;                  /* update x with new approximation */
      call execute(EvalFunc);       /* evaluate the function y=F(x)*/
      converged = (max(abs(y)) < L$"ConvCrit");
   /* you can return x or  a list that summarizes the results */
   result = [#"x"          = x,           /* approximate root (if converged) */
             #"iterations" = iter,        /* total number of iterations */
             #"status"      = converged]; /* 1 if root found; 0 otherwise */
   return ( result );
finish Newton;
/* 5. call Newton's method and pass in a list of parameters */
soln = Newton( args );              
xSoln = soln$"x";                   /* get root */
if soln$"status" then               /* check convergence statust */
   print "Newton's method converged in " (soln$"iterations") " iterations", xSoln;
   print "Newton's method did not converge in " (soln$"iterations") " iterations";

The Newton module returns a named list that has three items. The item named "status" is a binary value that indicates whether the method converged. If the method converged, the item named "x" contains the approximate root. The item named "iterations" contains the number of iterations of the algorithm.

In summary, you can use lists to create a structure that contains many parameters. You can pass the list to a SAS/IML module, which can extract and use the parameters. This enables you to simplify the calling syntax for user-defined modules that require many parameters.

For additional examples of using lists to pass heterogeneous data to SAS/IML modules, see "More Than Matrices: SAS/IML Software Supports New Data Structures" (Wicklin, 2017, pp. 11–12) and the SAS/IML documentation for lists.

The post Use lists to pass parameters to SAS/IML functions appeared first on The DO Loop.

1月 222018

SAS/IML 14.3 (SAS 9.4M5) introduced a new syntax for creating lists and for assigning and extracting item in a list. Lists (introduced in SAS/IML 14.2) are data structures that are convenient for holding heterogeneous data. A single list can hold character matrices, numeric matrices, scalar values, and other lists, as discussed in a previous article about how to use lists in SAS/IML.

The list creation operator

You can use square brackets to create a list. The elements of the list are separated by commas. For example, the following syntax creates a list that contains three elements. The list represents a hypothetical patient in a cholesterol-lowering study.

proc iml;
/* 1. New list creation syntax (L = [val1, val2,...]) */
/*                 Cholesterol at
       Name  Age   0mo  3mo  6mo  12mo */
P1 = ["Bob",  36, {182, 170, 170, 162}]; /* Patient 1 */

The name of the list is P1. The first element of the list is a string, the second is a scalar number, and the third is a numeric vector. In this case, the elements are specified by using literal values, but you can also use previously defined variables. The following statement is equivalent but defines the list by using existing variables:

Name = "Bob"; Age = 36; Chol = {182, 170, 170, 162};
P1 = [Name, Age, Chol];   /* define list from copies of existing variables */

As mentioned earlier, lists can contain other lists. For example, if you have multiple patients in a study, you can create a list of each patient's data, then create a list that contains all the patients' data, as follows:

P2 = ["Fred", 52, {175, 165, 155}     ]; /* Patient 2 */
P3 = ["Tom",  45, {160, 145,   ., 139}]; /* Patient 3 */
Patients = [P1, P2, P3];                 /* a list of patients */

Assign and extract list items

You can use the list item operator ($) to specify an item in a list. For each patient, the age is stored as the second item in a list. If you want the age of Bob (the first patient), you can use either of the following statements:

/* 2. New list item syntax ($) */
BobAge = P1$2;         /* get 2nd item from P1 list */
BobAge = Patients$1$2; /* get 1st item of Patients, then 2nd item from that list */

The first statement is straightforward: P1$2 means "get the second item from the P1 list." The second statement is parsed left to right. The syntax Patient$1 means "get the first item from the Patients list," which is equivalent to P1. Thus Patient$1$2 gets Bob's age.

The preceding example uses literal values to specify the position of an item in a list, but you can also use a variable. This makes it possible to extract items in a loop. For example, the following statements loop over all patients, extract the name and cholesterol values of the patient, and compute each patient's average cholesterol value during the study:

N = ListLen(Patients);      /* number of patients */
Name = j(N,1,"     ");      /* allocate vector for names */
AvgChol = j(N,1,.);         /* allocate vector for results */
do i = 1 to N;
   Name[i] = Patients$i$1;  /* get name of i_th patient */
   Chol = Patients$i$3;     /* get sequence of cholesterol values */
   AvgChol[i] = mean(Chol); /* average value */
print AvgChol[rowname=Name];

You can use the list item operator on the left side of an assignment operator to change an item in a list. For example, suppose you discover that Bob's age was typed wrong: Bob is really 63, not 36. You can update Bob's data as follows:

P1$2 = 63;         /* update 2nd item in P1 list */
Patients$1 = P1;   /* update entire patient list */

Alternatively, you could use the syntax Patients$1$2 = 63 to update the data in place.

Extract sublists

To subset a list, use square brackets ([]) and specify the indices of the items. For example, the following statement extracts the second and third patients into a new list:

/* 3. Extract sublist SL = L[ {i1, i2,...} ] */
SubList = Patients[ {2 3} ]; /* Fred and Tom */

The sublist operator ([]) ALWAYS returns a list, even if you specify only one item! Thus Patients[1] is a LIST that contains one item. If you want the item itself, use Patients$1.

Named lists

In the previous example, the items in the list P1 are a patient's name, age, and cholesterol readings. If you want to extract Bob's age, you can write P1$2, but someone unfamiliar with the order of the list items would have no idea what that item represents. Thus it is helpful to define a named list, which is also called an associative array. When you specify a named list, you specify the items by using name-value pairs, as follows:

P = [#"Name" = "Bob",                        /* P$1 or P$"Name" */
     #"Age"  = 63,                           /* P$2 or P$"Age" */
     #"Cholesterol" = {182, 170, 170, 162}]; /* P$3 or P$"Cholesterol" */

You can use the names to refer to the items in a list. For example, the following statements extract the patient's name and cholesterol readings by using the list item operator:

Name = P$"Name";
Chol = P$"Cholesterol";       /* get Bob's measurements */
print Chol[Label=Name];

You can also use names in the sublist operator to extract a sublist:

L = P[ {"Age" "Cholesterol"} ];  /* sublist that contains two items */


In summary, SAS/IML 14.3 contains new syntax for creating a list and for extracting items and sublists. This syntax makes it easier to use lists and to read and write SAS/IML programs that use lists.

The post Create lists by using a natural syntax in SAS/IML appeared first on The DO Loop.

1月 102018

Last week I wrote about the 10 most popular articles from The DO Loop in 2017. My most popular articles tend to be about elementary statistics or SAS programming tips. Less popular are the articles about advanced statistical and programming techniques. However, these technical articles fill an important niche. Not everyone needs to know how to interpret a diffogram that visually compares the differences between means between groups, but those who do often send me a note of thanks, which motivates me to research and write similar articles.

Statistical programmers might want to review the following technical articles from 2017. This ain't summertime, and the reading ain't easy, but I think these articles are worth the effort. I've broken them into three categories. Enjoy!

Statistical Concepts

Visualization of regression that uses a weight variable in SAS

Statistical Data Analysis and Visualization

Visualize multivariate regression model by slicing the continuous variables. Graph created by using the EFFECTPLOT SLICEFIT statement in SAS.

Advanced SAS Programming Techniques

If you are searching for a way to enhance your SAS knowledge in this New Year, I think these 10 articles from The DO Loop are worth a second look. Was there an article from 2017 that you found particularly useful? Leave a comment.

The post 10 posts from 2017 that deserve a second look appeared first on The DO Loop.

11月 272017

When you run an optimization, it is often not clear how to provide the optimization algorithm with an initial guess for the parameters. A good guess converges quickly to the optimal solution whereas a bad guess might diverge or require many iterations to converge. Many people use a default value (such as 1) for all initial guesses. I have previously written about other ways to obtain an initial guess for an optimization.

In the special case of maximum likelihood estimation, you use a trick to provide a good guess to the optimization algorithm. For maximum likelihood estimation, the objective function is the log-likelihood function of a distribution. Consequently, you can use the method of moments to provide the initial guess for the parameters, which often results in fast convergence to the maximum likelihood estimates. The faster convergence can save you considerable time when you are analyzing thousands of data sets, such as during a simulation study.

The method of moments

Suppose you believe that some data can be modeled by a certain distribution such as a lognormal, Weibull, or beta distribution. A common way to estimate the distribution parameters is to compute the maximum likelihood estimates (MLEs). These are the values of the parameters that are "most likely" to have generated the observed data. To compute the MLEs you first construct the log-likelihood function and then use numerical optimization to find the parameter values that maximize the log-likelihood function.

The method of moments is an alternative way to fit a model to data. For a k-parameter distribution, you write the equations that give the first k central moments (mean, variance, skewness, ...) of the distribution in terms of the parameters. You then replace the distribution's moments with the sample mean, variance, and so forth. You invert the equations to solve for the parameters in terms of the observed moments.

For example, suppose you believe that some data are distributed according to a two-parameter beta distribution. If you look up the beta distribution on Wikipedia, you will see that the mean and variance of a Beta(a, b) distribution can be expressed in terms of the two parameters as follows:

Mean and variance of the beta distribution

The first equation represents the expected (mean) value of a beta-distributed random variable X. The second equation represents the variance. If the sample mean of the data is m and the sample variance is v, you can "plug in" these values for the left-hand side of the equations and solve for the values of a and b that satisfy the equations. Solving the first equation for a yields a = b m / (1 - m). If you substitute that expression into the second equation and solve for b, you get b = m - 1 + (m/v)(1 - m)2. Those equations give the parameter estimates from the method of moments.

Maximum likelihood estimation: Using an arbitrary guess

The method of moments can lead to improved convergence when fitting a distribution. For comparison, first consider what happens if you use an arbitrary initial guess for the optimization, such as a = b = 1. The following statements use PROC NLMIXED to compute maximum likelihood estimates for the parameters of the beta distribution. (You can read about how to use PROC NLMIXED to construct MLEs. You can also compute MLEs by using SAS/IML.)

data Beta;
input x @@;
0.58 0.23 0.41 0.69 0.64 0.12 0.07 0.19 
0.47 0.05 0.76 0.18 0.28 0.06 0.34 0.52 
0.42 0.48 0.11 0.39 0.24 0.25 0.22 0.05 
0.23 0.49 0.45 0.29 0.18 0.33 0.09 0.16 
0.45 0.27 0.31 0.41 0.30 0.19 0.27 0.57 
/* Compute parameter estimates for the beta distribution */
proc nlmixed data=Beta;
   parms a=1 b=1;                /* use arbitrary initial guesses */
   bounds a b > 0;
   LL = logpdf("beta", x, a, b); /* log-likelihood function */
   model x ~ general(LL);
Solution to fitting a beta distribution in SAS

The MLEs are a = 1.8487 and b = 3.9552. The "Iteration History" table (not shown) shows that 8 iterations of a quasi-Newton algorithm were required to converge to the MLEs, and the objective function (the LOGPDF function) was called 27 times during the optimization.

The following graph shows a histogram of the data overlaid with a Beta(a=1.85, b=3.96) density curve. The curve fits the data well.

Histogram of data overlaid with a beta density curve, fitted by maximum likelihood estimation

Maximum likelihood estimation: Using method-of-moments for guesses

If you use the method-of-moments to provide an initial guess on the PARMS statement, the quasi-Newton algorithm might converge in fewer iterations. The following statements use PROC MEANS to compute the sample mean and variance, the use the DATA step to compute the method-of-moments estimates, as follows:

/* output mean and variance */
proc means data=Beta noprint;
   var x;
   output out=Moments mean=m var=v;  /* save sample moments */
/* use method of moments to estimate parameters for the beta distribution */
data ParamEst;
   retain a b;               /* set order of output variables */
   set Moments;
   b = m*(1-m)**2/v + m - 1; /* estimate parameters by using sample moments */
   a = b*m / (1-m);
   keep a b;

The method of moments estimates are a = 1.75 and b = 3.75, which are close to the MLE values. The DATA= option in the PARMS statement in PROC NLMIXED enables you to specify a data set that contains starting values for the parameters. The following call to PROC NLMIXED is exactly the same as the previous call except that the procedure uses the initial parameter values from the ParamEst data set:

proc nlmixed data=Beta;
   parms / data=ParamEst;        /* use initial guesses from method of moments */
   bounds a b > 0;
   LL = logpdf("beta", x, a, b); /* log-likelihood function */
   model x ~ general(LL);
Iteration history for MLE for beta distribution. Initial guesses are maximum likelihood estimates.

The MLEs from this call (not shown) are exactly the same as before. However, the "Iteration History" table indicates that quasi-Newton algorithm converged in 5 iterations and required only 14 calls the objective function. This compares with 8 iterations and 27 calls for the previous computation. Thus the method of moments resulted in about half as many functional evaluations. In my experience this example is typical: Choosing initial parameters by using the method of moments often reduces the computational time that is required to obtain the MLEs.

It is worth mentioning that the PARMS option in PROC NLMIXED supports two nice features:

  • The DATA= option accepts both wide and long data sets. The example here is wide: there are k variables that supply the values of the k parameters. You can also specify a two-variable data set with the names "Parameter" and "Estimate."
  • The BYDATA option enables you to specify initial guesses for each BY group. In simulation studies, you might use BY-group processing to compute the MLEs for thousands of samples. The BYDATA option is essential to improving the efficiency of a large simulation study. It can also help ensure that the optimization algorithm converges when the parameters for the samples range over a large set of values.

Summary and cautions

In summary, this article shows how to use the method of moments to provide a guess for optimizing a log-likelihood function, which often speeds up the time required to obtain the MLEs. The method requires that you can write the equations that relate the central moments of a distribution (mean, variance, ...) to the parameters for the distribution. (For a distribution that has k parameters, use the equations for the first k moments.) You then "plug in" the sample moments and algebraically invert the equations to obtain the parameter values in terms of the sample moments. This is not always possible analytically; you might need to use numerical methods for part of the computation.

A word of caution. For small samples, the method of moments might produce estimates that do not make sense. For example, the method might produce a negative estimate for a parameter that must be positive. In these cases, the method might be useless, although you could attempt to project the invalid parameters into the feasible region.

The post The method of moments: A smart way to choose initial parameters for MLE appeared first on The DO Loop.

11月 222017

A statistical programmer read my article about the beta-binomial distribution and wanted to know how to compute the cumulative distribution (CDF) and the quantile function for this distribution. In general, if you know the PDF for a discrete distribution, you can also compute the CDF and quantile functions. This article shows how.

Continuous versus discrete distributions

Recall the four essential functions for any probability distribution: the probability density function (PDF), the cumulative distribution function (CDF), the quantile function, and the random-variate generator. SAS software provides the PDF, CDF, QUANTILE, and RAND function, which support for about 25 common families of distributions. However, there are infinitely many probability distributions. What can you do if you need a distribution that is not natively supported in SAS? The answer is that you can use tools in SAS (especially in SAS/IML software) to implement these functions yourself.

I have previously shown how to use numerical integration and root-finding algorithms to compute the CDF and quantile function for continuous distributions such as the folded normal distribution and the generalized Gaussian distribution. For discrete distributions, you can use a summation to obtain the CDF from the PDF. The CDF at X=x is the sum of the PDF values for all values of X that are less than or equal to x. The discrete CDF is a step function, so it does not have an inverse function. Given a probability, p, the quantile for p is defined as the smallest value of x for which CDF(x) ≥ p.

To simplify the discussion, assume that the random variable X takes values on the integers 0, 1, 2, .... Many discrete probability distributions in statistics satisfy this condition, including the binomial, Poisson, geometric, beta-binomial, and more.

Compute a CDF from a PDF

I will use SAS/IML software to implement the CDF and quantile functions. The following SAS/IML modules define the PDF function and the CDF function. If you pass in a scalar value for x, the CDF function computes the PDF from X=0 to X=x and sums up the individual probabilities. When x is a vector, the implementation is slightly more complicated, but the idea is the same.

proc iml;
/* PMF function for the beta-binomial distribution. See */
start PDFBetaBinom(x, nTrials, a, b);
   logPMF = lcomb(nTrials, x) + logbeta(x + a, nTrials - x + b) - logbeta(a, b);
   return ( exp(logPMF) );
/* CDF function for the beta-binomial distribution */
start CDFBetaBinom(x, nTrials, a, b);
   y = 0:max(x);                               /* all values of X less than max(x) */
   PMF = PDFBetaBinom(y, NTrials, a, b);       /* compute PMF(t) for t=0..max(x) */
   cumPDF = cusum(PMF);                        /* cdf(T) = sum( PMF(t), t=0..T ) */
   /* The input x can be a vector of values in any order.
      Use the LOC function to assign CDF(x) for each value of x */
   CDF = j(nrow(x), ncol(x), .);               /* allocate result (same size as x) */
   u = unique(x);                              
   do i = 1 to ncol(u);                        /* for each unique value of x... */
      idx = loc(x=u[i]);                       /*    find the indices */
      CDF[idx] = cumPDF[1+u[i]];               /*    assign the CDF   */
   return CDF;
/* test the CDF function */
a = 6; b = 4; nTrials = 10;
c = CDFBetaBinom(t, nTrials, a, b);

Notice the computational trick that is used to compute the CDF for a vector of values. If you want to compute the CDF at two or more values (say X=5 and X=7), you can compute the PDF for X=0, 1, ... M where M is the maximum value that you need (such as 7). Along the way, you end up computing the CDF for all smaller values.

You can write the CDF values to a SAS data set and graph them to visualize the CDF of the beta-binomial distribution, as shown below:

Beta-binomial cumulative distribution

Compute quantiles from the CDF

Given a probability value p, the quantile for p is smallest value of x for which CDF(x) ≥ p. Sometimes you can find a formula that enables you to find the quantile directly, but, in general, the definition of a quantile requires that you compute the CDF for X=0, 1, 2,... and so forth until the CDF equals or exceeds p.

The support of the beta-binomial distribution is the set {0, 1, ..., nTrials}. The following function first computes the CDF function for all values. It then looks at each value of p and returns the corresponding value of X that satisfies the quantile definition.

start QuantileBetaBinom(_p, nTrials, a, b);
   p = colvec(_p);                         /* ensure input is column vector */
   y = 0:nTrials;                          /* all possible values of X   */
   CDF = CDFBetaBinom(y, NTrials, a, b);   /* all possible values of CDF */
   q = j(nrow(p), 1, .);                   /* allocate result vector */
   do i = 1 to nrow(p);
      idx = loc( p[i] <= CDF );            /* find all x such that p <= CDF(x) */ 
      q[i] = idx[1] - 1;                   /* smallest index. Subtract 1 from index to get x */
   return ( q );
p = {0.2, 0.5, 0.8};
q = QuantileBetaBinom(p, nTrials, a, b);
print p q;

The output shows the quantile values for p = {0.2, 0.5, 0.8}. You can find the quantiles visually by using the graph of the CDF function, shown in the previous section. For each value of p on the vertical axis, draw a horizontal line until the line hits a bar. The x value of the bar is the quantile for p.

Notice that the QuantileBetaBinom function uses the fact that the beta-binomial function has finite support. The function computes the CDF at the values 0, 1, ..., nTrials. From those values you can "look up" the quantile for any p. If you have distribution with infinite support (such as the geometric distribution) or very large finite support (such as nTrials=1E9), it is more efficient to apply an iterative process such as bisection to find the quantiles.


In summary, you can compute the CDF and quantile functions for a discrete distribution directly from the PDF. The process was illustrated by using the beta-binomial distribution. The CDF at X=x is the sum of the PDF evaluated for all values less than x. The quantile for p is the smallest value of x for which CDF(x) ≥ p.

The post Compute the CDF and quantiles of discrete distributions appeared first on The DO Loop.

11月 152017

Did you know that a SAS/IML function can recover from a run-time error? You can specify how to handle run-time errors by using a programming technique that is similar to the modern "try-catch" technique, although the SAS/IML technique is an older implementation.

Preventing errors versus handling errors

In general, SAS/IML programmers should try to detect potential problems and prevent errors from occurring. For example, before you compute the square root of a number, you should test whether the number is greater than or equal to zero. If not, you can handle the bad input. By testing the input value before you call the SQRT function, you prevent a run-time error.

However, sometimes you don't know whether a computation will fail until you attempt it. Other times, the test to determine whether an error will occur is very expensive, and it is cheaper to attempt the computation and handle the error if it occurs.

A canonical example is computing the inverse of a square matrix. It is difficult to test whether a given matrix is numerically singular. You can compute the rank of the matrix or the condition number of the matrix, but both computations are expensive because they rely on the SVD or eigenvalue decomposition. Cheaper methods include computing a determinant or using Gaussian elimination, but these methods can be numerically unstable and are not good indicators of whether a matrix is numerically singular.

Run-time error in SAS/IML modules

To understand how to handle run-time errors in SAS/IML modules, recall the following facts about how to use the execution queue in SAS/IML modules.

In summary, when a run-time error occurs in a module, the module pauses and waits for additional commands. However, if there are statements in the execution queue, those statements will be executed.

You can use these facts to handle errors. At the top of the module, use the PUSH statement to add error-handling statements into the execution queue. If an error occurs, the module pauses, sees that there are statements in the queue, and executes those statements. If one of the statements contains a RESUME statement, the module resumes execution.

Push, pause, execute, and resume

Let's apply these ideas to computing the inverse of a square matrix. The following SAS/IML function attempts to compute the inverse of the input matrix. If the matrix is singular, the computation fails and the module handles that error. The main steps of the modules are as follows:

  1. Define the statements that will be executed in the event of an error. The last executable statement should be the RESUME statement. (Technical point: The RESUME statement is only valid when the module is paused. Surround the statements with IF-THEN logic to prevent the error-handling statements from running when the module exits normally.)
  2. Push these statements into the execution queue.
  3. If an error occurs, the statements in the queue will be executed. The RESUME statement clears the error and resumes execution of the module. (If no error occurs, the queue is flushed after the RETURN statement.)
proc iml;
/* If the matrix is not invertible, return a missing value. 
   Otherwise, return the inverse.
start InvEx(A); 
   errFlag = 1;           /* set flag. Assume we will get an error :-) */   
   on_error = "if errFlag then do;  AInv = .;  resume; end;";
   call push(on_error);   /* PUSH code that will be executed if an error occurs */
   AInv = inv(A);         /* if error, AInv set to missing and function resumes */
   errFlag = 0;           /* remove flag for normal exit from module */
   return ( AInv );
A1 = {1 1,
      0 1};
B1 = InvEx(A1);
A2 = {1 1,
      1 1};
B2 = InvEx(A2);
print B1, B2;

The function is called first with a nonsingular matrix, A1. The INV function does not fail, so the module exits normally. When the module exits, it flushes the execution queue. Because ErrFlag=0, the pushed statements have no effect. B1 is equal to the inverse of A1.

The function is called next with a singular matrix, A2. The INV function encounters an error and the module pauses. When the module pauses, it executes the statements in the queue. Because ErrFlag=1, the statements set AINV to a missing value and resume the module execution. The module exits normally and the B2 value is set to missing.


In summary, you can implement statements that handle run-time errors in SAS/IML modules. The key is to push error-handling code into a queue. The error-handling statements should include a RESUME statement. If the module pauses due to an error, the statements in the queue are executed and the module resumes execution. Be aware that the queue is flushed when the module exits normally, so use a flag variable and IF-THEN logic to control how the statements behave when there is no error.

The post Catch run-time errors in SAS/IML programs appeared first on The DO Loop.

11月 132017

Debugging is the bane of every programmer. SAS supports a DATA step debugger, but that debugger can't be used for debugging SAS/IML programs. In lieu of a formal debugger, many SAS/IML programmers resort to inserting multiple PRINT statements into a function definition. However, there is an easier way to query the values of variables inside a SAS/IML function: Use the PAUSE statement to halt the execution of program inside the misbehaving function. You can then interactively submit PRINT statements or any other valid statements. When you are finished debugging, you can submit the RESUME statement and the function will resume execution. (Or you can submit the STOP statement to halt execution and return to the main scope of the program.)

The PAUSE statement

The SAS/IML language is interactive. The PAUSE statement pauses the execution of a SAS/IML module. While the program is paused you can print or assign local variables inside the module. When you are ready for the module to resume execution, you submit the RESUME statement. (The SAS/IML Studio application works slightly differently. The PAUSE statement brings up a dialog box that you can use to submit statements. You press the Resume button to resume execution.)

For example, suppose you write a function called 'Func' whose purpose is to compute the sum of squares of the elements in a numeric vector. While testing the function, you discover that the answer is not correct when you pass in a row vector. You decide to insert a PAUSE statement ("set a breakpoint") near the end of the module, just before the RETURN statement, as follows:

proc iml;
/* in SAS/IML, the CALL EXECUTE statement runs immediately */
start Func(x);
   y = x`*x;
  /* Debugging tip: Use PAUSE to enter INTERACTIVE mode INSIDE module! */
  pause "Inside 'Func'; submit RESUME to resume computations.";
  /* Execution pauses until you submit RESUME, then continues from the next statement. */
   return (y);
w = Func( {1 2 3} );

When you run the program, it prints Inside 'Func'; submit RESUME to resume computations. The program then waits for you to enter commands. The program will remain paused until you submit the RESUME statement (include a semicolon!).

Because the program has paused inside the function, you can query or set the values of local variables. For this function, there are only two local variables, x and y. Highlight the following statement, and press F3 to submit it.

print y;     /* inside module; print local variable */

The output shows that the value of y is a matrix, not a scalar, which indicates that the expression x`*x does not compute the sum of squares for this input vector. You might recall that SAS/IML has a built-in SSQ function, which computes the sum of squares. Highlight the following statement and press F3 to submit it:

y = ssq(x);  /* assign local variable inside module */

This assignment has overwritten the value of the local variable, y. When you submit a RESUME statement, the function will resume execution and return the value of y. The program is now at the main scope, which you can demonstrate by printing the value of w:

resume;      /* resume execution. Return to main scope */
print w;     /* print result at main scope */

To make the change permanent, you must edit the function definition and redefine the module. Be sure to remove the PAUSE statement when you are finished debugging: If you run a program in batch mode that contains a PAUSE statement, the program will forever wait for input!

In conclusion, the PAUSE statement enables you to pause the execution of a program inside a module and interactively query and change the local variables in the module. This can help you to debug a function. After you finish investigating the state of the local variables, you can submit the RESUME statement, which tells the function to continue executing from the line after the PAUSE statement. (Or submit STOP to exit the function.) The PAUSE statement can be a useful alternative to inserting many PRINT statements inside a function during the debugging phase of program development.

The post A tip for debugging SAS/IML modules: The PAUSE statement appeared first on The DO Loop.

10月 232017

A common question on discussion forums is how to compute a principal component regression in SAS. One reason people give for wanting to run a principal component regression is that the explanatory variables in the model are highly correlated which each other, a condition known as multicollinearity. Although principal component regression (PCR) is a popular technique for dealing with almost collinear data, PCR is not a cure-all. This article shows how to compute a principal component regression in SAS; a subsequent article discusses the problems with PCR and presents alternative techniques.

Multicollinearity in regression

Near collinearity among the explanatory variables in a regression model requires special handling because:

  • The crossproduct matrix X`X is ill-conditioned (nearly singular), where X is the data matrix.
  • The standard errors of the parameter estimates are very large. The variance inflation factor (VIF), which is computed by PROC REG, is one way to measure how collinearities inflate the variances of the parameter estimates.
  • The model parameters are highly correlated, which makes interpretation of the parameters difficult.

Principal component regression keeps only the most important principal components and discards the others. This means that you compute the principal components for the explanatory variables and drop the components that correspond to the smallest eigenvalues of X`X. If you keep k principal components, then those components enable you to form a rank-k approximation to the crossproduct matrix. If you regress the response variable onto those k components, you obtain a PCR. Usually the parameter estimates are expressed in terms of the original variables, rather than in terms of the principal components.

In SAS there are two easy ways to compute principal component regression:

  • The PLS procedure supports the METHOD=PCR to perform principal component regression. You can use the NFAC= option to determine the number of principal components to keep.
  • The MODEL statement in PROC REG supports the PCOMIT= option. (This option is read as "PC omit.") The argument to the PCOMIT= option is the number of principal components to drop (omit) from the regression.

Notice that neither of these methods calls PROC PRINCOMP. You could call PROC PRINCOMP, but it would be more complicated than the previous methods. You would have to extract the first principal components (PCs), then use PROC REG to compute the regression coefficients for the PCs, then use matrix computations to convert the parameter estimates from the PCs to the original variables.

Principal component regression is also sometimes used for general dimension reduction. Instead of projecting the response variable onto a p-dimensional space of raw variables, PCR projects the response onto a k-dimensional space where k is less than p. For dimension reduction, you might want to consider another approach such as variable selection by using PROC GLMSELECT or PROC HPGENSELECT. The reason is that the PCR model retains all of the original variables whereas variable selection procedures result in models that have fewer variables.

Use PROC PLS for principal component regression

I recommend using the PLS procedure to compute a principal component regression in SAS. As mentioned previously, you need to use the METHOD=PCR and NFAC= options. The following data for 31 men at a fitness center is from the documentation for PROC REG. The goal of the study is to predict oxygen consumption from age, weight, and various physiological measurements before and during exercise. The following call to PROC PLS computes a PCR that keeps four principal components:

data fitness;
   input Age Weight Oxygen RunTime RestPulse RunPulse MaxPulse @@;
44 89.47 44.609 11.37 62 178 182   40 75.07 45.313 10.07 62 185 185
44 85.84 54.297  8.65 45 156 168   42 68.15 59.571  8.17 40 166 172
38 89.02 49.874  9.22 55 178 180   47 77.45 44.811 11.63 58 176 176
40 75.98 45.681 11.95 70 176 180   43 81.19 49.091 10.85 64 162 170
44 81.42 39.442 13.08 63 174 176   38 81.87 60.055  8.63 48 170 186
44 73.03 50.541 10.13 45 168 168   45 87.66 37.388 14.03 56 186 192
45 66.45 44.754 11.12 51 176 176   47 79.15 47.273 10.60 47 162 164
54 83.12 51.855 10.33 50 166 170   49 81.42 49.156  8.95 44 180 185
51 69.63 40.836 10.95 57 168 172   51 77.91 46.672 10.00 48 162 168
48 91.63 46.774 10.25 48 162 164   49 73.37 50.388 10.08 67 168 168
57 73.37 39.407 12.63 58 174 176   54 79.38 46.080 11.17 62 156 165
52 76.32 45.441  9.63 48 164 166   50 70.87 54.625  8.92 48 146 155
51 67.25 45.118 11.08 48 172 172   54 91.63 39.203 12.88 44 168 172
51 73.71 45.790 10.47 59 186 188   57 59.08 50.545  9.93 49 148 155
49 76.32 48.673  9.40 56 186 188   48 61.24 47.920 11.50 52 170 176
52 82.78 47.467 10.50 53 170 172
proc pls data=fitness method=PCR nfac=4;          /* PCR onto 4 factors */
   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse / solution;

The output includes the parameter estimates table, which gives the estimates for the four-component regression in terms of the original variables. Another table (not shown) shows that the first four principal components explain 93% of the variation in the explanatory variables and 78% of the variation in the response variable.

For another example of using PROC PLS to combat collinearity, see Yu (2011), "Principal Component Regression as a Countermeasure against Collinearity."

Use PROC REG for principal component regression

I recommend PROC PLS for principal component regression, but you can also compute a PCR by using the PCOMIT= option on the MODEL statement in PROC REG. However, the parameter estimates are not displayed in any table but must be written to OUTEST= data set, as follows:

proc reg data=fitness plots=none outest=PE; /* write PCR estimates to PE data set */
   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
         / PCOmit=2;       /* omit 2 PCs ==> keep 6-2=4 PCs */
proc print data=PE(where=(_Type_="IPC")) noobs;
   var Intercept--MaxPulse;

Notice that the PCOMIT=2 option specifies that two PCs should be dropped, which is equivalent to keeping four components in this six-variable model. The parameter estimates are written to the PE data set and are displayed by PROC PRINT. The estimates the same as those found by PROC PLS. In the PE data, the PCR estimates are indicated by the value "IPC" for the _TYPE_ variable, which stands for incomplete principal component regression. The word "incomplete" indicates that not all the principal components are used.

It is worth noting that even though the principal components themselves are based on centered and scaled data, the parameter estimates are reported for the original (raw) variables. It is also worth noting that you can use the OUTSEB option on the PROC REG statement to obtain standard errors for the parameter estimates.

Should you use principal component regression?

This article shows you how to perform principal component regression in SAS by using PROC PLS with METHOD=PCR. However, I must point out that there are statistical drawbacks to using principal component regression. The primary issue is that principal component regression does not use any information about the response variable when choosing the principal components. Before you decide to use PCR, I urge you to read my next post about the drawbacks with the technique. You can then make an informed decision about whether you want to use principal component regression for your data.

The post Principal component regression in SAS appeared first on The DO Loop.