optimization

March 7, 2018
 

Data analysts often fit a probability distribution to data. When you have access to the data, a common technique is to use maximum likelihood estimation (MLE) to compute the parameters of a distribution that are "most likely" to have produced the observed data. However, how can you fit a distribution if you do not have access to the data?

This question was asked by a SAS programmer who wanted to fit a gamma distribution by using sample quantiles of the data. In particular, the programmer said, "we have the 50th and 90th percentile" of the data and "want to find the parameters for the gamma distribution [that fit] our data."

This is an interesting question. Recall that the method of moments uses sample moments (mean, variance, skewness,...) to estimate parameters in a distribution. When you use the method of moments, you express the moments of the distribution in terms of the parameters, set the distribution's moments equal to the sample moments, and solve for the parameter values for which the equation is true.

In a similar way, you can fit a distribution by matching quantiles: Equate the sample and distributional quantiles and solve for the parameters of the distribution. This is sometimes called quantile-matching estimation (QME). Because the quantiles involve the cumulative distribution function (CDF), the equation does not usually have a closed-form solution and must be solved numerically.

Fit a two-parameter distribution from two quantiles

CDF that matches quantiles of a sample

To answer the programmer's question, suppose you do not have the original data, but you are told that the 50th percentile (median) of the data is x = 4 and the 90th percentile is x = 8. You suspect that the data are distributed according to a gamma distribution, which has a shape parameter (α) and a scale parameter (β). To use quantile-matching estimation, set F(4; α, β) = 0.5 and F(8; α, β) = 0.9, where F is the cumulative distribution of the Gamma(α, β) distribution. You can then solve for the values of (α, β) that satisfy the equations. You will get a CDF that matches the quantiles of the data, as shown to the right.

I have previously written about four ways to solve nonlinear equations in SAS. One way is to use PROC MODEL, as shown below:

data initial;
   alpha=1; beta=1;    /* initial guess for finding root */
   p1=0.5;  X1 = 4;    /* eqn for 1st quantile: F(X1; alpha, beta) = p1 */
   p2=0.9; X2 =  8;    /* eqn for 2nd quantile: F(X2; alpha, beta) = p2 */
run;
 
proc model data=initial;
  eq.one = cdf("Gamma", X1, alpha, beta) - p1;   /* find root of eq1 */
  eq.two = cdf("Gamma", X2, alpha, beta) - p2;   /*    and eq2 */
  solve alpha beta / solveprint out=solved outpredict;
run;quit;
 
proc print data=solved noobs;
   var alpha beta;
run;
Quantile-matching estimates: parameters estimated from observed quantiles

The output indicates that the parameters (α, β) = (2.96, 1.52) are the values for which the Gamma(α, β) quantiles match the sample quantiles. You can see this by graphing the CDF function and adding reference lines at the 50th and 90th percentiles, as shown at the beginning of this section. The following SAS code creates the graph:

/* Graph the CDF function to verify that the solution makes sense */
data Check;
set solved;            /* estimates of (alpha, beta) from solving eqns */
do x = 0 to 12 by 0.2;
   CDF = cdf("gamma", x, alpha, beta);
   output;
end;
run;
 
title "CDF of Gamma Distribution";
title2 "Showing 50th and 90th Percentiles";
proc sgplot data=Check;
   series x=x y=CDF / curvelabel;
   dropline y=0.5 X=4 / dropto=both;   /* first percentile */
   dropline y=0.9 X=8 / dropto=both;   /* second percentile */
   yaxis values=(0 to 1 by 0.1) label="Cumulative Probability";
   xaxis values=(0 to 12 by 2);
run;

Least squares estimates for matching quantiles

The previous section is relevant when you have as many sample quantiles as parameters. If you have more sample quantiles than parameters, then the system is overconstrained and you probably want to compute a least squares solution. If there are m sample quantiles, the least squares solution is the set of parameters that minimizes the sum of squares Σi (pi − F(xi; α, β))².

For example, the following DATA step contains five sample quantiles. The observation (p,q) = (0.1, 1.48) indicates that the 10th percentile is x=1.48. The second observation indicates that the 25th percentile is x=2.50. The last observation indicates that the 90th percentile is x=7.99. You can use PROC NLIN to find a least squares solution to the quantile-matching problem, as follows:

data SampleQntls;
input p q;  /* p is cumul probability; q is p_th sample quantile */
datalines;
0.1  1.48 
0.25 2.50
0.5  4.25
0.75 6.00
0.9  7.99 
;
 
/* least squares fit of parameters */
proc nlin data=SampleQntls /* sometimes the NOHALVE option is useful */
          outest=PE(where=(_TYPE_="FINAL"));
   parms alpha 2 beta 2;
   bounds 0 < alpha beta;
   model p = cdf("Gamma", q, alpha, beta);
run;
 
proc print data=PE noobs;
   var alpha beta;
run;
Least squares estimates based on matching sample quantiles

The solution indicates the parameter values (α, β) = (2.72, 1.70) minimize the sum of squares between the observed and theoretical quantiles. The following graph shows the observed quantiles overlaid on the CDF of the fitted Gamma(α, β) distribution. Alternatively, you can graph the quantile-quantile plot of the observed and fitted quantiles.

CDF for fitted distribution where parameters are based on matching sample quantiles
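If you want to create that quantile-quantile plot, the following statements are a minimal sketch. It assumes the PE and SampleQntls data sets from the PROC NLIN call above; the QQ data set name and the FittedQ variable are introduced only for illustration:

/* sketch: Q-Q plot of observed versus fitted quantiles (assumes PE data set from above) */
data QQ;
   if _N_=1 then set PE(keep=alpha beta);        /* least squares estimates */
   set SampleQntls;
   FittedQ = quantile("Gamma", p, alpha, beta);  /* theoretical p_th quantile */
run;
 
title "Observed versus Fitted Quantiles";
proc sgplot data=QQ;
   scatter x=FittedQ y=q;
   lineparm x=0 y=0 slope=1;                     /* identity line for reference */
   xaxis label="Quantile of Fitted Gamma Distribution";
   yaxis label="Observed Quantile";
run;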

Weighted least squares estimates for matching quantiles

For small samples, quantiles in the tail of a distribution have a large standard error, which means that the observed quantile might not be close to the theoretical quantile. One way to handle that uncertainty is to compute a weighted regression analysis where each sample quantile is weighted by the inverse of its variance. According to Stuart and Ord (Kendall’s Advanced Theory of Statistics, 1994, section 10.10), the variance of the p_th sample quantile in a sample of size n is σ² = p(1−p) / (n f(ξp)²), where ξp is the p_th quantile of the distribution and f is the probability density function.

In PROC NLIN, you can perform weighted analysis by using the automatic variable _WEIGHT_. The following statements define the variance of the p_th sample quantile and define weights equal to the inverse variance. Notice the NOHALVE option, which can be useful for iteratively reweighted least squares problems. The option eliminates the requirement that the weighted sum of squares must decrease at every iteration.

/* weighted least squares fit where w[i] = 1/variance[i] */
proc nlin data=SampleQntls NOHALVE
          outest=WPE(where=(_TYPE_="FINAL"));
   parms alpha 2 beta 2;
   bounds 0 < alpha beta;
   N = 80;                            /* sample size */
   xi = quantile("gamma", p, alpha, beta); /* quantile of distrib */
   f = pdf("Gamma", xi, alpha, beta); /* density at quantile */
   variance = p*(1-p) / (N * f**2);   /* variance of sample quantiles */
   _weight_ = 1 / variance;           /* weight for each observation */
   model p = cdf("Gamma", q, alpha, beta);
run;
Parameter estimates where parameters are based on a weighted matching of observed quantiles

The parameter estimates for the weighted analysis are slightly different from those for the unweighted analysis. The following graph shows the CDF for the weighted estimates, which does not pass as close to the 75th and 90th percentiles as does the CDF for the unweighted estimates. This is because the PDF of the gamma distribution is relatively small for those quantiles, which causes the regression to underweight those sample quantiles.

CDF for fitted distribution where parameters are based on a weighted matching of observed quantiles

In summary, this article shows how to use SAS to fit distribution parameters to observed quantiles by using quantile-matching estimation (QME). If the number of quantiles is the same as the number of parameters, you can numerically solve for the parameters for which the quantiles of the distribution equal the sample quantiles. If you have more quantiles than parameters, you can compute a least squares estimate of the parameters. Because quantile estimates in the tail of a distribution have larger uncertainty, you might want to underweight those quantiles. One way to do that is to run a weighted least squares regression where the weights are inversely proportional to the variance of the sample quantiles.

The post Fit a distribution from quantiles appeared first on The DO Loop.

November 27, 2017
 

When you run an optimization, it is often not clear how to provide the optimization algorithm with an initial guess for the parameters. A good guess converges quickly to the optimal solution whereas a bad guess might diverge or require many iterations to converge. Many people use a default value (such as 1) for all initial guesses. I have previously written about other ways to obtain an initial guess for an optimization.

In the special case of maximum likelihood estimation, you can use a trick to provide a good guess to the optimization algorithm. For maximum likelihood estimation, the objective function is the log-likelihood function of a distribution. Consequently, you can use the method of moments to provide the initial guess for the parameters, which often results in fast convergence to the maximum likelihood estimates. The faster convergence can save you considerable time when you are analyzing thousands of data sets, such as during a simulation study.

The method of moments

Suppose you believe that some data can be modeled by a certain distribution such as a lognormal, Weibull, or beta distribution. A common way to estimate the distribution parameters is to compute the maximum likelihood estimates (MLEs). These are the values of the parameters that are "most likely" to have generated the observed data. To compute the MLEs you first construct the log-likelihood function and then use numerical optimization to find the parameter values that maximize the log-likelihood function.

The method of moments is an alternative way to fit a model to data. For a k-parameter distribution, you write the equations that give the first k central moments (mean, variance, skewness, ...) of the distribution in terms of the parameters. You then replace the distribution's moments with the sample mean, variance, and so forth. You invert the equations to solve for the parameters in terms of the observed moments.

For example, suppose you believe that some data are distributed according to a two-parameter beta distribution. If you look up the beta distribution on Wikipedia, you will see that the mean and variance of a Beta(a, b) distribution can be expressed in terms of the two parameters as follows:

Mean and variance of the Beta(a, b) distribution: E[X] = a / (a + b) and Var[X] = a b / ( (a + b)² (a + b + 1) )

The first equation represents the expected (mean) value of a beta-distributed random variable X. The second equation represents the variance. If the sample mean of the data is m and the sample variance is v, you can "plug in" these values for the left-hand side of the equations and solve for the values of a and b that satisfy the equations. Solving the first equation for a yields a = b m / (1 - m). If you substitute that expression into the second equation and solve for b, you get b = m - 1 + (m/v)(1 - m)². Those equations give the parameter estimates from the method of moments.
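As a quick sanity check of that algebra, the following DATA step is a small sketch that uses hypothetical parameter values (a0=2, b0=6). It computes the exact mean and variance of a Beta(2, 6) distribution and verifies that the method-of-moments formulas recover the parameters:

data CheckMoM;
   a0 = 2;  b0 = 6;                         /* hypothetical "true" parameters */
   m = a0 / (a0 + b0);                      /* exact mean of Beta(a0, b0)     */
   v = a0*b0 / ((a0+b0)**2 * (a0+b0+1));    /* exact variance                 */
   b = m*(1-m)**2/v + m - 1;                /* method-of-moments estimates    */
   a = b*m / (1-m);
   put a= b=;                               /* log should show a=2 b=6        */
run;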

Maximum likelihood estimation: Using an arbitrary guess

The method of moments can lead to improved convergence when fitting a distribution. For comparison, first consider what happens if you use an arbitrary initial guess for the optimization, such as a = b = 1. The following statements use PROC NLMIXED to compute maximum likelihood estimates for the parameters of the beta distribution. (You can read about how to use PROC NLMIXED to construct MLEs. You can also compute MLEs by using SAS/IML.)

data Beta;
input x @@;
datalines;
0.58 0.23 0.41 0.69 0.64 0.12 0.07 0.19 
0.47 0.05 0.76 0.18 0.28 0.06 0.34 0.52 
0.42 0.48 0.11 0.39 0.24 0.25 0.22 0.05 
0.23 0.49 0.45 0.29 0.18 0.33 0.09 0.16 
0.45 0.27 0.31 0.41 0.30 0.19 0.27 0.57 
;
 
/* Compute parameter estimates for the beta distribution */
proc nlmixed data=Beta;
   parms a=1 b=1;                /* use arbitrary initial guesses */
   bounds a b > 0;
   LL = logpdf("beta", x, a, b); /* log-likelihood function */
   model x ~ general(LL);
run;
Solution to fitting a beta distribution in SAS

The MLEs are a = 1.8487 and b = 3.9552. The "Iteration History" table (not shown) shows that 8 iterations of a quasi-Newton algorithm were required to converge to the MLEs, and the objective function (the LOGPDF function) was called 27 times during the optimization.

The following graph shows a histogram of the data overlaid with a Beta(a=1.85, b=3.96) density curve. The curve fits the data well.

Histogram of data overlaid with a beta density curve, fitted by maximum likelihood estimation

Maximum likelihood estimation: Using method-of-moments for guesses

If you use the method of moments to provide an initial guess on the PARMS statement, the quasi-Newton algorithm might converge in fewer iterations. The following statements use PROC MEANS to compute the sample mean and variance, then use the DATA step to compute the method-of-moments estimates, as follows:

/* output mean and variance */
proc means data=Beta noprint;
   var x;
   output out=Moments mean=m var=v;  /* save sample moments */
run;
 
/* use method of moments to estimate parameters for the beta distribution */
data ParamEst;
   retain a b;               /* set order of output variables */
   set Moments;
   b = m*(1-m)**2/v + m - 1; /* estimate parameters by using sample moments */
   a = b*m / (1-m);
   keep a b;
run;

The method of moments estimates are a = 1.75 and b = 3.75, which are close to the MLE values. The DATA= option in the PARMS statement in PROC NLMIXED enables you to specify a data set that contains starting values for the parameters. The following call to PROC NLMIXED is exactly the same as the previous call except that the procedure uses the initial parameter values from the ParamEst data set:

proc nlmixed data=Beta;
   parms / data=ParamEst;        /* use initial guesses from method of moments */
   bounds a b > 0;
   LL = logpdf("beta", x, a, b); /* log-likelihood function */
   model x ~ general(LL);
run;
Iteration history for MLE for the beta distribution. Initial guesses are the method-of-moments estimates.

The MLEs from this call (not shown) are exactly the same as before. However, the "Iteration History" table indicates that the quasi-Newton algorithm converged in 5 iterations and required only 14 calls to the objective function. This compares with 8 iterations and 27 calls for the previous computation. Thus the method of moments resulted in about half as many function evaluations. In my experience this example is typical: Choosing initial parameters by using the method of moments often reduces the computational time that is required to obtain the MLEs.

It is worth mentioning that the PARMS statement in PROC NLMIXED supports two nice features:

  • The DATA= option accepts both wide and long data sets. The example here is wide: there are k variables that supply the values of the k parameters. You can also specify a two-variable data set with the names "Parameter" and "Estimate" (a sketch of the long form appears after this list).
  • The BYDATA option enables you to specify initial guesses for each BY group. In simulation studies, you might use BY-group processing to compute the MLEs for thousands of samples. The BYDATA option is essential to improving the efficiency of a large simulation study. It can also help ensure that the optimization algorithm converges when the parameters for the samples range over a large set of values.
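The following statements are a sketch of the long form. The variable names Parameter and Estimate follow the description in the first bullet; the data set name ParamEstLong is arbitrary, and the values are the method-of-moments estimates from this example:

data ParamEstLong;
   length Parameter $32;
   input Parameter $ Estimate;
   datalines;
a 1.75
b 3.75
;
 
proc nlmixed data=Beta;
   parms / data=ParamEstLong;    /* long-form initial values */
   bounds a b > 0;
   LL = logpdf("beta", x, a, b); /* log-likelihood function */
   model x ~ general(LL);
run;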

Summary and cautions

In summary, this article shows how to use the method of moments to provide a guess for optimizing a log-likelihood function, which often speeds up the time required to obtain the MLEs. The method requires that you can write the equations that relate the central moments of a distribution (mean, variance, ...) to the parameters for the distribution. (For a distribution that has k parameters, use the equations for the first k moments.) You then "plug in" the sample moments and algebraically invert the equations to obtain the parameter values in terms of the sample moments. This is not always possible analytically; you might need to use numerical methods for part of the computation.

A word of caution. For small samples, the method of moments might produce estimates that do not make sense. For example, the method might produce a negative estimate for a parameter that must be positive. In these cases, the method might be useless, although you could attempt to project the invalid parameters into the feasible region.

The post The method of moments: A smart way to choose initial parameters for MLE appeared first on The DO Loop.

September 12, 2017
 

Every fall, highways, backroads and neighborhood streets nationwide take on a noticeable yellow hue, as school buses carefully and methodically transport students back to school. In some areas, including Boston, this massive transportation exercise can present a number of challenges. Boston Public Schools (BPS) provided transportation for 25,000 students via [...]

How did one school system save money, improve local traffic and make students happier? With fewer bus stops and better bus schedules was published on SAS Voices by Shannon Heath

September 11, 2017
 

Did you know that you can get SAS to compute symbolic (analytical) derivatives of simple functions, including applying the product rule, quotient rule, and chain rule? SAS can form the symbolic derivatives of single-variable functions and partial derivatives of multivariable functions. Furthermore, the derivatives are output in a form that can be pasted into a SAS program. The trick is to use PROC NLIN with the LIST option.

In SAS/IML, the nonlinear optimization routines will use analytical derivatives if they are provided; otherwise, they will automatically generate numerical finite-difference approximations to the derivatives. I rarely specify derivatives because it can be time-consuming and prone to error. But recently I realized that PROC NLIN and other SAS procedures automatically generate symbolic derivatives, and you can "trick" the procedure into displaying the derivatives so that they can be used in other applications. To get PROC NLIN to output symbolic derivatives, do the following:

  • Create a data set to use for the DATA= option of PROC NLIN.
  • List the variables that you want to take derivatives of on the PARMS statement. In open code, assign a value to constants that appear in the function.
  • Specify the function to differentiate on the MODEL statement.

Symbolic derivatives of functions of one variable

Here is a one-variable example. Suppose you want to take the derivative of
f(x) = x³ + x sin(x) + exp(x²).
The function and its derivative are shown to the right.

The following SAS statements create a "fake" data set that defines a variable named F. The call to PROC NLIN sets up the problem. The LIST option on the PROC NLIN statement generates the 'ProgList' table, which contains the derivative function:

data _dummy;
   f = 1;                                    /* name of function */
run;
 
proc nlin data=_dummy list;
   parms x=0;                                /* list variables */
   model f = x**3 + x*sin(x) + exp(x**2);    /* define function */
   ods select ProgList;                      /* output symbolic derivatives */
run;
Symbolic derivatives in SAS for a univariate function

The output shows the expressions for the function ("MODEL.f") and the derivative with respect to x. The derivative df/dx is written as "@MODEL.f/@x", where the symbol '@' is used for the 'd'. You can see that SAS took the simple derivative of the x³ term, applied the product rule to the x*sin(x) term, and applied the chain rule to the exp(x²) term. You can also see that the expressions are written in SAS notation, with "**" indicating the power operator. SAS functions are written in upper case (SIN, COS, and EXP).
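For example, you can paste the derivative into a DATA step and evaluate it along with the original function. The derivative below is the standard calculus result written by hand in SAS notation; treat this as a sketch rather than a copy of the PROC NLIN output:

data EvalDeriv;
   do x = -1 to 1 by 0.5;
      f  = x**3 + x*sin(x) + exp(x**2);                 /* original function */
      df = 3*x**2 + sin(x) + x*cos(x) + 2*x*exp(x**2);  /* df/dx             */
      output;
   end;
run;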

Symbolic partial derivatives of multivariate functions

In exactly the same way, you can obtain partial derivatives. In a partial derivative, all variables are held constant except one. Suppose you have the function for the normal density,
f(x; μ, σ) = 1/(√(2π) σ) exp( −(x − μ)² / (2σ²) )
Suppose that you want to compute the partial derivatives ∂f/∂μ and ∂f/∂σ. To compute these derivatives, list μ and σ on the PARMS statement and assign values to the other symbols (x and π) that appear in the function. It does not matter what values you assign to x and π; you are just preventing PROC NLIN from complaining that there are unassigned variables. You could assign the value 1 to both variables, but I prefer to tell the procedure that π equals 3.14. In the following program, note that PROC NLIN reuses the fake data set that defines the F variable:

proc nlin data=_dummy list;
   parms mu=0 sigma=1;                        /* list variables */
   x=1; pi = constant('pi');                  /* assign arbitrary values to constants */
   model f = exp( -(x-mu)**2 / (2*sigma**2) ) / (sqrt(2*pi)*sigma); /* define function */
   ods select ProgList;                      /* output symbolic derivatives */
run;
Symbolic partial derivatives in SAS for a multivariate function

The derivative with respect to σ requires using the chain rule and the quotient rule. Notice that the derivative includes a call to the original function ("MODEL.f"). Notice also that there is repetition of various expressions such as -(x - μ)**2 / (2 σ**2). If you copy and paste these expressions into a SAS program that computes the derivative, the computation will be slightly inefficient because the program will evaluate the same expression multiple times. One way to improve the efficiency is to collect sub-expressions into a temporary variable, as shown in the next section.

Sub-expressions and the chain rule

An experienced statistician will recognize that the expression z = (x - μ) / σ arises naturally as the standardized variable. Using such an expression might simplify the derivatives.

The following program computes the same derivatives as before, except this time the programmer defines the expression z = (x - μ) / σ and uses z in the function. When SAS takes the derivative, it computes dz/dμ and dz/dσ as part of the computation, as shown below:

proc nlin data=_dummy list;
   parms mu=0 sigma=1;                        /* list variables */
   x=1; pi = constant('pi');                  /* assign arbitrary values to constants */
   const = sqrt(2*pi);
   z = (x-mu) / sigma;                        /* intermediate expression */
   model f = exp(-z**2 / 2) / (const*sigma);  /* define function */
   ods select ProgList;                       /* output symbolic derivatives */
run;
Symbolic derivatives in SAS that use sub-expressions and apply the chain rule

Notice that the derivatives use several sub-expressions such as dz/dμ and dz/dσ. This computation is more efficient than the previous one because it reuses previously computed quantities. For another example, define f1 = 1 / (sqrt(2*pi)*sigma) and f2 = exp(-z**2 / 2) and define the MODEL as f = f1*f2. You will see that each term in the derivative is efficiently evaluated in terms of previously computed expressions.
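The following statements are a sketch of that suggestion. They reuse the _dummy data set; only the factorization in the MODEL statement differs from the previous call:

proc nlin data=_dummy list;
   parms mu=0 sigma=1;                        /* list variables */
   x=1; pi = constant('pi');                  /* assign arbitrary values to constants */
   z  = (x-mu) / sigma;                       /* intermediate expression */
   f1 = 1 / (sqrt(2*pi)*sigma);               /* normalizing factor */
   f2 = exp(-z**2 / 2);                       /* exponential factor */
   model f = f1*f2;                           /* define function as a product */
   ods select ProgList;                       /* output symbolic derivatives */
run;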

Discussion of symbolic derivatives

As I said previously, I do not usually specify analytical derivatives for my optimizations. I find that numerical derivatives (which are automatically computed) are fast and accurate for 99% of the optimizations that I perform.

Furthermore, some of the objective functions that I optimize do not have analytical derivatives. Others are matrix computations for which it would be difficult and time-consuming to write down the analytical derivative. (For an example, see the article on the derivative of the bivariate normal cumulative distribution.)

If you decide to specify an analytical derivative in SAS, make sure that the derivative is correct and efficient. The technique in this article will help to ensure that the derivative is correct. By strategically defining sub-expressions as in the previous section, you can help SAS generate derivatives that are computationally efficient.

Be aware that the derivatives are generated by applying the chain rule, product rule, and quotient rule without any algebraic simplifications of the result. Thus the result can be unnecessarily complicated. For an example, ask SAS to display the derivative of log( (1+x)/(1-x) ) / 2. The naive result is complicated. If you rewrite the function as log(1+x)/2 - log(1-x)/2, then the derivative is much simpler. However, in neither case will SAS simplify the derivative to obtain 1 / (1-x**2).
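You can check that claim with the following sketch, which reuses the _dummy data set and differentiates both forms of the function:

proc nlin data=_dummy list;
   parms x=0.5;                               /* differentiate with respect to x */
   model f = log( (1+x)/(1-x) ) / 2;          /* naive form */
   ods select ProgList;
run;
 
proc nlin data=_dummy list;
   parms x=0.5;
   model f = log(1+x)/2 - log(1-x)/2;         /* rewritten form; simpler derivative */
   ods select ProgList;
run;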

For more of my thoughts on using derivatives in SAS, see "Two hints for specifying derivatives." What about your thoughts? What do you think about using PROC NLIN to generate automatic derivatives of simple functions? Do you think this technique will be useful to you?

The post Symbolic derivatives in SAS appeared first on The DO Loop.

July 7, 2017
 

For colleges and universities, awarding financial aid today requires sophisticated analysis. When higher education leaders ask, “How can we use financial aid to help meet our institutional goals?” they need to consider many scenarios to balance strategic enrollment goals, student need, and institutional finances in order to optimize yield and [...]

Meet student enrollment goals by optimizing your financial aid strategy was published on SAS Voices by Georgia Mariani

June 19, 2017
 

Most numerical optimization routines require that the user provide an initial guess for the solution. I have previously described a method for choosing an initial guess for an optimization, which works well for low-dimensional optimization problems. Recently a SAS programmer asked how to find an initial guess when there are linear constraints and bounds on the parameters. There are two simple approaches for finding an initial guess that is in the feasible region. One is the "shotgun" approach in which you generate many initial guesses and evaluate each one until you find one that is feasible. The other is to use the NLPFEA subroutine in SAS/IML, which takes any guess and transforms it into a feasible point in the linearly constrained region.

The NLPFEA subroutine in SAS/IML software

The NLPFEA routine returns a point in the feasible region from an arbitrary starting guess. Suppose the problem has p (possibly bounded) parameters, and the feasible region is formed by k > 0 additional linear constraints. Then you can represent the feasible region by a (k+2) x (p+2) matrix, which is the representation that is used for linear programming and constrained nonlinear optimizations. The first row specifies the lower bounds for the parameters and the second row specifies the upper bounds. (Missing values in the first p columns indicate that a parameter is not bounded.) The remaining k rows indicate linear constraints. For example, the following matrix defines a pentagonal region for two parameters. You can call the NLPFEA subroutine to generate a point that satisfies all constraints:

proc iml;
con = {  0   0   .   .,    /* param min */
        10  10   .   .,    /* param max */
         3  -2  -1  10,    /* 3*x1 + -2*x2 LE 10 */
         5  10  -1  56,    /* 5*x1 + 10*x2 LE 56 */
         4   2   1   7 };  /* 4*x1 +  2*x2 GE  7 */
 
guess = {0 0};                /* arbitrary p-dimensional point */
call nlpfea(z, guess, con);  /* z is a feasible point */
print guess[c={"x" "y"}], z[c={"Tx" "Ty"}];

The output shows that the guess (x,y) = (0,0) was not feasible, but the NLPFEA routine generated the transformed point T(x,y) = (1.2, 1.1), which is feasible.

It is interesting to visualize the NLPFEA subroutine. The following SAS/IML statements create 36 initial guesses that are distributed uniformly on a circle around the feasible region. For each guess, the program transforms the guess into the feasible region by calling the NLPFEA subroutine. The initial and transformed points are saved to a SAS data set and visualized by using PROC SGPLOT:

NumPts = 36;
twopi = 2*constant('pi');
x=.; y=.; Tx=.; Ty=.;
create feasible var {"x" "y" "Tx" "Ty"};
do i = 1 to NumPts;
   x = 2.5 + 5*cos((i-1)/NumPts * twopi);  /* guess on circle */
   y = 2.5 + 5*sin((i-1)/NumPts * twopi);
   call nlpfea(feasPt, x||y, con);         /* transform into feasible */
   Tx = feasPt[1]; Ty = feasPt[2];         
   append;
end;
close;
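The PROC SGPLOT call that creates the graph is not shown in this excerpt. A minimal sketch (run after the PROC IML session ends, or inside a SUBMIT/ENDSUBMIT block) might use a VECTOR statement to draw an arrow from each guess to its transformed point; the original graph might have been created differently:

proc sgplot data=feasible;
   vector x=Tx y=Ty / xorigin=x yorigin=y;   /* arrow from guess to feasible point */
   scatter x=Tx y=Ty;                        /* transformed (feasible) points */
   xaxis grid;  yaxis grid;
run;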
Transformation of points into a feasible region

The graph visualizes the result: it shows how each point on the circle is transformed into the feasible region. Some points are transformed into the interior, but many are transformed onto the boundary. You can see that the transformed point always satisfies the linear constraints.

SAS/IML automatically finds feasible points

Before I finish, I want to point out that the nonlinear programming (NLP) subroutines in SAS/IML software rarely require you to call the NLPFEA subroutine explicitly. When you call an NLP routine for a linearly constrained optimization and provide a nonfeasible initial guess, the NLP routine internally calls the NLPFEA routine. Consequently, you might see the following NOTE displayed in the SAS log: NOTE: Initial point was changed to be feasible for boundary and linear constraints. For example, run the following program, which provides a nonfeasible initial guess to the NLPNRA (Newton-Raphson) subroutine.

start func(x);
   x1 = x[,1];   x2 = x[,2];
   return ( -(x1-3)##4 + -(x2-2)##2 + 0.1*sin(x1#x2));
finish;
 
opt = {1,    /* find maximum of function     */
       3};   /* print a little bit of output */
x0 = {0 0};
call nlpnra(rc, x_Opt, "func", x0, opt) blc=con;

In this program, the NLPNRA subroutine detects that the guess is not feasible. It internally calls NLPFEA to obtain a feasible guess and then computes an optimal solution. This is very convenient for programmers. The only drawback is that you don't know the initial guess that produced the optimal solution. However, you can call the NLPFEA subroutine directly if you want to obtain that information.
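If you do want that information, a minimal sketch (continuing the same PROC IML session) is to call NLPFEA yourself and pass the transformed point to the optimizer:

call nlpfea(x0Feas, x0, con);                         /* transform the guess              */
call nlpnra(rc, x_Opt, "func", x0Feas, opt) blc=con;  /* optimize from the feasible point */
print x0Feas x_Opt;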

Summary

In optimization problems that have linear and boundary constraints, most optimization routines require an initial guess that is in the feasible region. The NLPFEA subroutine enables you to obtain a feasible point from an arbitrary initial guess. You can then use that feasible point as an initial guess in a built-in or user-defined optimization routine. However, for the built-in NLP subroutines, you can actually skip the NLPFEA call because the NLP subroutines internally call NLPFEA when you supply a nonfeasible initial guess.

The post How to find a feasible point for a constrained optimization in SAS appeared first on The DO Loop.

June 12, 2017
 

Maximum likelihood estimation (MLE) is a powerful statistical technique that uses optimization techniques to fit parametric models. The technique finds the parameters that are "most likely" to have produced the observed data. SAS provides many tools for nonlinear optimization, so often the hardest part of maximum likelihood is writing down the log-likelihood function. This article shows two simple ways to construct the log-likelihood function in SAS. For simplicity, this article describes fitting the binomial and lognormal distributions to univariate data.

Always use the log-likelihood function!

Although the method is known as maximum likelihood estimation, in practice you should optimize the log-likelihood function, which is numerically easier to work with. For an introduction to MLE, including the definitions of the likelihood and log-likelihood functions, see the Penn State Online Statistics Course, which is a wonderful reference.

MLE assumes that the observed data x={x1, x2, ..., xn} are independently drawn from some population. You want to find the most likely parameters θ = (θ1,...,θk) such that the data are fit by the probability density function (PDF), f(x; θ). Since the data are independent, the probability of observing the data is the product Πi f(xi, θ), which is the likelihood function L(θ | x). If you take the logarithm, the product becomes a sum. The log-likelihood function is
LL(θ | x) = Σi log( f(xi, θ) )

This formula is the key. It says that the log-likelihood function is simply the sum of the log-PDF function evaluated at the data values. Always use this formula. Do not ever compute the likelihood function (the product) and then take the log, because the product is prone to numerical errors, including overflow and underflow.

Two ways to construct the log-likelihood function

There are two simple ways to construct the log-likelihood function in SAS:

  • Use the LOGPDF function and sum the log-density values at the observations.
  • Write down the formula for the PDF, take its logarithm manually, and sum the terms at the observations.

Example: The log-likelihood function for the binomial distribution

A coin was tossed 10 times and the number of heads was recorded. This was repeated 20 times to get a sample. A student wants to fit the binomial model X ~ Binom(p, 10) to estimate the probability p of the coin landing on heads. For this problem, the vector of MLE parameters θ is merely the one parameter p.
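The original data set is not shown in this excerpt. The following DATA step creates a hypothetical version (an assumption, not the original data) that is consistent with the description and with the Binomial data set that is read by the program below:

/* hypothetical data: number of heads in 10 tosses, repeated 20 times */
data Binomial;
   input x @@;
   datalines;
7 4 6 5 6 7 5 4 5 6
5 6 4 7 8 5 6 5 7 4
;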

Recall that if you are using SAS/IML to optimize an objective function, the parameter that you are trying to optimize should be the only argument to the function, and all other parameters should be specified on the GLOBAL statement. Thus one way to write a SAS/IML function for the binomial log-likelihood function is as follows:

proc iml;
/* Method 1: Use LOGPDF. This method works in DATA step as well */
start Binom_LL1(p) global(x, NTrials);
   LL = sum( logpdf("Binomial", x, p, NTrials) );
   return( LL );
finish;
 
/* visualize log-likelihood function, which is a function of p */
NTrials = 10;    /* number of trials (fixed) */
use Binomial; read all var "x"; close;
 
p = do(0.01, 0.99, 0.01);      /* vector of parameter values */
LL = j(1, ncol(p), .);
do i = 1 to ncol(LL);
   LL[i] = Binom_LL1( p[i] );  /* evaluate LL for a sequence of p */
end;
 
title "Graph of Log-Likelihood Function";
title2 "Binomial Distribution, NTrials=10";
call series(p, LL) grid={x y} xvalues=do(0,1,0.1)
                   label={"Probability of Sucess (p)", "Log Likelihood"};
Graph of log-likelihood function for the binomial distribution

Notice that the data is constant and does not change. The log likelihood is considered to be a function of the parameter p. Therefore you can graph the function for representative values of p, as shown. The graph clearly shows that the log likelihood is maximal near p=0.56, which is the maximum likelihood estimate. The graph is fairly flat near its optimal value, which indicates that the estimate has a wide standard error. A 95% confidence interval for the parameter is also wide. If the sample contained 100 observations instead of only 20, the log-likelihood function might have a narrower peak.

Notice also that the LOGPDF function made this computation very easy. You do not need to worry about the actual formula for the binomial density. All you have to do is sum the log-density at the data values.

In contrast, the second method requires a little more work, but can handle any distribution for which you can compute the density function. If you look up the formula for the binomial PDF in the PROC MCMC documentation, you see that
PDF(x; p, NTrials) = comb(NTrials, x) * p**x * (1-p)**(NTrials-x)
where the COMB function computes the binomial coefficient "NTrials choose x." There are three terms in the PDF that are multiplied together. Therefore when you apply the LOG function, you get the sum of three terms. You can use the LCOMB function in SAS to evaluate the logarithm of the binomial coefficients in an efficient manner, as follows:

/* Method 2: Manually compute log likelihood by using formula */
start Binom_LL2(p) global(x, NTrials);
   LL = sum(lcomb(NTrials, x)) + log(p)*sum(x) + log(1-p)*sum(NTrials-x);
   return( LL );
finish;
 
LL2 = Binom_LL2(p);      /* vectorized function, so no need to loop */

The second formulation has an advantage in a vector language such as SAS/IML because you can write the function so that it can evaluate a vector of values with one call, as shown. It also has the advantage that you can modify the function to eliminate terms that do not depend on the parameter p. For example, if your only goal is maximize the log-likelihood function, you can omit the term sum(lcomb(NTrials, x)) because that term is a constant with respect to p. That reduces the computational burden. Of course, if you omit the term then you are no longer computing the exact binomial log likelihood.
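For example, the following module is a sketch of that reduction (the name Binom_LL3 is arbitrary). Because the constant term is omitted, the module differs from the exact log likelihood by a constant, so use it only when the goal is maximization:

start Binom_LL3(p) global(x, NTrials);
   /* omit sum(lcomb(NTrials, x)), which does not depend on p */
   return( log(p)*sum(x) + log(1-p)*sum(NTrials-x) );
finish;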

Example: The log-likelihood function for the lognormal distribution

In a similar way, you can use the LOGPDF or the formula for the PDF to define the log-likelihood function for the lognormal distribution. For brevity, I will only show the SAS/IML functions, but you can download the complete SAS program that defines the log-likelihood function and computes the graph.

The following SAS/IML modules show two ways to define the log-likelihood function for the lognormal distribution. For the lognormal distribution, the vector of parameters θ = (μ, σ) contains two parameters.

/* Method 1: use LOGPDF */
start LogNormal_LL1(param) global(x);
   mu = param[1];
   sigma = param[2];
   LL = sum( logpdf("Lognormal", x, mu, sigma) );
   return( LL );
finish;
 
/* Method 2: Manually compute log likelihood by using the lognormal PDF:
   PDF(x; mu, sigma) = exp( -(log(x)-mu)##2 / (2*sigma##2) ) / (x # sigma # sqrt(2*pi))
*/
start LogNormal_LL2(param) global(x);
   mu = param[1];
   sigma = param[2];
   twopi = 2*constant('pi');
   LL = -nrow(x)/2*log(twopi*sigma##2) 
        - sum( (log(x)-mu)##2 )/(2*sigma##2)
        - sum(log(x));  /* this term is constant w/r/t (mu, sigma) */
   return( LL );
finish;

The function that uses the LOGPDF function is simple to write. The second method is more complicated because the lognormal PDF is more complicated than the binomial PDF. Nevertheless, the complete log-likelihood function only requires a few SAS/IML statements.

For completeness, the contour plot on this page shows the log-likelihood function for 200 simulated observations from the Lognormal(2, 0.5) distribution. The parameter estimates are (μ, σ) = (1.97, 0.5).

Graph of the log-likelihood function for the lognormal distribution

Summary

This article has shown two simple ways to define a log-likelihood function in SAS. You can sum the values of the LOGPDF function evaluated at the observations, or you can manually apply the LOG function to the formula for the PDF function. The log likelihood is regarded as a function of the parameters of the distribution, even though it also depends on the data. For distributions that have one or two parameters, you can graph the log-likelihood function and visually estimate the value of the parameters that maximize the log likelihood.

Of course, SAS enables you to numerically optimize the log-likelihood function, thereby obtaining the maximum likelihood estimates. My next blog post shows two ways to obtain maximum likelihood estimates in SAS.
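For example, the following statements are a sketch that uses the NLPNRA subroutine to maximize the binomial log likelihood. It assumes that the Binom_LL1 module and its data are still defined in the PROC IML session; the bounds and option vector are illustrative:

optn = {1,                      /* find maximum of function    */
        2};                     /* print some iteration output */
con  = {0.0001, 0.9999};        /* bounds: 0 < p < 1           */
p0   = 0.5;                     /* initial guess               */
call nlpnra(rc, p_MLE, "Binom_LL1", p0, optn) blc=con;
print p_MLE;                    /* estimate is near 0.56       */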

The post Two simple ways to construct a log-likelihood function in SAS appeared first on The DO Loop.

May 1, 2017
 
Split data into groups that have the same mean and variance

A frequently asked question on SAS discussion forums concerns randomly assigning units (often patients in a study) to various experimental groups so that each group has approximately the same number of units. This basic problem is easily solved in SAS by using PROC SURVEYSELECT or a DATA step program.

A more complex problem is when the researcher wants the distribution of a covariate to be approximately equal for each group. For example, a medical researcher might want each group of patients to have approximately the same mean and variance for their cholesterol levels, as shown in the plot to the right. This second problem is much harder because it involves distributing the units (non-randomly) so that the groups satisfy some objective. Conceptually, you want an assignment that minimizes the difference between moments (mean, variance, skewness,...) of the subgroups.

The OPTEX procedure for optimal design

Creating an experimental design that optimizes some criterion is one of the many uses of the OPTEX procedure in SAS/QC software. In fact, this problem is one of the examples in the PROC OPTEX documentation. The example in this blog post follows the documentation example; see the documentation for details.

To solve this assignment problem with PROC OPTEX, you need to specify three things:

  • A data set (call it UNITS) that contains the individual units (often patients) that you want to assign to the treatments. For this example, the units are the living patients in the Sashelp.Heart data set.
  • A data set (call it TREATMENTS) that contains the names of the treatment groups. This example uses five groups with values 1, 2, 3, 4, and 5.
  • The covariate in the problem. The OPTEX procedure will assign units to groups so that the first k moments are approximately equal across treatment groups. This example uses the Cholesterol variable and k=2, which means that the mean and variance of the cholesterol levels for patients in each treatment group will be approximately equal.

The following statements create the two data sets and define a macro variable that contains the name of the Cholesterol variable:

/* Split the living Sashelp.heart patients into five groups so that 
   mean and variance of cholesterol is the same across groups. */
data Units;                   /* each row is a unit to be assigned */
set Sashelp.Heart(where=(status = "Alive"));
keep Sex AgeAtStart Height Weight Diastolic Systolic MRW Cholesterol;
run;
 
%let NumGroups =5;           /* number of treatment groups */
data Treatments;
do Trt = 1 to &NumGroups;    /* Trt is variable that assigns patients to groups */
   output;
end;
run;
 
%let Var = Cholesterol;      /* name of covariate */

As discussed in the PROC OPTEX documentation, the following call creates an output data set (named GROUPS) that assigns each patient to a group (1–5). A call to PROC MEANS displays the mean and variance of each group.

proc optex data=Treatments seed=97531 coding=orthcan;
   class Trt;
   model Trt;              /* specify treatment model */
   blocks design=Units;    /* specify units */
   model &Var &Var*&Var;   /* fixed covariates: &Var--> mean, &Var*&Var--> variance */
   output out=Groups;      /* merged data: units assigned to groups */
run;
 
proc means data=Groups mean std;
  class Trt;
  var &Var;
run;

Success! The table shows that each group has approximately the same size and approximately the same mean and variance of the Cholesterol variable. Remember, though, that this assignment scheme is not random, so be careful not to make unjustified inferences from estimates based on these group assignments.

I am not an expert on experimental design, but a knowledgeable colleague tells me that optimal design theory states that the design that minimizes the variance of the treatment effects (adjusting for the first two moments of the covariate) is the design in which treatment means and variances of the covariate are as equal as possible. This is the design that PROC OPTEX has produced.

The following call to PROC SGPLOT uses box plots to visualize the distribution of the Cholesterol variable across treatment groups. The graph is shown at the top of this article. Clearly the distribution of cholesterol is very similar for each group.

proc sgplot data=Groups;
   vbox &Var / category=Trt;
run;

In summary, the OPTEX procedure in SAS/QC software enables you to assign units to groups, where each group has approximately the same distribution of a specified covariate. In this article, the covariate measured cholesterol levels in patients, but you can also group individuals according to income, scholastic aptitude, and so on. For large data sets, the assignment problem is a challenging optimization problem, but PROC OPTEX provides a concise syntax and solves this problem efficiently.

The post Split data into groups that have the same mean and variance appeared first on The DO Loop.

April 12, 2017
 

At SAS Global Forum last week, I saw a poster that used SAS/IML to optimize a quadratic objective function that arises in financial portfolio management (Xia, Eberhardt, and Kastin, 2017). The authors used the Newton-Raphson optimizer (NLPNRA routine) in SAS/IML to optimize a hypothetical portfolio of assets.

The Newton-Raphson algorithm is one of my favorite optimizers. However, for a quadratic objective function there is a simpler choice. You can use the NLPQUA function in SAS/IML to optimize a quadratic objective function.

Optimal value of quadratic function of two variables

Optimize an unconstrained quadratic objective function in SAS/IML

Let's see how this works by specifying a quadratic polynomial in two variables. The function is
f(x,y) = (9*x##2 + x#y + 4*y##2) - 12*x - 4*y + 6
       = 0.5*x` H x + L x + 6
where H = {18  1, 1  8} is a 2x2 symmetric matrix, L = {-12  -4} is a row vector, and x = {x, y} is a column vector.

To use the NLPQUA subroutine, you need to write down the Hessian matrix, H, which is the symmetric matrix of second derivatives. Of course, for a quadratic function the second derivatives are constants. In the following SAS/IML program, the matrix H contains the second derivatives and the vector lin contains the coefficients for the linear and constant terms of f:

/* objective function is
   f(x,y) = (3*x-2)##2 + (2*y-1)##2 + x#y + 1
          = (9*x##2  + x#y + 4*y##2)  + -12*x  - 4*y + 6
   grad(f) = g(x,y) = [ 18*x + y - 12,  x + 8*y - 4 ]
   hess(f)  = [ dg1/dx  dg2/dx ]  =  [ 18  1 ]
              [ dg1/dy  dg2/dy ]     [  1  8 ]
*/
proc iml;
H = {18 1,             /* matrix of second derivatives */
      1 8};
lin = { -12 -4 6 };    /* vector of linear terms and the constant term */

The NLPQUA subroutine enables you to specify the Hessian matrix and the linear coefficients. From those values, the routine uses an efficient solver to find the global minimum of the quadratic polynomial, which occurs at x=92/143, y=60/143:

x0 = j(1, 2, 1);       /* initial guess */
opt = {0 1};           /* minimize, print final parameters */
call nlpqua(rc, xOpt, H, x0, opt) lin=lin;
print xOpt; 
print xOpt[format=Fract.];
Optimal values of a quadratic function

Solve a constrained quadratic program in SAS/IML

You can also use the NLPQUA subroutine to solve a constrained problem. Suppose that you are only interested in values of (x,y) in the unit square and you require that the solution satisfies the linear constraint x + y = 1. In that case you can define a matrix that specifies the linear constraints as follows:

blc = {0 0 . .,   /* lower bound: 0 <= x_i        */
       1 1 . .,   /* upper bound:      x_i <= 1   */
       1 1 0 1};  /* linear constraint x +  y = 1 */
call nlpqua(rc, conOpt, H, x0, opt, blc) lin=lin;
print conOpt[format=Fract.];

Notice how the linear constraints are specified for this (and every other) NLP function. Let p be the number of parameters in the problem. The first p elements of the first row specify the lower bounds for the parameters; use a missing value for variables that are not bounded below. The last two columns of the first row are not used, so you should always set those elements to missing values. Similarly, the first p elements of the second row specify the upper bound for the variables. The last two columns of the second row are not used, so set those elements to missing values.

Additional rows specify coefficients for the linear constraint. An equality constraint of the form a1*x1 + a2*x2 + ... + an *xn = c is encoded as a row of coefficients {a1 a2 ... an 0 c}, where the 0 in the penultimate location indicates equality. A "less than" inequality of the form a1*x1 + a2*x2 + ... + an *xn ≤ c is encoded as the row {a1 a2 ... an  -1  c}, where the -1 indicates the "less than or equal" symbol. Similarly, a value of +1 in the penultimate location indicates the "greater than or equal" symbol (≥). See the documentation for the NLP routines for additional details about specifying constraints.
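As a small sketch of this encoding (with hypothetical constraints that are not part of this example), the following matrix specifies 0 ≤ x, y ≤ 5 along with one "less than" and one "greater than" constraint:

blc2 = {0  0  .  .,     /* lower bounds: 0 <= x, 0 <= y      */
        5  5  .  .,     /* upper bounds: x <= 5, y <= 5      */
        1  2 -1 10,     /* x + 2*y <= 10  (-1 means "<=")    */
        3 -1  1  1};    /* 3*x -  y >= 1  (+1 means ">=")    */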

Solve quadratic programs in PROC OPTMODEL

If you have access to SAS/OR software, PROC OPTMODEL provides a simple and natural language for solving simple and complex optimization problems. For this problem, PROC OPTMODEL detects that the objective function is quadratic and automatically calls a quadratic programming (QP) solver, as follows:

/* variable names x and y */
proc optmodel;
   /* unconstrained quadratic optimization */
   var x, y;
   min f = 9*x^2 + x*y + 4*y^2 - 12*x - 4*y + 6;
   solve;                    /* call QP solver */
   print x Fract. y Fract.;
 
   /* constrained quadratic optimization */
   x.lb = 0; x.ub = 1;        /* bounds on variables */
   y.lb = 0; y.ub = 1;
   con SumToOne:              /* linear constraint */
      x + y = 1;
   solve;                     /* call QP solver */
   print x Fract. y Fract.;
quit;

In summary, if you want to optimize a quadratic objective function (with or without constraints), use the NLPQUA subroutine in SAS/IML, which is a fast and efficient way to maximize or minimize a quadratic function of many variables. If you have access to SAS/OR software, PROC OPTMODEL provides an even simpler syntax.

The post Quadratic optimization in SAS appeared first on The DO Loop.

January 27, 2017
 

There are so many ways in which a customer’s journey of experiences can be negatively affected, from forms on websites that are unclear or complicated, to inconsistent or non-relevant interactions over many channels. It is important that these interactions are measured and reduced to maximize customer engagement and increase customer satisfaction over the long run.

You can tackle this challenge from several directions, from A/B testing and Multi-Armed Bandit tests that optimize interactions at specific points in a journey (these are available in SAS 360), to approaches that optimize the full customer journeys over many sequential points in the journey.

Optimizing full analytically-driven customer journeys

This year I was involved in a project for a large retailer. The retailer believed there were a significant number of interactions with customers that had an impact on response rates – that is, positive (halo) effects and negative (cannibalization) effects. These effects are difficult to deal with using standard optimization techniques that assume independence of contacts, so full customer journey optimization was used to identify the effects and address this complexity successfully.

As always, the first step was to get consolidated data at the individual customer level. We were able to accomplish this because we had good-quality data for the project – customer-level demographic data and contact history data.

The stages of customer journey optimization

The journey optimization was then carried out in three stages:

Stage 1 – Creating analytically driven customer journeys is an important advancement towards truly effective analytically driven omnichannel marketing. We used decision trees (in SAS Enterprise Miner) on the customer history data to map widely varied journeys that customers were taking. Traditionally, decision trees are used on a wider set of data, but by using just the history, the paths of significant activity that led to purchases were identified.

Stage 2 – Next, these analytically driven journey maps were used as inputs for optimization. For every journey identified by the decision trees, a predictive model was created (using SAS Enterprise Miner) to predict spending, so that for every customer, for every journey, we can predict how much they will spend.

Stage 3 – Finally, the data was optimized using SAS Marketing Optimization, and constraints were applied to establish a final set of scenarios that the retailer agreed would be appropriate to implement.

This illustrates how decision trees can be used to map customer journeys, and how these journeys can then be optimized, replacing the disjointed and disconnected results of traditional optimization methods. We are also beginning to use the cutting-edge machine learning technique of deep reinforcement learning to further optimize customer journeys. These techniques will be incorporated into SAS Customer Intelligence solutions to ensure that SAS users can explore this complicated and increasingly important area of customer intelligence.


Customer journey optimization: A real-world example was published on Customer Intelligence.