Statistical Programming

March 20, 2023

In a previous article about Markov transition matrices, I mentioned that you can estimate a Markov transition matrix by using historical data that are collected over a certain length of time. A SAS programmer asked how you can estimate a transition matrix in SAS. The answer is that you can use PROC FREQ to tabulate the number of transitions from one state to another. The procedure outputs counts and row percentages, either of which can be used to construct an estimate of the transition matrix.

Transition of families through economic classes

Let's start with some data. A charity provides education, healthcare, and agricultural assistance to an impoverished town in Central America. The charity wants to estimate the transition of families through various economic categories (or states) based on the total family income:

  • State 1 is used for the poorest families who earn less than $2 per day per person.
  • State 2 is for families who earn between $2 and $10 per day per person.
  • State 3 is for families who earn between $10 and $20 per day per person.
  • State 4 is for families who earn more than $20 per day per person.

The charity has access to the economic status of 160 families who have been in the program for five years. The following SAS DATA step defines the beginning and ending states for these families:

/* State 1: families who earn less than $2 per day
   State 2: families who earn between $2-$10 per day
   State 3: families who earn between $10-$20 per day
   State 4: families who earn more than $20 per day
*/
data History;
retain ID;
input BeginState EndState @@;
ID + 1;
datalines;
1 2  2 2  1 3  3 3  3 3  3 3  1 1  3 2  4 4  3 3 
4 4  1 1  3 2  1 1  1 3  3 3  2 2  2 2  2 2  3 2 
2 3  1 3  1 1  1 2  4 3  1 1  3 4  1 3  3 3  1 2 
1 2  3 3  1 3  3 4  2 2  1 2  3 2  1 2  1 1  3 2 
1 3  1 1  1 1  1 1  1 3  1 3  3 3  1 1  2 2  4 4 
1 1  2 3  1 1  1 2  2 2  2 2  1 3  2 2  1 1  1 2 
3 3  1 3  4 4  1 3  3 4  1 1  1 2  2 2  1 2  3 2 
1 1  3 3  3 3  1 2  1 1  1 2  3 3  2 2  1 3  3 2 
1 1  1 2  1 1  4 2  1 2  1 3  1 2  1 1  2 1  1 1 
2 3  1 2  2 2  1 1  3 4  1 1  1 1  2 2  3 3  4 3 
3 2  4 3  1 1  2 1  2 3  2 2  1 2  4 4  1 2  2 1 
2 1  2 2  1 1  2 3  4 4  1 2  1 1  2 2  1 2  4 2 
1 1  2 2  1 1  1 2  1 1  2 1  2 1  1 2  2 3  2 2 
3 3  4 3  1 1  2 2  1 1  2 1  1 1  2 2  1 1  1 1 
1 1  2 2  1 1  3 2  1 3  3 2  3 3  4 4  1 1  4 2 
3 3  4 4  3 2  4 4  2 2  1 3  3 3  4 4  4 3  1 2 
;
 
proc print data=History(obs=10) noobs;
run;

The output from PROC PRINT shows the beginning and ending states for 10 families. The first family was in State 1 at the beginning of the program but was in State 2 at the end of the program. The second family was in State 2 at the beginning and remained there. The third family was in State 1 at the beginning of the program but was in State 3 at the end, and so forth. You can use PROC FREQ to tabulate the matrix of counts for the transitions from one category into another, as follows:

proc freq data=History;
   tables BeginState * EndState / out=freqOut sparse nocol nopercent outpct;
run;

The output shows the counts and the row percentages for the data. The first row of the output is for the 75 families who started the program in State 1. Of those families, 37 (49%) remained in State 1 at the end of the program, 23 (31%) had progressed to State 2, and 15 (20%) had progressed to State 3. The second row of the output is for the 35 families who started the program in State 2. Of those families, 7 (20%) regressed to State 1, 22 (63%) remained in State 2, and 6 (17%) advanced to State 3. The other rows of the output are interpreted similarly.

You can use the row percentages to estimate the transition matrix. Merely divide each percentage by 100 to obtain a proportion. The proportion in the (i,j)th cell estimates the probability that a family that was in State i at the beginning of the program is in State j at the end of the program.

Reading the probabilities into a transition matrix

Most SAS programmers use SAS IML software to work with Markov transition matrices. The output from PROC FREQ is in "long form" in a data set that has 16 rows. You can read the estimates into a SAS IML vector and then reshape them into a 4 x 4 matrix. You can create the matrix in two ways: you can read the raw counts into a matrix and then divide each row by the row sum, or you can read the row percentages directly and then divide by 100 to obtain probabilities.

proc iml;
use freqOut;
   read all var {'BeginState' 'EndState' 'Count' 'Pct_Row'};  /* read the states, counts, and row percentages */
close;
 
N = sqrt(nrow(Count));    /* this should be an integer or else something is wrong */
names = EndState[1:N];    /* there should be N unique states */
 
/* estimate transition matrix by using counts of transitions */
C = shape(Count, N, N);   /* matrix of raw counts for each transition */
M = C / C[,+];            /* divide each cell by total counts for the row */
print M[r=names c=names];
 
/* or read the PCT_ROW column directly */
M = shape(Pct_Row, N, N);       /* row percentages */
M = M / 100;                    /* convert from percentages to proportions */
print M[r=names c=names];

The matrix of counts shows that 23+15+6+4 = 48 out of 160 families improved their states during the program. Only 7+11+3+5 = 26 families ended the program in a worse state than they began.
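
You can also compute these counts programmatically by summing the cells above and below the diagonal of the count matrix. The following statements are a small sketch that continues the PROC IML session and uses the ROW and COL functions to identify the off-diagonal cells:

/* count families above (improved) and below (regressed) the diagonal */
improved  = sum( C[ loc(row(C) < col(C)) ] );   /* EndState > BeginState */
regressed = sum( C[ loc(row(C) > col(C)) ] );   /* EndState < BeginState */
print improved regressed;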

The estimates for the transition probability matrix are the same for both calculations, so only one output is shown. For the poorest families (the first row), about 50% did not improve their economic state whereas the other 50% did. For the families that began in State 2, 20% slipped back into extreme poverty (unfortunately), 63% stayed in that state, and 17% increased their state. The remaining rows have similar interpretations.

Summary

This article shows how to use PROC FREQ in SAS to construct an estimate of a Markov transition matrix from historical data. For each subject in the study, you need to know the state of the subject at the beginning and at the end of a time period. You can then construct a matrix of counts for the transition of subjects between states. By dividing each row by its row sum, you obtain empirical probability estimates.

The post Estimate a Markov transition matrix from historical data appeared first on The DO Loop.

March 15, 2023

Most homeowners know that large home improvement projects can take longer than you expect. Whether it's remodeling a kitchen, adding a deck, or landscaping a yard, big projects are expensive and subject to a lot of uncertainty. Factors such as weather, the availability of labor, and the supply of materials can all contribute to uncertainty in the duration of a project.

If you ask an expert (for example, an experienced general contractor) to predict the length of time required to complete a project, he or she might respond as follows:

  • 10% of the time, a project like this takes 17 work days or less.
  • 50% of the time, we can complete it in 24 work days or less.
  • 90% of the time, we finish the work by the 35th work day.

Whether the contractor realizes it or not, these sentences are estimating three quantiles of a continuous probability distribution for the completion of the project.

Now, if you are planning a garden party on your new deck, you don't want to schedule the party before the deck is finished. One way to predict when you can have the party is to fit a probability distribution to the contractor's estimates. But what distribution should you use? The contractor's estimates do not describe a symmetric distribution, such as the normal distribution. Without domain-specific knowledge, there is no reason to prefer one distribution over another. Furthermore, you don't have any real data to fit the model; you have only the three estimated quantiles.

This is an excellent opportunity to use the flexible metalog family of distributions. This article shows how to convert the expert's estimates into a distribution that you can compute with and simulate data from.

Create a distribution from expert opinion

In this example, the expert predicted the quantiles (completion time) for three cumulative probabilities (10%-50%-90%). Mathematically, the expert has provided three points that are on (or near) the cumulative distribution for the completion time. Equivalently, the expert has provided three points on (or near) the quantile function. It is easy to fit a model to these values by using the metalog distribution. As discussed in a previous article, SAS supports the metalog distribution through a package of SAS IML functions that you can download from GitHub.

The following program assumes that you have downloaded and installed the metalog functions by using the instructions in the previous article or by following the instructions in Appendix A of the documentation for the metalog package, which is in the file MetalogDoc.pdf. This example uses a three-term metalog model. I assume that the project is never completed in less than 10 days, so I fit a semibounded model that has support on the interval [10, ∞). You can use the ML_CreateFromCDF function to create a metalog object from the estimates. (An estimate of this type is called a symmetric-percentile triplet (SPT) because the quantiles are provided for the probabilities (α, 1/2, 1-α), where 0 < α < 1/2. In this example, α = 0.1.)

proc iml;
/* This LOAD statement assumes that the metalog functions are defined as in  Appendix A of the documentation.
   See the MetalogDoc.pdf file at https://github.com/sassoftware/sas-iml-packages/tree/main/Metalog */
load module=_all_;         
 
prob = {0.1, 0.5, 0.90};   /* cumulative probabilities; note symmetric-percentile triplet */
x    = {17,   24,  35};    /* associated quantiles */
 
order = 3;                 /* three terms in model */
bounds = {10 .};           /* support on [10, infinity) */
 
SPT = ML_CreateFromCDF(x, prob, order, bounds );  /* create model from SPT */
title "3-Term Metalog Model of Expert Opinion";
p = do(0.01, 0.99, 0.01);
run ML_PlotECDF(SPT, p);  /* graph the model CDF and the expert's estimates */

For these three data points, a three-term metalog model passes through the expert's estimates, so the model is a perfect fit. If you prefer to view the probability density for the model, you can use the ML_PlotPDF call, as follows:

title "PDF of 3-Term Metalog Model of Completion Times";
run ML_PlotPDF(SPT, p);

The PDF indicates the most probable duration of the project (the mode) is about 20-25 days.

The expert gave estimates for three probability values. But what if you want estimates for other probabilities? For example, let's calculate dates for which the model predicts a 95% and a 99% chance that the deck will be completed. We can get those estimates by evaluating the quantile function of the metalog model. Let's evaluate the quantile function at 0.95 and 0.99, as well as at a few other values:

/* estimates for other percentiles */
p = {0.05, 0.25, 0.75, 0.95, 0.99};
estTime = ML_Quantile(SPT, p);
print p[F=PERCENT9.] estTime[F=6.2];

According to the model, there is a 95% chance that the project will be finished by the 40th day. There is a 99% chance that the project will be completed within 56 days. You can use those estimates to schedule a party on your new deck!
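
The metalog model also supports simulation. For example, if you want to estimate the probability that the project finishes before some other date, you can simulate many completion times from the model. The following statements are a small sketch that continues the session above and uses the ML_Rand function from the metalog package:

/* sketch: simulate 10,000 completion times from the model */
call randseed(4321);
T = ML_Rand(SPT, 10000);       /* random completion times */
ProbBy45 = mean(T <= 45);      /* estimated probability of finishing by day 45 */
print ProbBy45;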

Summary

The metalog distribution can be used to construct a probability distribution from an expert's opinion. This example used three estimates for the completion of a home-improvement project. The same technique enables you to fit a metalog distribution to four, five, or more quantile estimates from an expert. The metalog model provides an easy way to convert an expert's opinion into a probability distribution that can be used for simulation and modeling.

The post Fitting a distribution to an expert's opinion: An application of the metalog distribution appeared first on The DO Loop.

March 13, 2023

A previous article describes the metalog distribution (Keelin, 2016). The metalog distribution is a flexible family of distributions that can model a wide range of shapes for data distributions. The metalog system can model bounded, semibounded, and unbounded continuous distributions. This article shows how to use the metalog distribution in SAS by downloading a library of SAS IML functions from a GitHub site.

The article contains the following sections:

  • An overview of the metalog library of SAS IML functions
  • How to download the functions from GitHub and include the functions in a program
  • How to use a metalog model to fit an unbounded model to data
  • A discussion of infeasible models

For a complete discussion of the metalog functions in SAS, see "A SAS IML Library for the Metalog Distribution" (Wicklin, 2023).

The metalog library of functions in SAS

The metalog library is a collection of SAS IML modules. The modules are object-oriented, which means that you must first construct a "metalog object," usually from raw data. The object contains information about the model. All subsequent operations are performed by passing the metalog object as an argument to functions. You can classify the functions into four categories:

  • Constructors create a metalog object from data. By default, the constructors fit a five-term unbounded metalog model.
  • Query functions enable you to query properties of the model.
  • Computational functions enable you to compute the PDF and quantiles of the distribution and to generate random variates.
  • Plotting routines create graphs of the CDF and PDF of the metalog distribution. The plotting routines are available when you use PROC IML.

Before you create a metalog object, you must choose how to model the data. There are four types of models: unbounded, semibounded with a lower bound, semibounded with an upper bound, or bounded. Then, you must decide the number of terms (k) to use in the metalog model. A small number of terms (3 ≤ k ≤ 6) results in a smooth model. A larger number of terms (k > 6) can fit multimodal data distributions, but be careful not to overfit the data by choosing k too large.

The most common way to create a metalog object is to fit a model to the empirical quantiles of the data, which estimates the k parameters in the model. For an unbounded metalog model, the quantile function is
\( Q(p) = a_1 + a_2 L(p) + a_3 c(p) L(p) + a_4 c(p) + a_5 c(p)^2 + a_6 c(p)^2 L(p) + a_7 c(p)^3 + \ldots \)
where L(p) = Logit(p) is the quantile function of the logistic distribution and c(p) = p – 1/2 is a linear function. You use ordinary least squares regression to fit the parameters (\(a_1, a_2, \ldots, a_k\)) in the model.
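
To make the fitting step concrete, the following sketch shows the OLS computation for a five-term unbounded model. The code is illustrative (the MetalogFit5 function is hypothetical, not part of the metalog package); it builds a design matrix from the basis functions in the formula and solves the normal equations:

proc iml;
/* sketch: OLS fit of a 5-term unbounded metalog model.
   p : column vector of cumulative probabilities, 0 < p < 1
   x : column vector of the corresponding empirical quantiles */
start MetalogFit5(p, x);
   Lp = log( p/(1-p) );                                /* L(p) = logit(p) */
   cp = p - 0.5;                                       /* c(p) = p - 1/2 */
   M  = j(nrow(p),1,1) || Lp || cp#Lp || cp || cp##2;  /* basis columns */
   return( solve(M`*M, M`*x) );                        /* estimates a1, ..., a5 */
finish;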

The semibounded and bounded versions are metalog models of log-transformed variables.

  • If X > L, fit a metalog model to Z = log(X – L)
  • If L < X < U, fit a metalog model to Z = log((X – L) / (U – X))

How to download the metalog functions from GitHub

The process of downloading SAS code from a GitHub site is explained in the article, "How to use Git to share SAS programs." The process is also explained in Appendix A in the metalog documentation. The following SAS program downloads the metalog functions to the WORK libref. You could use a permanent libref, if you prefer:

/* download metalog package from GitHub */
options dlcreatedir;
%let repoPath = %sysfunc(getoption(WORK))/sas-iml-packages;  /* clone repo to WORK, or use permanent libref */
 
/* clone repository; if repository exists, skip download */
data _null_;
if fileexist("&repoPath.") then 
   put 'Repository already exists; skipping the clone operation'; 
else do;
   put "Cloning repository 'sas-iml-packages'";
   rc = gitfn_clone("https://github.com/sassoftware/sas-iml-packages/", "&repoPath." ); 
end;
run;
 
/* Use %INCLUDE to read source code and STORE functions to current storage library */
proc iml;
%include "&repoPath./Metalog/ML_proc.sas";   /* each file ends with a STORE statement */
%include "&repoPath./Metalog/ML_define.sas";
quit;

Each file ends with a STORE statement, so running the above program stores the metalog functions in the current storage library. The first file (ML_proc.sas) should be included only when you are running the program in PROC IML because it contains the plotting routines. The second file (ML_define.sas) should be included in both PROC IML and the iml action.

Fit an unbounded metalog model to data

The Sashelp.heart data set contains data on patients in a heart study. The data set contains a variable (Systolic) for the systolic blood pressure of 5,209 patients. The following call to PROC SGPLOT creates a histogram of the data and overlays a kernel density estimate.

title 'Systolic Blood Pressure';
proc sgplot data=sashelp.heart;
   histogram Systolic;
   density Systolic / type=kernel;
   keylegend /location=inside;
run;

The kernel density estimate (KDE) is a nonparametric model. The histogram and the KDE indicate that the data are not normally distributed, and it is not clear whether the data can be modeled by using one of the common "named" probability distributions such as gamma or lognormal. In situations such as this, the metalog distribution can be an effective tool for modeling and simulation. The following SAS IML statements load the metalog library of functions, then fit a five-term unbounded metalog distribution to the data. After you fit the model, it is a good idea to run the ML_Summary subroutine, which displays a summary of the model and displays the parameter estimates.

proc iml;
load module=_all_;     /* load the metalog library */
use sashelp.heart;
   read all var "Systolic";
close;
 
MLObj = ML_CreateFromData(Systolic, 5);  /* 5-term unbounded model */
run ML_Summary(MLObj);

The ML_Summary subroutine displays two tables. The first table tells you that the model is a 5-term unbounded model, and it is feasible. Not every model is feasible, which is why it is important to examine the summary of the model. (Feasibility is discussed in a subsequent section.) The second table provides the parameter estimates for the model.

The MLObj variable is an object that encapsulates relevant information about the metalog model. After you create a metalog object, you can pass it to other functions to query, compute, or plot the model. For example, in PROC IML, you can use the ML_PlotECDF subroutine to show the agreement between the cumulative distribution of the model and the empirical cumulative distribution of the data, as follows:

title '5-Term Metalog Model for Systolic Blood Pressure';
run ML_PlotECDF(MLObj);

The model is fit to 5,209 observations. For this blog post, I drew the CDF of the model in red so that it would stand out. The model seems to fit the data well. If you prefer to view the density function, you can run the statement
    run ML_PlotPDF(MLObj);
Again, note that the MLObj variable is passed as the first argument to the function.

One way to assess the goodness of fit is to compare the sample quantiles to the quantiles predicted by the model. You can use the ML_Quantile function to obtain the predicted quantiles, as follows:

/* estimates of quantiles */
Prob = {0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99};
call qntl(SampleEst, Systolic, Prob);   /* sample quantiles */
ModelEst = ML_Quantile(MLObj, Prob);    /* model quantiles */
print Prob SampleEst ModelEst;

The model quantiles agree well with the sample quantiles. Note that the median estimate from the model (the 0.5 quantile) is equal to the a1 parameter estimate. This is always the case for a metalog model because every basis function except the constant term vanishes at p = 1/2 (both L(1/2) and c(1/2) are 0).

An advantage of the metalog model is that it has an explicit quantile function. As a consequence, it is easy to simulate from the model by using the inverse CDF method. You can use the ML_Rand function to simulate from the model, as follows:

/* simulate 500 new blood pressure values from the model */
call randseed(1234);
title 'Simulated Systolic Values';
BP = ML_Rand(MLObj, 500);
call histogram(BP);

Notice that the shape of this histogram is similar to the histogram of the original data. The simulated data seem to have properties similar to those of the original data.
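
If you want a numerical check in addition to the visual one, you can compare a few quantiles of the simulated values to the corresponding quantiles of the data, as in the following sketch, which continues the session above:

/* compare quantiles of the simulated values and the original data */
Prob = {0.25, 0.5, 0.75};
call qntl(qSim, BP, Prob);           /* quantiles of simulated values */
call qntl(qData, Systolic, Prob);    /* quantiles of original data */
print Prob qSim qData;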

The metalog documentation provides the syntax and documentation for every public method in the library.

Infeasible metalog functions

As discussed in a previous article, not every metalog model is a valid distribution. An infeasible model is one for which the quantile function is not monotone increasing. This can happen if you have a small data set that has extreme outliers or gaps, and you attempt to fit a metalog model that has many parameters. The model will overfit the data, which results in an infeasible model.

I was unable to create an infeasible metalog model for the blood-pressure data, which has thousands of observations and no extreme observations. However, if I use only the first 20 observations, I can demonstrate an infeasible model by fitting a six-term metalog model. The following SAS IML statements create a new data vector that contains only 20 observations. A six-term metalog model is fit to those data.

/* What if we use a much smaller data set? Set N = 20. */
S = Systolic[1:20];               /* N = 20 */
MLObj = ML_CreateFromData(S, 6);  /* 6-term model */
run ML_Summary(MLObj);

As shown by the output of the ML_Summary call, the six-term metalog model is not feasible. The following graph shows the quantile function of the model overlaid on the empirical quantiles. Notice that the quantile function of the model is not monotone increasing, which explains why the model is infeasible. If you encounter this situation in practice, Keelin (2016) suggests reducing the number of terms in the model.
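
You can also test feasibility directly: evaluate the quantile function of the model on a fine grid of probabilities and check that the values are increasing. The following statements are a sketch of that check:

/* sketch: test whether the model quantile function is monotone increasing */
p = T( do(0.001, 0.999, 0.001) );
q = ML_Quantile(MLObj, p);
isFeasible = all( q[2:nrow(q)] > q[1:nrow(q)-1] );   /* 1 if increasing */
print isFeasible;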

Summary

This article describes how to use the metalog distribution in SAS. The metalog distribution is a flexible family that can model a wide range of distributional shapes. You can download the Metalog package from the SAS Software GitHub site. The package contains documentation and a collection of SAS IML modules. You can call the modules in a SAS IML program to fit a metalog model. After fitting the model, you can compute properties of the model and simulate data from the model. More information and examples are available in the documentation of the Metalog package.

The post Use the metalog distribution in SAS appeared first on The DO Loop.

March 6, 2023

Undergraduate textbooks on probability and statistics typically prove theorems that show how the variance of a sum of random variables is related to the variance of the original variables and the covariance between them. For example, the Wikipedia article on Variance contains an equation for the sum of two random variables, X and Y:
\( \operatorname {Var} (X+Y)=\operatorname {Var} (X)+\operatorname {Var} (Y)+2\,\operatorname {Cov} (X,Y) \)

A SAS programmer wondered whether equations like this are also true for vectors of data. In other words, if X and Y are columns in a data set, do the sample variance and covariance statistics satisfy the same equation? How can you use SAS to verify (or disprove) the equation for a sample of data? This article shows how to use SAS to verify that the equation (and another similar equation) is valid for vectors of data.

It is possible to verify the equation by using PROC CORR and the DATA step, but it is much simpler to use SAS IML software because the IML language enables you to directly access cells of a variance-covariance matrix.

Create the sum of columns

Let's start by choosing some data to analyze. The following DATA step renames some variables in the Sashelp.Iris data set to X1, X2, and X3. The program also creates the variables Y12 = X1 + X2, Y13 = X1 + X3, and Y23 = X2 + X3. The program then reads the data into SAS IML matrices and computes the variance-covariance matrix for the original variables (X) and their pairwise sums (Y):

/* verify variance of a sum (or linear combination) of variables */
data Want;
set sashelp.iris(where=(Species="Versicolor"));
X1 = PetalLength;
X2 = PetalWidth;
X3 = SepalLength;
Y12 = X1 + X2;
Y13 = X1 + X3;
Y23 = X2 + X3;
keep X1-X3 Y12 Y13 Y23;
run;
 
proc iml;
use Want;
   read all var {'X1' 'X2' 'X3'} into X[c=XNames];
   read all var {'Y12' 'Y13' 'Y23'} into Y;
close;
YNames = {'X1+X2' 'X1+X3' 'X2+X3'};
 
SX = cov(X);
SY = cov(Y);
print SX[c=XNames r=XNames F=best5. L='Cov(X)'],
      SY[c=YNames r=YNames F=best5. L='Cov(Sums)'];

The variance of a sum

Let's use this information to verify the identity
\( \operatorname {Var} (X1+X2)=\operatorname {Var} (X1)+\operatorname {Var} (X2)+2\,\operatorname {Cov} (X1,X2) \)

From the displayed (rounded) values of the covariance matrices, you can mentally calculate that the equation could be true. The variances of the original variables are along the diagonal of the first matrix, and the covariances are the off-diagonal elements. The variance of the sum is the [1,1] cell of the second matrix. A little mental arithmetic indicates that 40.6 ≈ 22 + 4 + 2*7.

The following SAS IML statements extract the relevant values from cells in the variance-covariance matrices. The program then subtracts the right side of the equation from the left side. If the difference is 0, then the equation is verified:

/* Confirm that the sample variance of (x1+x2) satisfies
      Var(x1 +x2) = Var(x1) + Var(x2) + 2*Cov(x1, x2)
   See https://en.wikipedia.org/wiki/Variance#Propagation
*/
VarY12= SY[1,1];
VarX1 = SX[1,1];
VarX2 = SX[2,2];
Cov12 = SX[1,2];
Eqn1 = VarY12 - (VarX1 + VarX2 + 2*Cov12);
print Eqn1;

Success! The left and right sides of the equation are equal to numerical precision. This validates the equation for the data we are using. Notice that this same technique could be used to analyze the variance of a general linear combination of other variables.
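
For example, the following sketch continues the session above and verifies the analogous identity for the linear combination Z = a*X1 + b*X2, namely Var(Z) = a²Var(X1) + b²Var(X2) + 2ab Cov(X1,X2), for an arbitrary pair of coefficients:

/* verify the variance of a general linear combination */
a = 2;  b = -3;                    /* arbitrary coefficients */
Z = a*X[,1] + b*X[,2];
diffLC = var(Z) - (a##2*SX[1,1] + b##2*SX[2,2] + 2*a*b*SX[1,2]);
print diffLC;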

The covariance of sums

Let's verify one more equation. The Wikipedia article about Covariance states the following formula for four random variables, X, Y, Z, and W:
\( \operatorname {Cov} (X+Y,Z+W)=\operatorname {Cov} (X,Z)+\operatorname {Cov} (X,W)+\operatorname {Cov} (Y,Z)+\operatorname {Cov} (Y,W) \)

Since our example data only has three variables, let's simplify the equation by setting W=X. Then (changing variable names) the equation we want to verify is \( \operatorname {Cov} (X1+X2,X1+X3)=\operatorname {Cov} (X1,X3)+\operatorname {Cov} (X1,X1)+\operatorname {Cov} (X2,X3)+\operatorname {Cov} (X1,X2) \)

The following SAS IML statements extract the relevant cells from the covariance matrices. The program subtracts the right side of the equation from the left side and prints the difference.

/* Confirm that the sample covariance of (x1+x2) and (x1 + x3) satisfies
     Cov(x1+x2, x1+x3) = Cov(x1, x1) + Cov(x1, x2) + Cov(x1, x3) + Cov(x2, x3)
   See https://en.wikipedia.org/wiki/Covariance#Covariance_of_linear_combinations
*/
CovY12_Y13 = SY[1,2];
Cov11 = SX[1,1];      /* = Var(X1) */
Cov12 = SX[1,2];
Cov13 = SX[1,3];
Cov23 = SX[2,3];
Eqn2 = CovY12_Y13 - (Cov11 + Cov12 + Cov13 + Cov23);
print Eqn2;

Once again, the difference is essentially zero, which validates the equation.

Summary

A SAS programmer asked whether she could use SAS to verify certain equations. She wanted to verify that certain formulas for the variance and covariance of random variables are also true for sample statistics for empirical data. This article shows how to use SAS IML software to test certain equations that relate the variance and covariance of sums to the variances and covariances of a set of data variables.

The post The variance of the sums of variables appeared first on The DO Loop.

March 1, 2023

A SAS programmer wanted to compute the distribution of X-Y, where X and Y are two beta-distributed random variables. Pham-Gia and Turkkan (1993) derive a formula for the PDF of this distribution. Unfortunately, the PDF involves evaluating a two-dimensional generalized hypergeometric function, which is not available in all programming languages. Hypergeometric functions are not supported natively in SAS, but this article shows how to evaluate the generalized hypergeometric function for a range of parameter values. By using the generalized hypergeometric function, you can evaluate the PDF of the difference between two beta-distributed variables.

The difference between two beta random variables

Before doing any computations, let's visualize what we are trying to compute. Let X ~ Beta(a1, b1) and Y ~ Beta(a2, b2) be two beta-distributed random variables. We want to determine the distribution of the quantity d = X-Y. One way to approach this problem is by using simulation: Simulate random variates X and Y, compute the quantity X-Y, and plot a histogram of the distribution of d. Because each beta variable has values in the interval (0, 1), the difference has values in the interval (-1, 1).

The following DATA step generates 100,000 pairs of beta variates, X ~ Beta(0.5, 0.5) and Y ~ Beta(1, 1). Notice that X is U-shaped on (0,1), whereas Y is uniform on (0,1). The program computes the differences, and the histogram visualizes the distribution of d = X-Y:

/* Case 2 from Pham-Gia and Turkkan, 1993, p. 1765 */
%let a1 = 0.5;
%let b1 = 0.5;
%let a2 = 1;
%let b2 = 1;
%let N = 1E5;
data BetaDiff;
call streaminit(123);
do i = 1 to &N;
   x = rand("beta", &a1, &b1);
   y = rand("beta", &a2, &b2);
   d = x - y;
   output;
end;
drop i;
run;
 
title "Distribution of X-Y";
title2 "X~beta(&a1,&b1); Y~beta(&a2,&b2)";
proc sgplot data=BetaDiff;
   histogram d;
run;

For these values of the beta parameters, the distribution of the differences between the two beta variables looks like an "onion dome" that tops many Russian Orthodox churches in Ukraine and Russia. For other choices of parameters, the distribution can look quite different. The remainder of this article defines the PDF for the distribution of the differences. The formula for the PDF requires evaluating a two-dimensional generalized hypergeometric function.

Appell's hypergeometric function

A previous article discusses Gauss's hypergeometric function, which is a one-dimensional function that has three parameters. For certain parameter values, you can compute Gauss's hypergeometric function by computing a definite integral. The two-dimensional generalized hypergeometric function that is used by Pham-Gia and Turkkan (1993) is called Appell's hypergeometric function (denoted F1 by mathematicians). Appell's function can be evaluated by solving a definite integral that looks very similar to the integral encountered in evaluating the 1-D function. Appell's F1 contains four parameters (a,b1,b2,c) and two variables (x,y). In statistical applications, the variables and parameters are real-valued. For the parameter values c > a > 0, Appell's F1 function can be evaluated by computing the following integral:
\(F_{1}(a,b_{1},b_{2},c;x,y)={\frac {1}{B(a, c-a)}} \int _{0}^{1}u^{a-1}(1-u)^{c-a-1}(1-x u)^{-b_{1}}(1-y u)^{-b_{2}}\,du\)
where B(s,t) is the complete beta function, which is available in SAS by using the BETA function. Both arguments to the BETA function must be positive, so evaluating the BETA function requires that c > a > 0. Notice that the integration variable, u, does not appear in the answer. Appell's hypergeometric function is defined for |x| < 1 and |y| < 1. Notice that the integrand is unbounded when either x → 1 or y → 1 (assuming b1 > 0 and b2 > 0).

The following SAS IML program defines a function that uses the QUAD function to evaluate the definite integral, thereby evaluating Appell's hypergeometric function for the parameters (a,b1,b2,c) = (2,1,1,3). A table shows the values of the function at a few (x,y) points. The graph shows a contour plot of the function evaluated on the region [-0.95, 0.9] x [-0.95, 0.9].

/* Appell hypergeometric function of 2 vars 
   F1(a,b1,b2; c; x,y) is a function of (x,y) with parms = a // b1 // b2 // c;
   F1 is defined on the domain {(x,y) | |x|<1 and |y|<1}.
   You can evaluate F1 by using an integral for c > a > 0, as shown at
   https://en.wikipedia.org/wiki/Appell_series#Integral_representations
*/
proc iml;
/* Integrand for the QUAD function */
start F1Integ(u)  global(g_parm, g_x, g_y);
   a=g_parm[1]; b1=g_parm[2]; b2=g_parm[3]; c=g_parm[4];
   x=g_x; y=g_y;
   numer = u##(a-1) # (1-u)##(c-a-1);
   denom = (1-u#x)##b1 # (1-u#y)##b2;
   return( numer / denom );
finish;
 
/* Evaluate the Appell F1 hypergeometric function when c > a > 0
   and |x|<1 and |y|<1
   Pass in parm = {a, b1, b2, c} and
   z = {x1 y1,
        x2 y2,
        ...
        xn yn}; */
start AppellF1(z, parm)  global(g_parm, g_x, g_y);
   /* transfer parameters to global symbols */
   g_parm = parm;
   a = parm[1]; b1 = parm[2]; b2 = parm[3]; c = parm[4];
   n = nrow(z);
   F = j(n, 1, .);
   if a <= 0 | c <= a then do;
      /* print error message or use PrintToLOg function:
         https://blogs.sas.com/content/iml/2023/01/25/printtolog-iml.html */
      print "This implementation of the F1 function requires c > a > 0.",
            parm[r={'a','b1','b2','c'}];
      return( F );
   end;
   /* loop over the (x,y) values */
   do i = 1 to n;
      /* defined only when |x| < 1 */
      g_x = z[i,1]; g_y = z[i,2];
      if g_x^=. & g_y^=. & 
         abs(g_x)<1 & abs(g_y)<1 then do;  
         call quad(val, "F1Integ", {0 1}) MSG="NO";
         F[i] = val;
      end;
   end;
   return( F / Beta(a, c-a) );    /* scale the integral */
finish;
 
/* Example of calling AppellF1 */
parm = {2, 1, 1, 3};
xy = { 0.9  0,
       0.7  0.3,
      -0.5  0.2,
      -0.9 -0.5,
       0    0};
F1 = AppellF1(xy, parm);
print xy[L="" c={'x' 'y'}] F1;

The PDF of beta differences

Pham-Gia and Turkkan (1993) derive the PDF of the distribution for the difference between two beta random variables, X ~ Beta(a1,b1) and Y ~ Beta(a2,b2). The PDF is defined piecewise. There are different formulas, depending on whether the difference, d, is negative, zero, or positive. The formulas use powers of d, (1-d), (1-d²), the Appell hypergeometric function, and the complete beta function. The formulas are specified in the following program, which computes the PDF. Notice that linear combinations of the beta parameters are used to construct the parameters for Appell's hypergeometric function.

/* Use Appell's hypergeometric function to evaluate the PDF
   of the distribution of the difference X-Y between 
   X ~ Beta(a1,b1) and Y ~ Beta(a2,b2)
*/
start PDFBetaDiff(_d, _parm);
   a1=_parm[1]; b1=_parm[2]; a2=_parm[3]; b2=_parm[4];
   n = nrow(_d)*ncol(_d);
   PDF = j(n, 1, .);
   if a1<=0 | b1<=0 | a2<=0 | b2<=0 then do;
      print "All parameters must be positive", a1 b1 a2 b2;
      return(PDF);
   end;
   A = Beta(a1,b1) * Beta(a2,b2);
   do i = 1 to n;
      d = _d[i];
      *print d a1 b1 a2 b2;
      if d = . then 
         PDF[i] = .;
      else if abs(d) < 1 then do;   /* use ELSE: in SAS, a missing value satisfies abs(d) < 1 */
         /* Formulas from Pham-Gia and Turkkan, 1993 */
         if abs(d)<1e-14 then do;
            if a1+a2 > 1 & b1+b2 > 1 then 
               PDF[i] = Beta(a1+a2-1, b1+b2-1) / A;
            *print "d=0" (a1+a2-1)[L='a1+a2-1'] (b1+b2-1)[L='b1+b2-1'] (PDF[i])[L='PDF'];
         end;
         else if d > 0 then do;  /* (0,1] */
            parm = b1            //
                   a1+b1+a2+b2-2 //
                   1-a1          //
                   a2+b1;
            xy = 1-d || 1-d##2;
            F1 = AppellF1(xy, parm);
            PDF[i] = Beta(a2,b1) # d##(b1+b2-1) # (1-d)##(a2+b1-1) # F1 / A;
            *print "d>0" xy (PDF[i])[L='PDF'];
         end;
         else do;                /* [-1,0) */
            parm = b2            //
                   1-a2          //
                   a1+b1+a2+b2-2 //
                   a1+b2;
            xy = 1-d##2 || 1+d;
            F1 = AppellF1(xy, parm);
            PDF[i] = Beta(a1,b2) # (-d)##(b1+b2-1) # (1+d)##(a1+b2-1) # F1 / A;
            *print "d<0" xy (PDF[i])[L='PDF'];
         end;
      end;
   end;
   return( PDF );
finish;
 
print "*** Case 2 in Pham-Gia and Turkkan, p. 1767 ***";
a1 = 0.5;
b1 = 0.5;
a2 = 1;
b2 = 1;
parm = a1//b1//a2//b2;
 
d = {-0.9, -0.5, -0.25, 0, 0.25, 0.5, 0.9};
PDF = PDFBetaDiff(d, parm);
print d PDF;

Notice that the parameters are the same as in the simulation earlier in this article. The following graph visualizes the PDF on the interval (-1, 1):

/* graph the distribution of the difference */
d = T(do(-0.975, 0.975, 0.025));
PDF = PDFBetaDiff(d, parm);
 
title "X-Y for X ~ Beta(0.5,0.5) and Y ~ Beta(1,1)";
title2 "Fig 1, p. 1766";
call series(d, PDF) grid={x y};

The PDF, which is defined piecewise, shows the "onion dome" shape that was noticed for the distribution of the simulated data. The following graph overlays the PDF and the histogram to confirm that the two graphs agree.
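
The overlay code is not shown above, but here is one way to produce the graph (a sketch, assuming the IML session from the previous block is still active): write the (d, PDF) pairs to a data set, concatenate it with the simulated data, and overlay the curve on a density-scaled histogram:

/* sketch: overlay the PDF curve on the histogram of simulated differences */
M = d || PDF;
create PDFout from M[colname={'t' 'PDF'}];
append from M;
close PDFout;
 
submit;   /* run the DATA step and PROC SGPLOT from inside PROC IML */
data All;
   set BetaDiff PDFout;       /* combine simulated data and the PDF curve */
run;
proc sgplot data=All noautolegend;
   histogram d / scale=density;
   series x=t y=PDF;
run;
endsubmit;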

Not every combination of beta parameters results in a non-smooth PDF. For example, if you define X ~ beta(3,5) and Y ~ beta(2, 8), then you can compute the PDF of the difference, d = X-Y, by changing the parameters as follows:

/* Case 5 from Pham-Gia and Turkkan, 1993, p. 1767 */
a1 = 3;
b1 = 5;
a2 = 2;
b2 = 8;
parm = a1//b1//a2//b2;

If you rerun the simulation and overlay the PDF for these parameters, you obtain the following graph:

Summary

The distribution of X-Y, where X and Y are two beta-distributed random variables, has an explicit formula (Pham-Gia and Turkkan, 1993). Unfortunately, the PDF involves evaluating a two-dimensional generalized hypergeometric function, which is a complicated special function. Hypergeometric functions are not supported natively in SAS, but this article shows how to evaluate the generalized hypergeometric function for a range of parameter values, which enables you to evaluate the PDF of the difference between two beta-distributed variables.


References

Pham-Gia, T. and Turkkan, N. (1993), "Bayesian Analysis of the Difference of Two Proportions," Communications in Statistics: Theory and Methods, 22(6), 1755-1771.

The post The distribution of the difference between two beta random variables appeared first on The DO Loop.

January 25, 2023

I previously discussed how to use the PUTLOG statement to write a message from the DATA step to the log in SAS. The PUTLOG statement is commonly used to write notes, warnings, and errors to the log. This article shows how to use the PRINTTOLOG subroutine in SAS IML software to write messages to the log. The PRINTTOLOG subroutine was released in 2020 in SAS IML on SAS Viya. It runs in PROC IML and in the iml action on Viya. The PRINTTOLOG subroutine is not part of SAS 9.4, but I show how to construct a SAS/IML module for SAS 9.4 that has the same syntax and functionality.

The PrintToLog subroutine in SAS IML on Viya

The syntax of the subroutine is
    call PrintToLog(msg, flag);
The flag parameter is optional. If you omit that argument, the message is displayed in the log. If you use the flag parameter, the valid values are 0, 1, and 2:

  • If flag is 0, the string "NOTE: " is prepended to the message. For example, call PrintToLog("This is a note.", 0);.
  • If flag is 1, the string "WARNING: " is prepended to the message. For example, call PrintToLog("This is a warning.", 1);.
  • If flag is 2, the string "ERROR: " is prepended to the message. For example, call PrintToLog("This is an error.", 2);.

I will illustrate the routine by using an IML version of the program that I used in my previous post about the PUTLOG statement. The input to the program is a three-column matrix. Each row of the matrix contains the coefficients (a, b, c) for the quadratic equation a x² + b x + c = 0. The program is vectorized. It uses the discriminant of the quadratic equation to determine whether the quadratic equation has any real roots. If so, the program uses the quadratic formula to find the roots. To demonstrate the PrintToLog function, the program does the following:

  • Display an ERROR if any coefficient of the quadratic term is 0.
  • Display a WARNING if any quadratic equation does not have any real roots.
  • Display a NOTE if any quadratic equation has a repeated root.

proc iml;
/* NOTE: The PrintToLog subroutine is available in SAS IML in Viya. 
   It is not available as a built-in subroutine for SAS/IML in SAS 9.4. */
Q = { 1 -1 -6,    /* each row is (a,b,c) */
      1  7  6,
      0  7  6,
      2  7  7,
     -2  5 12,
     -2  4 -2,
      5  4 10};
a = Q[,1]; b = Q[,2]; c = Q[,3];
Root1 = j(nrow(a), 1, .); /* initialize roots to missing */
Root2 = j(nrow(a), 1, .); /* initialize roots to missing */
 
if any(a=0) then
   call PrintToLog("The coefficient of the quadratic term is 0 for at least one observation", 2); /* display ERROR */
 
discrim = b##2 - 4*a#c;
if any(discrim < 0) then 
   call PrintToLog("At least one equation does not have real roots.", 1); /* display WARNING */
if any(discrim=0) then 
   call PrintToLog("At least one equation has repeated roots.", 0); /* display NOTE */
 
idx = loc(discrim >= 0 & a^=0);
if ncol(idx)>0 then do;
	Root1[idx] = (-b[idx] - sqrt(discrim[idx])) / (2*a[idx]);
	Root2[idx] = (-b[idx] + sqrt(discrim[idx])) / (2*a[idx]);
end;
print a b c discrim Root1 Root2;

The SAS log contains the following message, which is color-coded according to whether the message is an error (red color), a warning (green color), or a note (blue color).

For completeness, the output of the program is shown below.

A PrintToLog module for SAS 9.4

Although the PrintToLog subroutine is not a built-in subroutine in SAS 9.4, you can define a module that has the same functionality. The definition uses two SAS/IML techniques (a default value for an optional parameter, and the CALL EXECUTE subroutine, which runs a %PUT statement that writes to the log) and one macro trick (testing the SYSVER automatic macro variable to detect whether the program is running in SAS 9.4):

The following SAS macro defines a SAS/IML module in SAS 9.4, but not in SAS Viya. Thus, you can run this macro in any PROC IML program. If the program runs in Viya (where the PrintToLog subroutine is already defined), the macro does nothing. If the program runs in SAS 9.4, the macro defines a PrintToLog module that you can use to write messages to the log.

/* Check the SYSVER macro to see if SAS 9.4 is running.
   In SAS Viya, the macro is empty and does nothing.
   In SAS 9.4, the macro defines a function that emulates the PrintToLog call.
   The syntax is as follows:
   call PrintToLog("This is a log message.");
   call PrintToLog("This is a note.", 0);
   call PrintToLog("This is a warning.", 1);
   call PrintToLog("This is an error.", 2);
*/
%macro DefinePrintToLog;
%if %sysevalf(&sysver = 9.4) %then %do;
start PrintToLog(msg,errCode=-1);
   if      errCode=0 then prefix = "NOTE: ";
   else if errCode=1 then prefix = "WARNING: ";
   else if errCode=2 then prefix = "ERROR: ";
   else prefix = "";
   stmt = '%put ' + prefix + msg + ';';
   call execute(stmt);
finish;
%end;
%mend;
 
/* this program runs in SAS 9.4 or in SAS Viya */
proc iml;
%DefinePrintToLog;
call PrintToLog("This is a log message.");
call PrintToLog("This is a note.", 0);
call PrintToLog("This is a warning.", 1);
call PrintToLog("This is an error.", 2);

The output is shown for SAS 9.4, where the %DefinePrintToLog macro defines the SAS/IML module.

Summary

The PRINTTOLOG subroutine gives SAS IML programmers a way to write messages to the log. In SAS Viya, this subroutine is built-in and can be used from PROC IML or from the iml action. This article shows how to define a module for PROC IML in SAS 9.4 that has similar functionality.

The post Write to the log from SAS IML programs appeared first on The DO Loop.

November 30, 2022

A probabilistic card trick is a trick that succeeds with high probability and does not require any skill from the person performing the trick. I have seen a certain trick mentioned several times on social media. I call it "ladders" or the "ladders game" because it reminds me of the childhood board game Chutes and Ladders (previously known as Snakes and Ladders in India and the UK). In Chutes and Ladders, some squares on the board tell the player to move forward (go up a ladder) or to move backward (go down a chute). In the "ladders" game, you only move forward.

This article uses Monte Carlo simulation in SAS to show that the "ladders" card trick "works" about 98.2% of the time.

I have been told that Martin Gardner wrote about this game. He referred to mathematical card tricks as those that "do not require prestidigitation," or sleight of hand.

The ladders card trick

The rules of the game and a Python simulation are presented in a blog post by Bob Carpenter, who saw the trick performed at a children's museum. The performer shuffles an ordinary deck of 52 playing cards and places them face up in a row on a table. The performer then describes the game of ladders:

  • A player starts by pointing to any card in the first five positions.
  • The value of the card indicates the number of cards to move forward. If the card is 2, 3, ..., 9, you move forward that many cards. If the card is a 10, jack, queen, king, or ace, you move forward one space.
  • From the new position, apply the rule again. Repeat this process until you cannot advance any more.

The performer selects two volunteers from the audience and tells them each to point randomly to any card from among the first five cards. Each player does so. The performer then announces that both players will end up on a certain "magical" card and he tells everyone the name of the magical card. The players then follow the game rules. Amazingly, each person's path terminates on the magical card that was previously announced. The trick works about 98.2% of the time.

To illustrate the ladders game, first consider a simpler situation of a deck that contains only 13 cards, one for each card value (for example, use only spades). One random shuffle of the 13 cards is shown to the right. For this arrangement of cards, the performer announces that both volunteers will end up on the 7 card (the magical card for this shuffle), which is the 11th card in the row of cards. Suppose the first volunteer points to the ace in the first position as her starting card. The rules of the game dictate the following sequence of moves:

  • From the A, the player moves forward one card and lands on the J.
  • From the J, the player moves forward one card and lands on the 2.
  • From the 2, the player moves forward two cards and lands on the 6.
  • From the 6, the player moves forward six cards and lands on the 7. The player cannot move forward seven spaces, so the 7 is her final card, as predicted in advance.

Thus, the path for the first player is the sequence of positions (1, 2, 3, 5, 11).

What happens for the other player? Well, if the second volunteer chooses his initial card at positions 2, 3, or 5, then his path will instantly coincide with the path of the first player, so his path, too, terminates at the 7 card in the 11th position. If the second volunteer chooses his initial card at position 4, then his path hits positions 4, 8, and 11. This path also terminates at the 7 card in the 11th position. Thus, for this set of cards and this shuffle, all initial positions follow a path that eventually lands on the 7.

How does the performer know the name of the final magical card? As he deals, he secretly counts the path determined by the first card in the deck.

Simulate the ladders game in SAS

You can use Monte Carlo simulation to estimate the probability that the ladders card trick "works." To keep it simple, let's say the trick "works" when the paths of the two players converge. That is, the players initially start on two random positions (1-5), but they end up on the same final card after they each follow the path dictated by their initial position. The simulation for the ladders card trick has the following major steps:

  1. Set up the card deck. For each card, assign the number of steps to move forward.
  2. Create a function that computes the final position, given the initial position and the sequence of moves determined by the shuffled deck.
  3. Run the Monte Carlo simulation. This consists of several sub-steps:
    • Shuffle the deck (that is, shuffle the array of moves).
    • Use sampling without replacement to choose two initial positions.
    • Compute the final card for each initial position.
    • Record whether the two paths converged.
  4. Estimate the probability of the trick working as the proportion of times that the paths converged in the simulation.

As is often the case, this simulation is easiest to perform in SAS by using the SAS/IML language. You can use the RANPERM function to shuffle the deck. You can use the SAMPLE function to sample without replacement from the values 1-5. You can write a function that computes the final position based on the initial position. The following program is one way to implement a Monte Carlo simulation for the ladders card trick:

/* The Ladders card trick:
   https://statmodeling.stat.columbia.edu/2022/10/07/mira-magic-a-probabilistic-card-trick/
   Lay out a shuffled deck of cards.
   Choose one of the first five cards as a starting point.
   Move down the deck as follows:
   If the current card is 2-9, move that many cards forward.
   For other cards, move one space forward.
*/
proc iml;
/* 1. set up the card deck and assign values. 
   Note: We don't actually need the deck, only the instructions
   about how to move. */
values = {A,'2','3','4','5','6','7','8','9','10', J, Q, K};
advance= {1,  2,  3,  4,  5,  6,  7,  8,  9,  1,  1, 1, 1};
deck  = repeat(values, 4);   /* full deck of 52 cards */
moves = repeat(advance, 4);  /* corresponding moves */
nCards = nrow(moves);
 
/* 2. Find the final position, given the first position and 
   the vector of moves for the current shuffled deck.
   startPos : a scalar value 1-5
   Move : a vector of values. If your current card is in 
          position i, the next card is position (i + Move[i])
*/
start EndPos(startPos, Move);
   Pos = startPos;
   do while(Pos + Move[Pos] <= nrow(Move));  /* can advance only if the move stays within the deck */
      Pos = Pos + Move[Pos];
   end;
   return( Pos );
finish;
 
/* 3. Run the Monte Carlo simulation:
      A. Shuffle the deck (vector of moves)
      B. Use sampling w/o replacement to choose two initial positions
      C. Compute the final card for each initial position
      D. Record whether the two paths converged
*/
call randseed(1234);
NSim = 1E5;
match = j(NSim,1);
do j = 1 to nSim;
   k = ranperm(nCards);          /* shuffle the deck */
   Move = moves[k];
   /* choose two different starting positions in 1-5 */
   startPos = sample(1:5, 2, "NoReplace");
   EndPos1 = EndPos(startPos[1], Move);
   EndPos2 = EndPos(startPos[2], Move);
   match[j] = (EndPos1=EndPos2);
end;
 
/* 4. Estimate probability as the proportion of matches (1s) in a binary vector.
      Optional: Compute a Wald CL for the proportion */
ProbMatch = mean(match);
StdErr = sqrt( ProbMatch*(1-ProbMatch)/ NSim );
CL = ProbMatch + {-1 1}*1.96*StdErr;
print NSim[F=comma9.] ProbMatch StdErr CL[c={'LCL' 'UCL'}];

The simulation shows that the two players end up on the same card more than 98% of the time.

Summary

This article describes a probabilistic card trick, which I call the game of ladders. Two players choose random cards as an initial condition, then follow the rules of the game. For about 98.2% of the deals and initial positions, the paths of the two players will converge to the same card. As the performer is laying out the card, he can predict in advance that the paths will converge and, with high probability, announce the name of the final card.

Bob Carpenter's blog post mentions the connection between this card trick and Markov chains. He also presents several variations, such as using more cards in the deck.

The post Ladders: A probabilistic card trick appeared first on The DO Loop.

November 21, 2022

Recently, I needed to know "how much" of a piecewise linear curve is below the X axis. The coordinates of the curve were given as a set of ordered pairs (x1,y1), (x2,y2), ..., (xn, yn). The question is vague, so the first step is to define the question better. Should I count the number of points on the curve for which the Y value is negative? Should I use a weighted sum and add up all the negative Y values? Ultimately, I decided that the best measure of "negativeness" for my application is to compute the area that lies below the line Y=0 and above the curve. In calculus, this would be the "negative area" of the curve. Because the curve is piecewise linear, you can compute the area exactly by using the trapezoid rule of integration.

An example is shown to the right. The curve is defined by 12 ordered pairs. The goal of this article is to compute the area shaded in blue. This is the "negative area" with respect to the line Y=0. With very little additional effort, you can generalize the computation to find the area below any horizontal line and above the curve.

Area defined by linear segments

The algorithm for computing the shaded area is simple. For each line segment along the curve, let [a,b] be the interval defined by the left and right abscissas (X values). Let f(a) and f(b) be the corresponding ordinate values. Then there are four possible cases for the positions of f(a) and f(b) relative to the horizontal reference line, Y=0:

  • Both f(a) and f(b) are above the reference line. In this case, the area between the line segment and the reference line is positive. We are not interested in this case for this article.
  • Both f(a) and f(b) are below the reference line. In this case, the "negative area" can be computed as the area of a trapezoid: \(A = 0.5 (b - a) (f(b) + f(a))\).
  • The value f(a) is below the reference line, but f(b) is above the line. In this case, the "negative area" can be computed as the area of a triangle. You first solve for the location, c, at which the line segment intersects the reference line. The negative area is then \(A = 0.5 (c - a) f(a)\).
  • The value f(a) is above the reference line and f(b) is below the line. Again, the relevant area is a triangle. Solve for the intersection location, c, and compute the negative area as \(A = 0.5 (b - c) f(b)\).

The three cases for negative area are shown in the next figure:

You can easily generalize these formulas if you want the area above the curve and below the line Y=t. In every formula that includes f(a), replace that value with (f(a) – t). Similarly, replace f(b) with (f(b) – t).

Compute the negative area

The simplest computation for the negative area is to loop over all n points on the line. For the i_th point (1 ≤ i < n), let [a,b] be the interval [x[i], x[i+1]] and apply the formulas in the previous section. Since we skip any intervals for which f(a) and f(b) are both positive, we can exclude the point (x[i], y[i]) if y[i-1], y[i], and y[i+1] are all positive. This is implemented in the following SAS/IML function. By default, the function returns the area below the line Y=0 and the curve. You can use an optional argument to change the value of the horizontal reference line.

proc iml;
/* compute the area below the line y=y0 for a piecewise
   linear function with vertices given by (x[i],y[i]) */
start AreaBelow(x, y, y0=0);
   n = nrow(x);
   idx = loc(y<y0);                /* find indices for which y[i] < y0 */
   if ncol(idx)=0 then return(0);
 
   k = unique(idx-1, idx, idx+1);  /* we need indices before and after */
   jdx = loc(k > 0 & k <  n);      /* restrict to indices in [1, n-1] */
   v = k[jdx];                     /* a vector of the relevant vertices */
 
   NegArea = 0;
   do j = 1 to nrow(v);            /* loop over intervals where f(a) or f(b) negative */
      i = v[j];                    /* get j_th index in the vector v */
      fa = y[i]-y0; fb = y[i+1]-y0;/* signed distance from cutoff line */
      if fa > 0 & fb > 0 then 
         ;                         /* segment is above cutoff; do nothing */
      else do;
         a  = x[i];  b = x[i+1];
         if fa < 0 & fb < 0 then do;  /* same sign, use trapezoid rule */
            Area = 0.5*(b - a) * (fb + fa);
         end;
         /* different sign, f(a) < 0, find root and use triangle area */
         else if fa < 0 then do;  
            c = a - fa * (b - a) / (fb - fa);
            Area = 0.5*(c - a)*fa;
         end;
         /* different sign, f(b) < 0, find root and use triangle area */
         else do;
            c = a - fa * (b - a) / (fb - fa);
            Area = 0.5*(b - c)*fb;
         end;
         NegArea = NegArea + Area;
      end;
   end;
   return( NegArea );
finish;
 
/* points along a piecewise linear curve */
x = {   1,    2, 3.5,   4,5,     6, 6.5,   7,   8,  10, 12,  15};
y = {-0.5, -0.1, 0.2, 0.7,0.8,-0.2, 0.3, 0.6, 0.3, 0.1,-0.4,-0.6};
 
/* compute area under the line Y=0 and above curve (="negative area") */
NegArea = AreaBelow(x,y);
print NegArea;

The program defines the AreaBelow function and calls the function for the piecewise linear curve that is shown at the top of this article. The output shows that the area of the shaded regions is -2.185.

The area between a curve and a reference line

I wrote the program in the previous section so that the default behavior is to compute the area below the reference line Y=0 and above a curve. However, you can use the third argument to specify the value of the reference line. For example, the following statements compute the areas below the lines Y=0.1 and Y=0.2, respectively:

Area01 = AreaBelow(x,y, 0.1);      /* reference line Y=0.1 */
Area02 = AreaBelow(x,y, 0.2);      /* reference line Y=0.2 */
print Area01 Area02;
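
As a quick sanity check of the generalization described earlier, shifting the curve down by t and using the default reference line Y=0 must give the same result as passing t as the third argument. The following statements are a small check I added (they assume the vectors and the AreaBelow module defined above):

Shift01 = AreaBelow(x, y - 0.1);   /* shift curve down by 0.1; use default line Y=0 */
print Area01 Shift01;              /* the two computations are identical */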

As you would expect, the magnitude of the area increases as the reference line moves up. In fact, you can visualize the area below the line Y=t and above the curve as a function of t. Simply call the AreaBelow function in a loop for a sequence of increasing t values and plot the results:

t = do(-0.7, 1.0, 0.05);
AreaUnder = j(1,ncol(t));
do i = 1 to ncol(t);
   AreaUnder[i] = AreaBelow(x, y, t[i]);
end;
title "Area Under Reference Line";
call series(t, AreaUnder) label={'Reference Level' 'Area'} grid={x y} xvalues=do(-0.7, 1.0, 0.1);

The graph shows that the area is a monotonically decreasing function of the reference level, t. If the reference line is below the minimum value of the curve, the area is zero. As you increase t, the area below the line Y=t and above the curve increases in magnitude. After the reference line exceeds the maximum value along the curve (Y=0.8 for this example), the magnitude of the area increases linearly. It is difficult to see in the graph above, but the curve is actually linear for t ≥ 0.8.
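
If you want to verify the linearity numerically, you can compute finite-difference slopes of the area function for t ≥ 0.8. The following statements are a minimal sketch (my addition, not part of the original program) that assumes the t and AreaUnder vectors from the previous step. When the reference line is at or above the maximum of the curve, every slope should equal -(x[n] - x[1]) = -14:

A = colvec( AreaUnder[ loc(t >= 0.8) ] );         /* areas for levels at or above the curve's maximum */
slopes = (A[2:nrow(A)] - A[1:nrow(A)-1]) / 0.05;  /* finite-difference slopes; the step size is 0.05 */
print slopes;                                     /* each slope is -(15 - 1) = -14 */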

Summary

You can use numerical integration to determine "how much" of a function is negative. If the function is piecewise linear, the integral over the negative intervals can be computed by using the trapezoid rule. This article shows how to compute the area between a reference line Y=t and a piecewise linear curve. When t=0, this is the "negative area" of the curve.

Incidentally, this article is motivated by the same project that inspired me to write about how to test whether a function is monotonically increasing. If a function is monotonically increasing, then its derivative is never negative. Therefore, another way to test a function for monotonic increase is to test whether the derivative is never negative. A way to measure how far a function deviates from being monotonic is to compute the "negative area" of its derivative.
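
As a concrete illustration (my sketch, not from the original project), you can apply the AreaBelow function to a finite-difference approximation of the derivative of the piecewise linear curve from earlier. The statements below assume the AreaBelow module and the x and y vectors are still defined:

n = nrow(x);
slope = (y[2:n] - y[1:n-1]) / (x[2:n] - x[1:n-1]);  /* slope of the curve on each interval */
xmid  = (x[2:n] + x[1:n-1]) / 2;                    /* interval midpoints */
NonMono = AreaBelow(xmid, slope);   /* "negative area" of the approximate derivative */
print NonMono;                      /* 0 when the curve never decreases; negative otherwise */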

The post The area under a piecewise linear curve appeared first on The DO Loop.

October 24, 2022
 

I recently gave a presentation about the SAS/IML matrix language in which I emphasized that a matrix language enables you to write complex analyses by using only a few lines of code. In the presentation, I used least squares regression as an example. One participant asked how many additional lines of code would be required for binary logistic regression. I think my answer surprised him. I claimed it would take about a dozen lines of code to obtain parameter estimates for logistic regression. This article shows how to obtain the parameter estimates for a logistic regression model "manually" by using maximum likelihood estimation.

Example data and logistic regression model

Of course, if you want to fit a logistic regression model in SAS, you should use PROC LOGISTIC or another specialized regression procedure. The LOGISTIC procedure not only gives parameter estimates but also produces related statistics and graphics. However, implementing a logistic regression model from scratch is a valuable exercise because it enables you to understand the underlying statistical and mathematical principles. It also enables you to understand how to generalize the basic model. For example, with minimal work, you can modify the program to support binomial data that represent events and trials.

Let's start by running PROC LOGISTIC on data from the PROC LOGISTIC documentation. The response variable, REMISS, indicates whether there was cancer remission in each of 27 cancer patients. There are six variables that might be important in predicting remission. The following DATA step defines the data. The subsequent call to PROC LOGISTIC fits a binary logistic model to the data. The procedure produces a lot of output, but only the parameter estimates are shown:

data Remission;
   input remiss cell smear infil li blast temp;
   label remiss='Complete Remission';
   datalines;
1   .8   .83  .66  1.9  1.1     .996
1   .9   .36  .32  1.4   .74    .992
0   .8   .88  .7    .8   .176   .982
0  1     .87  .87   .7  1.053   .986
1   .9   .75  .68  1.3   .519   .98
0  1     .65  .65   .6   .519   .982
1   .95  .97  .92  1    1.23    .992
0   .95  .87  .83  1.9  1.354  1.02
0  1     .45  .45   .8   .322   .999
0   .95  .36  .34   .5  0      1.038
0   .85  .39  .33   .7   .279   .988
0   .7   .76  .53  1.2   .146   .982
0   .8   .46  .37   .4   .38   1.006
0   .2   .39  .08   .8   .114   .99
0  1     .9   .9   1.1  1.037   .99
1  1     .84  .84  1.9  2.064  1.02
0   .65  .42  .27   .5   .114  1.014
0  1     .75  .75  1    1.322  1.004
0   .5   .44  .22   .6   .114   .99
1  1     .63  .63  1.1  1.072   .986
0  1     .33  .33   .4   .176  1.01
0   .9   .93  .84   .6  1.591  1.02
1  1     .58  .58  1     .531  1.002
0   .95  .32  .3   1.6   .886   .988
1  1     .6   .6   1.7   .964   .99
1  1     .69  .69   .9   .398   .986
0  1     .73  .73   .7   .398   .986
;
 
proc logistic data=remission;
   model remiss(event='1') = cell smear infil li blast temp;
   ods select ParameterEstimates;
run;

The next section shows how to obtain the parameter estimates "manually" by performing a maximum-likelihood computation in the SAS/IML language.

Logistic regression from scratch

Suppose you want to write a program to obtain the parameter estimates for this model. Notice that the regressors in the model are all continuous variables, and the model contains only main effects, which makes forming the design matrix easy. To fit a logistic model to data, you need to perform the following steps:

  1. Read the data into a matrix and construct the design matrix by appending a column of 1s to represent the Intercept variable.
  2. Write the loglikelihood function. This function takes a vector of parameters (b) as input and evaluates the loglikelihood for the binary logistic model, given the data. The Penn State course in applied regression analysis explains the model and how to derive the loglikelihood function. The function is
    \(\ell(\beta) = \sum_{i=1}^{n}[y_{i}\log(\pi_{i})+(1-y_{i})\log(1-\pi_{i})]\)
    where \(\pi_i\) is the i_th component of the logistic transformation of the linear predictor: \(\pi(\beta) = \frac{1}{1+\exp(-X\beta)}\)
  3. Make an initial guess for the parameters and use nonlinear optimization to find the maximum likelihood estimates. The SAS/IML language supports several optimization methods. I use the Newton-Raphson method, which is implemented by using the NLPNRA subroutine.

The following SAS/IML program implements these three steps:
proc iml;
/* 1. read data and form design matrix */
varNames = {'cell' 'smear' 'infil' 'li' 'blast' 'temp'};
use remission;
read all var "remiss" into y;   /* read response variable */
read all var varNames into X;   /* read explanatory variables */
close;
X = j(nrow(X), 1, 1) || X;     /* design matrix: add Intercept column */
 
/* 2. define loglikelihood function for binary logistic model */
start BinLogisticLL(b) global(X, y);
   z = X*b`;                   /* X*b, where b is a column vector */
   p = Logistic(z);            /* 1 / (1+exp(-z)) */
   LL = sum( y#log(p) + (1-y)#log(1-p) );   
   return( LL );
finish;
 
/* 3. Make initial guess and find parameters that maximize the loglikelihood */
b0 = j(1, ncol(X), 0);         /* initial guess */
opt = 1;                       /* find maximum of function */
call nlpnra(rc, b, "BinLogisticLL", b0, opt);   /* use Newton's method to find b that maximizes the LL function */
print b[c=("Intercept"||varNames) L="Parameter Estimates" F=D8.];
QUIT;

The parameter estimates from the MLE are similar to the estimates from PROC LOGISTIC. When performing nonlinear optimization, different software uses different methods and convergence criteria, so you will rarely get exactly the same parameter estimates when you solve the same problem by using different methods. For these data, the relative difference in each parameter estimate is ~1E-4 or less.
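
If you want to quantify the relative differences, you can capture both sets of estimates and compare them. The following statements are a sketch of one way to do so; the data set names (PE and MLEst) are my own choices, and the sketch assumes that you add the statements create MLEst from b; append from b; close MLEst; immediately before the QUIT statement in the previous program:

/* save the PROC LOGISTIC estimates to a data set */
proc logistic data=remission;
   model remiss(event='1') = cell smear infil li blast temp;
   ods output ParameterEstimates=PE;
run;
 
proc iml;
use PE;    read all var {"Estimate"} into bLogistic;  close;  /* column vector of estimates */
use MLEst; read all var _NUM_ into bMLE;              close;  /* 1 x 7 row vector from the MLE program */
relDiff = abs(bMLE` - bLogistic) / abs(bLogistic);    /* elementwise relative differences */
print relDiff[L="Relative Differences in Estimates"];
QUIT;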

I didn't try to minimize the number of lines in the program. As written, the number of executable lines is 16. If you shorten the program by combining lines (for example, p = Logistic(X*b`);), you can get the program down to 10 lines.

Of course, a program should never be judged solely by the number of lines it contains. I can convert any program to a "one-liner" by wrapping it in a macro! I prefer to ask whether the program mimics the mathematics underlying the analysis. Can the statements be understood by someone who is familiar with the underlying mathematics but might not be an expert programmer? Does the language provide analysts with the tools they need to perform and combine common operations in applied math and statistics? I find that the SAS/IML language succeeds in all these areas.

Logistic regression with events-trials syntax

As mentioned previously, implementing the regression estimates "by hand" not only helps you to understand the problem better, but also enables you to modify the basic program. For example, the syntax of PROC LOGISTIC enables you to analyze binomial data by using the events-trials syntax. When you have binomial data, each observation contains a variable (EVENTS) that indicates the number of successes and a variable (TRIALS) that indicates the total number of trials for a specified value of the explanatory variables. The data for this example are explained in the Getting Started documentation for PROC LOGISTIC. The data are from a designed experiment. For each combination of levels of the independent variables (HEAT and SOAK), an observation records the number of items that were ready for further processing (R) out of the total number (N) of items that were tested. Thus, each observation represents N items for which R items are events (successes) and N-R are nonevents (failures).

The following DATA step defines the binomial data and calls PROC LOGISTIC to fit a logistic model:

/* Getting Started example for PROC LOGISTIC */
data ingots;
   input Heat Soak r n @@;
   datalines;
7 1.0 0 10  14 1.0 0 31  27 1.0 1 56  51 1.0 3 13
7 1.7 0 17  14 1.7 0 43  27 1.7 4 44  51 1.7 0  1
7 2.2 0  7  14 2.2 2 33  27 2.2 0 21  51 2.2 0  1
7 2.8 0 12  14 2.8 0 31  27 2.8 1 22  51 4.0 0  1
7 4.0 0  9  14 4.0 0 19  27 4.0 1 16
;
 
proc logistic data=ingots;
   model r/n = Heat Soak;
   ods select ParameterEstimates;
run;

Can you modify the previous SAS/IML program to accommodate binomial data? Yes! The key is to "expand the data." Recall that each observation represents N items: R of them are events (successes) and N-R are nonevents (failures). You can therefore reuse the previous program if you introduce a "frequency variable" (FREQ) that indicates the number of items that are represented by each observation.

The following program is a slight modification of the original program. The FREQ variable is derived from the EVENTS and TRIALS variables. The data are duplicated so that each observation in the binomial data becomes a pair of observations: one with frequency EVENTS and the other with frequency TRIALS-EVENTS. In the loglikelihood function, the frequency variable is incorporated into the summation.

proc iml;
/* 1. read data and form design matrix */
varNames = {'Heat' 'Soak'};
use ingots;
read all var "r" into Events;
read all var "n" into Trials;
read all var varNames into X;
close;
n = nrow(X);
X = j(n,1,1) || X;             /* design matrix: add Intercept column */
 
/* for event-trial syntax, split each data row into TWO observations,
   one for the events and one for the nonevents. Create Freq = frequency variable. */ 
Freq = Events // (Trials-Events);
X = X // X;                    /* duplicate regressors */
y = j(n,1,1) // j(n,1,0);      /* binary response: 1s and 0s */
 
/* 2. define loglikelihood function for events-trials logistic model */
start BinLogisticLLEvents(b) global(X, y, Freq);
   z = X*b`;                   /* X*b, where b is a column vector */
   p = Logistic(z);            /* 1 / (1+exp(-z)) */
   LL = sum( Freq#(y#log(p) + (1-y)#log(1-p)) );   /* weight each observation by its frequency */
   return( LL );
finish;
 
/* 3. Make initial guess and find parameters that maximize the loglikelihood */
b0 = j(1, ncol(X), 0);         /* initial guess (row vector) */
opt = 1;                       /* find maximum of function */
call nlpnra(rc, b, "BinLogisticLLEvents", b0, opt);
print b[c=("Intercept"||varNames) L="Parameter Estimates" F=D8.];
QUIT;

For these data and model, the parameter estimates are the same to four decimal places. Again, there are small relative differences in the estimates. But the point is that you can solve for the MLE parameter estimates for binomial data from first principles by using about 20 lines of SAS/IML code.

Summary

This article shows how to obtain parameter estimates for a binary logistic regression model by writing a short SAS/IML program. With slightly more effort, you can obtain parameter estimates for a binomial model in events-trials syntax. These examples demonstrate the power, flexibility, and compactness of the SAS/IML matrix programming language.

Of course, this power and flexibility come at a cost. First, when you use SAS/IML to solve a problem, you must understand the underlying mathematics of the problem. Second, the learning curve for SAS/IML is steeper than for other SAS procedures. SAS procedures such as PROC LOGISTIC are designed so that you can focus on building a good predictive model without worrying about the details of numerical methods. Accordingly, SAS procedures can be used by analysts who want to focus solely on modeling. In contrast, the IML procedure is often used by sophisticated programmers who want to extend the capabilities of SAS by implementing custom algorithms.

The post Implement binary logistic regression from first principles appeared first on The DO Loop.

October 3, 2022
 

A previous article discusses the definitions of three kinds of moments for a continuous probability distribution: raw moments, central moments, and standardized moments. These are defined in terms of integrals over the support of the distribution. Moments are connected to the familiar shape features of a distribution: the mean, variance, skewness, and kurtosis. This article shows how to evaluate integrals in SAS to evaluate the moments of a continuous distribution, and how to compute the mean, variance, skewness, and kurtosis of the distribution.

Notice that this article is not about how to compute the sample estimates. You can easily use PROC MEANS or PROC UNIVARIATE to compute the sample mean, sample variance, and so forth, from data. Rather, this article is about how to compute the quantities for the probability distribution.

Definitions of moments of a continuous distribution

Evaluating a moment requires performing an integral. The domain of integration is the support of the distribution. The support is often infinite (for example, the normal distribution on (-∞, ∞)) or semi-infinite (for example, the lognormal distribution on [0,∞)), but sometimes it is finite (for example, the beta distribution on [0,1]). The QUAD subroutine in SAS/IML software can compute integrals on finite, semi-infinite, and infinite domains. You need to specify a SAS/IML module that evaluates the integrand at an arbitrary point in the domain. To compute a raw moment, the integrand is the product of a power of x and the probability density function (PDF). To compute a central moment, the integrand is the product of the PDF and a power of (x - μ), where μ is the mean of the distribution. Let f(x) be the PDF. The moments relate to the mean, variance, skewness, and kurtosis as follows:

  • The mean is the first raw moment: \(\mu = \int_{-\infty}^{\infty} x f(x)\, dx\)
  • The variance is the second central moment: \(\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\, dx\)
  • The skewness is the third standardized moment: \(\mu_3 / \sigma^3\) where \(\mu_3 = \int_{-\infty}^{\infty} (x - \mu)^3 f(x)\, dx\) is the third central moment.
  • The full kurtosis is the fourth standardized moment: \(\mu_4 / \sigma^4\) where \(\mu_4 = \int_{-\infty}^{\infty} (x - \mu)^4 f(x)\, dx\) is the fourth central moment. For consistency with the sample estimates, you can compute the excess kurtosis by subtracting 3 from this quantity.

Compute moments in SAS

To demonstrate how to compute the moments of a continuous distribution, let's use the gamma distribution with shape parameter α=4. The PDF of the Gamma(4) distribution is shown to the right. The domain of the PDF is [0, ∞). The mean, variance, skewness, and excess kurtosis for the gamma distribution can be computed analytically. The formulas are given in the Wikipedia article about the gamma distribution. The following SAS/IML program evaluates the formulas for α=4.

/* compute moments (raw, central, and standardized) for a probability distribution */
proc iml;
/* Formulas for the mean, variance, skewness, and excess kurtosis, 
   of the gamma(alpha=4) distribution from https://en.wikipedia.org/wiki/Gamma_distribution
*/
alpha = 4;                 /* shape parameter */
mean  = alpha;             /* mean */
var   = alpha;             /* variance */
skew  = 2 / sqrt(alpha);   /* skewness */
exKurt= 6 / alpha;         /* excess kurtosis */
colNames = {'Mean' 'Var' 'Skew' 'Ex Kurt'};
print (mean||var||skew||exKurt)[c=colNames L="Formulas for Gamma"];

Let's evaluate these same quantities by performing numerical integration of the integrals that define the moments of the distribution:

/* return the PDF of a specified probability distribution */
start f(x) global(alpha);
   return  pdf("Gamma", x, alpha);  /* return the PDF evaluated at x */
finish;
 
/* the first raw moment is the mean: \int x*f(x) dx */
start Raw1(x);
   return( x # f(x) );              /* integrand for first raw moment */
finish;
 
/* the second central moment is the variance: \int (x-mu)^2*f(x) dx */
start Central2(x) global(mu);
   return( (x-mu)##2 # f(x) );      /* integrand for second central moment */
finish;
 
/* the third central moment is the UNSTANDARDIZED skewness: \int (x-mu)^3*f(x) dx */
start Central3(x) global(mu);
   return( (x-mu)##3 # f(x) );      /* integrand for third central moment */
finish;
 
/* the fourth central moment is the UNSTANDARDIZED full kurtosis: \int (x-mu)^4*f(x) dx */
start Central4(x) global(mu);
   return( (x-mu)##4 # f(x) );      /* integrand for fourth central moment */
finish;
 
Interval = {0, .I};                 /* support of gamma distribution is [0, infinity) */
call quad(mu, "Raw1", Interval);    /* define mu, which is used for central moments */
call quad(var, "Central2", Interval);
call quad(cm3, "Central3", Interval);
call quad(cm4, "Central4", Interval);
 
/* do the numerical calculations match the formulas? */
skew = cm3 / sqrt(var)##3;          /* standardize 3rd central moment */
exKurt = cm4 / sqrt(var)##4 - 3;    /* standardize 4th central moment; subtract 3 to get excess kurtosis */
print (mu||var||skew||exKurt)[c=colNames L="Numerical Computations for Gamma"];

For this distribution, the numerical integration produces numbers that exactly match the formulas. For other distributions, the numerical integration might differ from the exact values by a tiny amount.
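
For example, the following statements (a quick sketch of my own, not from the original article) apply the same technique to the standard lognormal distribution, whose mean is exp(1/2) ≈ 1.6487. You can submit these statements in the same PROC IML session:

start LN1(x);
   return( x # pdf("Lognormal", x) );  /* integrand x*f(x) for the standard lognormal */
finish;
call quad(LNmu, "LN1", {0, .I});       /* support is [0, infinity) */
exact = exp(0.5);                      /* analytical mean of the standard lognormal */
print LNmu exact (LNmu - exact)[L="Difference"];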

Checking the moments against the sample estimates

Not all distributions have analytical formulas for the mean, variance, and so forth. When I compute a new quantity, I like to verify the computation by comparing it to another computation. In this case, you can compare the distributional values to the corresponding sample estimates for a large random sample. In other words, if you generate a large random sample from the Gamma(4) distribution and compute the sample mean, sample variance, and so forth, you should obtain sample estimates that are close to the corresponding values for the distribution. You can perform this simulation in the DATA step or in the SAS/IML language. The following statements use the DATA step and PROC MEANS to generate and summarize a random sample of size N=10,000 from the gamma distribution:

data Gamma(keep = x);
alpha = 4;
call streaminit(1234);
do i = 1 to 10000;
   x = rand("Gamma", alpha);
   output;
end;
run;
 
proc means data=Gamma mean var skew kurt NDEC=3;
run;

As expected, the sample estimates are close to the values of the distribution. The sample skewness and kurtosis are biased statistics and have relatively large standard errors, so the estimates for those parameters might not be very close to the distributional values, even for a large sample.

Summary

This article shows how to use the QUAD subroutine in SAS/IML software to compute raw and central moments of a probability distribution. This article uses the gamma distribution with shape parameter α=4, but the computation generalizes to other distributions. The important step is to write a user-defined function that evaluates the PDF of the distribution of interest. For many common distributions, you can use the PDF function in Base SAS to evaluate the density function. You also must specify the support of the distribution, which is often an infinite or semi-infinite domain of integration. When the moments exist, you can compute the mean, variance, skewness, and kurtosis of any distribution.

The post Compute moments of probability distributions in SAS appeared first on The DO Loop.