Statistical Programming

November 18, 2020
 
Predicted probabilities for a logistic regression model

To help visualize regression models, SAS provides the EFFECTPLOT statement in several regression procedures and in PROC PLM, which is a general-purpose procedure for post-fitting analysis of linear models. When scoring and visualizing a model, it is important to use reasonable combinations of the explanatory variables for the visualization. When the explanatory variables are correlated, the standard visualizations might evaluate the model at unreasonable (or even impossible!) combinations of the explanatory variables. This article describes how to generate points in an ellipsoid and score (evaluate) a regression model at those points. The ellipsoid contains typical combinations of the explanatory variables when the explanatory variables are multivariate normal. An example is shown in the graph to the right, which shows the predicted probabilities for a logistic regression model.

This technique can help prevent nonsensical predicted values. For example, in a previous article, I show an example of a quadratic regression surface that appears to predict negative values for a quantity that can never be negative (such as length, mass, or volume). The reason for the negative prediction is that the model is being evaluated at atypical points that are far from the explanatory values in the experiment.

Example: Visualizing the predicted probabilities for a logistic model

Let's look at an example of a regression model for which the regressors are correlated. Suppose you want to predict the gender of the 19 students in the Sashelp.Class data based on their height and weight. (Apologies to my international readers, but the measurements are in pounds and inches.) The following call to PROC LOGISTIC fits a logistic model to the data and uses the EFFECTPLOT statement to create a contour plot of the predicted probability that a student is female:

proc logistic data=Sashelp.Class;
   model Sex = Height Weight;
   effectplot contour / ilink;      /* contour plot of pred prob */
   store out=LogiModel;             /* store model for later */
run;

The contour plot shows the predicted probability of being female according to the model. The red regions have a high probability of being female and the blue regions have a low probability. For a given height, lighter students are more likely to be female.

This graph is a great summary of the model. However, because height and weight are correlated, the data values do not fill the rectangular region in the graph. There are no students in the upper-left corner near (height, weight)=(51, 150). Such a student would be severely obese. Similarly, there are no students in the lower-right corner near (height, weight)=(72, 50). Such a student would be severely underweight.

You can graph just the explanatory variables and overlay a 95% prediction ellipse for bivariate normal data, as shown below:

title "Weight versus Height";
proc sgplot data=Sashelp.Class;
   scatter x=Height y=Weight / jitter transparency=0.25 group=Sex markerattrs=(size=10 symbol=CircleFilled);
   ellipse x=Height y=Weight;    /* 95% prediction ellipse for bivariate normal data */
run;

The heights and weights of most students will fall into the prediction ellipse, assuming that the variables are multivariate normal with a mean vector that is close to (62.3, 100), which is the mean of the data.

If you want to score the model at "representative" values of the data, it makes sense to generate points inside the ellipse. The following sections show how to generate a regular grid of values inside the ellipse and how to score the logistic model on those points.

Generate data in an ellipsoid

I can think of several ways to generate a grid of values inside the ellipse, but the easiest way is to use the Mahalanobis distance (MD). The Mahalanobis distance is a metric that accounts for the correlation in data. Accordingly, you can generate a regular grid of points and then keep only the points whose Mahalanobis distance to the mean vector is less than a certain cutoff value. In this section, I choose the cutoff value to be the maximum Mahalanobis distance in the original data. Alternately, you can use probability theory and the fact that the distribution of the squared MD values is asymptotically chi-squared.

The following SAS/IML program reads the explanatory variables and computes the sample mean and covariance matrix. You can use the MAHALANOBIS function to compute the Mahalanobis distance between each observation and the sample mean. The maximum MD for these data is about 2.23. You can use the ExpandGrid function to generate a regular grid of points, then keep only those points whose MD is less than 2.23. The resulting points are inside an ellipse and represent "typical" values for the variables.

proc iml;
varNames = {'Height' 'Weight'};
use Sashelp.Class; read all var varNames into X;  close;
 
S = cov(X);
m = mean(X);
MD_Data = mahalanobis(X, m, S);
MaxMD = max(MD_Data);
print MaxMD;
 
minX = 50;  maxX = 72;           /* or use min(X[,1]) and max(X[,1]) */
minY = 50;  maxY = 150;          /* or use min(X[,2]) and max(X[,2]) */
xRng = minX:maxX;                /* step by 1 in X */
yRng = do(minY, maxY, 5);        /* step by 5 in Y */
 
u = ExpandGrid( xRng, yRng );    /* regular grid of points */
MD = mahalanobis(u, m, S);       /* MD to center for each point */
u = u[ loc(MD < MaxMD), ];       /* keep only points where MD < cutoff */
ods graphics / width=300px height=300px;
call scatter(u[,1], u[,2]);
 
create ScoreIn from u[c=varNames];   /* write score data set */
   append from u;
close;
QUIT;
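If you prefer the chi-square cutoff mentioned above, here is a minimal sketch (my own addition, not part of the program above). It assumes two explanatory variables and a 95% coverage level; you would use this value in place of MaxMD:

proc iml;
/* Sketch: cutoff based on the asymptotic chi-square distribution of the
   squared Mahalanobis distance. Assumes d=2 variables and 95% coverage. */
d = 2;                                        /* Height and Weight */
cutoff = sqrt( quantile("ChiSq", 0.95, d) );  /* keep points where MD < cutoff */
print cutoff;                                 /* about 2.45 */
QUIT;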

Score data in an ellipsoid

You can use PROC PLM to score the logistic model at the points in the ellipsoid. In the earlier call to PROC LOGISTIC, I used the STORE statement to create an item store named LogiModel. The following call to PROC PLM scores the model and uses the ILINK option to output the predicted probabilities. You can use the HEATMAPPARM statement in PROC SGPLOT to visualize the predicted probabilities. So that the color ramp always visualizes probabilities on the interval [0,1], I add two fake observations (0 and 1) to the predicted values.

proc PLM restore=LogiModel noprint;
   score data=ScoreIn out=ScoreOut / ilink;      /* predicted probability */
run;
 
data Fake;           /* Optional trick: Add 0 and 1 so colorramp shows range [0,1] */
Predicted=0; output;
Predicted=1; output;
run;
 
data Score;          /* concatenate the real and fake data */
   set ScoreOut Fake;
run;
 
title "Predicted Values of Sex='F' for Logistic Model";
proc sgplot data=Score;
   styleattrs wallcolor=CxF8F8F8;
   heatmapparm  x=Height y=Weight colorresponse=Predicted;
   xaxis grid; yaxis grid;
run;

The graph is shown at the top of this article. It shows the predicted probabilities for representative combinations of the height and weight variables. Students whose measurements appear in the lower portion of the ellipse (red) are more likely to be female. Students in the upper portion of the ellipse (blue) are more likely to be male. It is not likely that students have measurements outside the ellipse, so the model is not evaluated outside the ellipse.

Summary

This article shows how you can use the Mahalanobis distance to generate scoring data that are in an ellipsoid that is centered at the sample mean vector. Such data are appropriate for scoring "typical" combinations of variables in data that are approximately multivariate normal. Although the example used a logistic regression model, the idea applies to any model that you want to score. This technique can help to prevent scoring the model at unrealistic values of the explanatory variables.

The post Create scoring data when regressors are correlated appeared first on The DO Loop.

November 9, 2020
 

Intuitively, the skewness of a unimodal distribution indicates whether a distribution is symmetric or not. If the right tail has more mass than the left tail, the distribution is "right skewed." If the left tail has more mass, the distribution is "left skewed." Thus, estimating skewness requires estimating the relative sizes of the left and right tails of a distribution.

The standard definition of skewness is called the moment coefficient of skewness because it is based on the third central moment. The moment coefficient of skewness is a biased estimator and is also not robust to outliers in the data. This article discusses an estimator proposed by Hogg (1974) that is robust and less biased. The ideas in this article are based on Bono, et al. (2020). Bono et al. provide a SAS macro that uses SAS/IML to compute the estimates. This article provides SAS/IML functions that compute the estimates more efficiently and more succinctly.

Early robust estimators of skewness

Hogg's estimate of skewness is merely one of many robust measures of skewness that have been proposed since the 1920s. I have written about the Bowley-Galton skewness, which is based on quantiles. The Bowley-Galton estimator defines the length of the right tails as Q3-Q2 and the length of the left tail as Q2-Q1. (Q1 is the first quartile, Q2 is the median, and Q3 is the third quartile.) It then defines the quantile skewness as half the relative difference between the lengths, which is BG = ((Q3-Q2)-(Q2-Q1)) / (Q3-Q1).

Although "the tail" does not have a universally accepted definition, some people would argue that a tail should include portions of the distribution that are more extreme than the Q1 and Q3 quartiles. You could generalize the Bowley-Galton estimator to use more extreme quantiles. For example, you could use the upper and lower 20th percentiles. If P20 is the 20th percentile and P80 is the 80th percentile, you can define an estimator to be ((P80-Q2)-(Q2-P20)) / (P80-P20). The choice of percentiles to use is somewhat arbitrary.

Hogg's estimator of skewness

Hogg's measure of skewness (1974, p. 918) is a ratio of the right tail length to the left tail length. The right tail length is estimated as U(0.05) – M25, where U(0.05) is the average of the largest 5% of the data and M25 is the 25% trimmed mean of the data. The left tail length is estimated as M25 – L(0.05), where L(0.05) is the average of the smallest 5% of the data. The Hogg estimator is
SkewH = (U(0.05) – M25) / (M25 - L(0.05))

Note that the 25% trimmed mean is the mean of the middle 50% of the data. So the estimator is based on estimating the means of various subsets of the data that are based on quantiles of the data.

Compute the mean of the lower and upper tails

I have written about how to visualize and compute the average value in a tail. However, I did not discuss how to make sense of phrases like "5% of the data" in a sample of size N when 0.05*N is not an integer. There are many ways to estimate quantiles, but Bono et al. (2020) use an interpolation method (QNTLDEF=1 in SAS) to compute a weighted mean of the data based on the estimated quantiles.

In Bono et al., the "mean of the lower p_th quantile" is computed by finding k=floor(p*N), which is the largest integer that does not exceed p*N, and the fractional remainder, r = p*N – k. After sorting the data from lowest to highest, you sum the first k values and a portion of the next value. You then compute the average as (sum(x[1:k]) + r*x[k+1]) / (k+r), as shown in the following program. The mean of the upper quantile is similar.

proc iml;
/* ASSUME x is sorted in ascending order.
   Compute the mean of the lower p_th percentile of the data.
   Use linear interpolation if p*N is not an integer. Assume 0<= p <= 1. 
*/
start meanLowerPctl(p, x);
   N = nrow(x);  
   k = floor(p*N);                 /* largest integer <= p*N */
   r = p*N - k;                    /* r >= 0 */
   if k<1 then m = x[1];           /* use x[1] as the mean */
   else if k>=N then m = mean(x);  /* use mean of all data */
   else do;                        /* use interpolation */
      sum = sum(x[1:k]) + r*x[k+1];
      m = sum / (k+r);
   end;
   return( m );
finish;
 
/* ASSUME x is sorted in ascending order.
   Compute the mean of the upper p_th percentile of the data.
*/
start meanUpperPctl(p, x);
   q = 1-p; 
   N = nrow(x); 
   k = ceil(q*N);                  /* smallest integer >= q*N */
   r = k - q*N;                    /* r >= 0 */
   if k<=1 then m = mean(x);       /* use mean of all data */
   else if k>=N then m = x[N];     /* use x[N] (the largest value) */
   else do;                        /* use interpolation */
      sum = sum(x[k+1:N]) + r*x[k];
      m = sum / (N-k+r);
   end;
   return( m );
finish;

Let's run a small example so that we can check that the functions work. The following statements define an ordered sample of size N=10 and compute the mean of the lower and upper p_th quantiles for p=0.1, 0.2, ..., 1.

x = {2, 4, 5, 7, 8, 8, 9, 9, 12, 16};  /* NOTE: sorted in increasing order */
obsNum = T(1:nrow(x));
Pctl = obsNum / nrow(x);
meanLow = j(nrow(Pctl), 1, .);
meanUp = j(nrow(Pctl), 1, .);
do i = 1 to nrow(Pctl);
   meanLow[i] = meanLowerPctl(Pctl[i], x);
   meanUp[i] = meanUpperPctl(Pctl[i], x);
end;
print Pctl meanLow meanUp;

Because there are 10 observations and the program uses deciles to define the tails, each row gives the mean of the smallest and largest k observations for k=1, 2, ..., 10. For example, the second row indicates that the mean of the two smallest values is 3, and the mean of the two largest values is 14.

If you plot the estimated means as a function of the percentile, you can see how the functions interpolate between the data values. The following statements compute and graph the mean of the lower p_th quantile as a function of p. The graph of the mean of the upper tail is computed in an analogous fashion:

/* plot the mean of the lower tail as a function of p */
p = do(0, 1, 0.01);
mLow = j(1, ncol(p), .);
do i = 1 to ncol(p);
   mLow[i] = meanLowerPctl(p[i], x);
end;
 
title "Mean of Lower p Percentile of Data";
call series(p, mLow) grid={x y} xvalues=do(0,1,0.1) label={"Percentile" "Mean"};

Compute Hogg's skewness in SAS

With these functions, you can compute Hogg's estimate of skewness. The following SAS/IML function implements the formula. The function is then tested on a large random sample from the exponential distribution.

/* estimate of Hogg skewness */
start HoggSkew(_x);
   x = _x;
   call sort(x);
   L05 = meanLowerPctl(0.05, x);
   U05 = meanUpperPctl(0.05, x);
   m25 = mean(x, 'trim', 0.25);
   Skew_Hogg = (U05 - m25) / (m25 - L05); /* Right Skewness: Hogg (1974, p. 918) */
   return Skew_Hogg;
finish;
 
/* generate random sample from the exponential distribution */
call randseed(12345, 1);
N = 1E5;
x = randfun(N, "Expon");
Skew_Hogg = HoggSkew(x);  /* find the Hogg skewness for exponential data */
print Skew_Hogg;

The sample estimate of Hogg's skewness for these data is 4.558. The value of Hogg's skewness for the exponential distribution is 4.569, so there is close agreement between the parameter and its estimate from the sample.
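If you want to verify the population value, a minimal sketch (my own, not from Bono et al.) is to integrate x*pdf(x) over the relevant regions of the standard exponential distribution by using the QUAD subroutine:

proc iml;
/* Sketch: population value of the Hogg skewness for the Expon(1)
   distribution, computed by numerical integration */
start xExpPDF(x);
   return ( x # pdf("Expon", x) );     /* integrand for the mean on a subinterval */
finish;
 
/* mean of the Expon(1) distribution restricted to the interval [a,b] */
start TailMean(a, b);
   call quad(numer, "xExpPDF", a || b);
   cdf_a = choose(a=.M, 0, cdf("Expon", a));   /* CDF(-infinity) = 0 */
   cdf_b = choose(b=.P, 1, cdf("Expon", b));   /* CDF(+infinity) = 1 */
   return ( numer / (cdf_b - cdf_a) );
finish;
 
qL  = quantile("Expon", 0.05);   qU  = quantile("Expon", 0.95);
q25 = quantile("Expon", 0.25);   q75 = quantile("Expon", 0.75);
L05 = TailMean(0, qL);           /* mean of lower 5% tail */
U05 = TailMean(qU, .P);          /* mean of upper 5% tail */
M25 = TailMean(q25, q75);        /* 25% trimmed mean = mean of middle 50% */
SkewH = (U05 - M25) / (M25 - L05);
print SkewH;                     /* approximately 4.57 */
quit;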

Compute Hogg's kurtosis in SAS

In the same paper, Hogg (1974) proposed a robust measure of kurtosis, which is
KurtH = (U(0.2) – L(0.2)) / (U(0.5) – L(0.5))
where U(0.2) and L(0.2) are the means of the upper and lower 20% of the data (respectively) and U(0.5) and L(0.5) are the means of the upper and lower 50% of the data (respectively). You can use the same SAS/IML functions to compute Hogg's kurtosis for the data:

/* estimate of Hogg kurtosis */
start HoggKurt(_x);
   x = _x;
   call sort(x);
   L20 = meanLowerPctl(0.20, x);
   U20 = meanUpperPctl(0.20, x);
   L50 = meanLowerPctl(0.50, x);
   U50 = meanUpperPctl(0.50, x);
   Kurt_Hogg = (U20 - L20) / (U50 - L50); /* Kurtosis: Hogg (1974, p. 913) */
   return Kurt_Hogg;
finish;
 
Kurt_Hogg = HoggKurt(x);  /* find the Hogg kurtosis for exponential data */
print Kurt_Hogg;

For comparison, the exact value of the Hogg kurtosis for the exponential distribution is 1.805.

Summary

This article shows how to compute Hogg's robust measures of skewness and kurtosis. The article was inspired by Bono et al. (2020), who explore the bias and accuracy of Hogg's measures of skewness and kurtosis as compared to the usual moment-based skewness and kurtosis. They conclude that Hogg’s estimators are less biased and more accurate. Bono et al. provide a SAS macro that computes Hogg's estimators. In this article, I rewrote the computations as SAS/IML functions to make them more efficient and more understandable. These functions enable researchers to compute Hogg's skewness and kurtosis for data analysis and in simulation studies.

You can download the complete SAS program that defines the Hogg measures of skewness and kurtosis. The program includes computations that show how to compute these values for a distribution.


The post Robust statistics for skewness and kurtosis appeared first on The DO Loop.

November 4, 2020
 

The expected value of a random variable is essentially a weighted mean over all possible values. You can compute it by summing (or integrating) a probability-weighted quantity over all possible values of the random variable. The expected value is a measure of the "center" of a probability distribution.

Expected value for the tail of a distribution

You can generalize this idea. Instead of using the entire distribution, suppose you want to find the "center" of only a portion of the distribution. For the central portion of a distribution, this process is similar to the trimmed mean. (The trimmed mean discards data in the tails of a distribution and averages the remaining values.) For the tails of a distribution, a natural way to compute the expected value is to sum (or integrate) the weighted quantity x*pdf(x) over the tail of the distribution. The graph to the right illustrates this idea for the exponential distribution. The left and right tails (defined by the 20th and 80th percentiles, respectively) are shaded and the expected value in each tail is shown by using a vertical reference line.

This article shows how to compute the expected value of the tail of a probability distribution. It also shows how to estimate this quantity for a data sample.

Why is this important? Well, this idea appears in some papers about robust and unbiased estimates of the skewness and kurtosis of a distribution (Hogg, 1974; Bono, et al., 2020). In estimating skewness and kurtosis, it is important to estimate the length of the tails of a distribution. Because the tails of many distributions are infinite in length, you need some alternative definition of "tail length" that leads to finite quantities. One approach is to truncate the tails, such as at the 5th percentile on the left and the 95th percentile on the right. An alternative approach is to use the expected value of the tails, as shown in this article.

The expected value in a tail

Suppose you have any distribution with density function f. You can define the tail distribution as a truncated distribution on the interval (a,b), where possibly a = -∞ or b = ∞. To get a proper density, you need to divide by the area of the tail, as follows:
\( g(x) = f(x) / \int_a^b f(x) \,dx \)
If F(x) is the cumulative distribution, the denominator is simply the expression F(b) – F(a). Therefore, the expected value for the truncated distribution on (a,b) is
\( EV = \int_a^b x g(x) \,dx = (\int_a^b x f(x) \,dx) / (F(b) - F(a)) \)

There is no standard definition for the "tail" of a distribution, but one definition is to use symmetric quantiles of the distribution to define the tails. For a probability, p, let qL be the pth quantile and qU be the (1-p)th quantile. You can define the left tail to be the portion of the distribution for which X ≤ qL and the right tail to be the portion for which X ≥ qU. Then the expected value of the left tail is
\( E_L = (\int_{-\infty}^{q_L} x f(x) \,dx) / (F(q_L) - 0) \)
and the expected value of the right tail is
\( E_R = (\int_{q_U}^{\infty} x f(x) \,dx) / (1 - F(q_U)) \)

The expected value in the tail of the exponential distribution

For an example, let's look at the exponential distribution. The exponential distribution is defined only for x ≥ 0, so the left tail starts at 0. The choice of the quantile, p, is arbitrary, but I will use p=0.2 because that value is used in Bono, et al. (2020). The 20th percentile of the exponential distribution is q20 = 0.22. The 80th percentile is q80 = 1.61. You can use the QUAD function in the SAS/IML language to compute the integrals, as follows:

proc iml;
/* find the expected value of the truncated exponential distribution on the interval [a,b] 
*/
start Integrand(x);
   return x*pdf("Expon",x);
finish;
 
/* if f(x) is the PDF and F(x) is the CDF of a distribution, the expected value on [a,b] is
   (\int_a^b  x*f(x) dx) / (CDF(b) - CDF(a))
*/
start Expo1Moment(a,b);
   call quad(numer, "Integrand", a||b );
   /* define CDF(.M)=0 and CDF(.P)=1 */
   cdf_a = choose(a=.M, 0, cdf("Expon", a));  /* CDF(a); .M = -infinity */
   cdf_b = choose(b=.P, 1, cdf("Expon", b));  /* CDF(b); .P = +infinity */
   ExpectedValue = numer / (cdf_b - cdf_a);
   return ExpectedValue;
finish;
 
/* expected value of lower 20th percentile of Expon distribution */
p = 0.2;
qLow = quantile("Expon", p);
ExpValLow20 = Expo1Moment(0, qLow);
print qLow ExpValLow20;
 
/* expected value of upper 20th percentile */
qHi = quantile("Expon", 1-p);
ExpValUp20 = Expo1Moment(qHi, .P); /* .P = +infinity */
print qHi ExpValUp20;

In this program, the left tail is the portion of the distribution to the left of the 20th percentile. The right tail is to the right of the 80th percentile. The first table says that the expected value in the left tail of the exponential distribution is 0.107. Intuitively, that is the weighted average of the left tail or the location of its center of mass. The second table says that the expected value in the right tail is 2.609. These results are visualized in the graph at the top of this article.

Estimate the expected value in the tail of a distribution

You can perform a similar computation for a data sample. Instead of an integral, you merely take the average of the lower and upper p*100% of the data. For example, the following SAS/IML statements simulate 100,000 random variates from the exponential distribution. You can use the QNTL function to estimate the quantiles of the data. You can then use the LOC function to find the elements that are in the tail and use the MEAN function to compute the arithmetic average of those elements:

/* generate data from the exponential distribution */
call randseed(12345);
N = 1e5;
x = randfun(N, "Expon");
 
/* assuming no duplicates and a large sample, you can 
   use quantiles and means to estimate the expected values */
/* estimate the expected value for the lower 20% tail */
call qntl(qEst, x, p);      /* p_th quantile */
idx = loc(x<=qEst);         /* rows for which x[i]<= quantile */
meanLow20 = mean(x[idx]);   /* mean of lower tail */
print qEst meanLow20;
 
/* estimate the expected value for the upper 20% tail */
call qntl(qEst, x, 1-p);    /* (1-p)_th quantile */
idx = loc(x>=qEst);         /* rows for which x[i]>= quantile */
meanUp20 = mean(x[idx]);    /* mean of upper tail */
print qEst meanUp20;

The estimates are shown for the lower and upper 20th-percentile tails of the data. Because the sample is large, the sample estimates are close to the quantiles of the exponential distribution. Also, the means of the lower and upper tails of the data are close to the expected values for the tails of the distribution.

It is worth mentioning that there are many different formulas for estimating quantiles. Each estimator gives a slightly different estimate of the quantile, and therefore you can get different estimates for the mean. This becomes important for small samples, for long-tailed distributions, and for samples that have duplicate values.
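For example, the following sketch (my own toy data) shows that the estimated 20th percentile depends on the quantile definition, which you can control by using the PCTLDEF= option in PROC UNIVARIATE:

/* Sketch: the estimated 20th percentile depends on the quantile definition */
data Tiny;
   input x @@;
   datalines;
2 4 5 7 8 8 9 9 12 16
;
 
proc univariate data=Tiny noprint pctldef=1;     /* interpolation definition */
   var x;
   output out=Def1 pctlpre=P pctlpts=20;
run;
proc univariate data=Tiny noprint pctldef=5;     /* default definition */
   var x;
   output out=Def5 pctlpre=P pctlpts=20;
run;
 
data Compare;
   set Def1(in=isDef1) Def5;
   length Definition $9;
   if isDef1 then Definition = "PCTLDEF=1";
   else           Definition = "PCTLDEF=5";
run;
 
proc print data=Compare noobs;  var Definition P20;  run;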

Summary

The expected value is a measure of the "center" of a probability distribution on some domain. This article shows how to solve an integral to find the expected value for the left tail or right tail of a distribution. For a data distribution, the expected value is the mean of the observations in the tail. In a future article, I'll show how to use these ideas to create robust and unbiased estimates of skewness and kurtosis.

The post The expected value of the tail of a distribution appeared first on The DO Loop.

October 26, 2020
 

A fundamental principle of data analysis is that a statistic is an estimate of a parameter for the population. A statistic is calculated from a random sample. This leads to uncertainty in the estimate: a different random sample would have produced a different statistic. To quantify the uncertainty, SAS procedures often support options that estimate standard errors for statistics and confidence intervals for parameters.

Of course, if statistics have uncertainty, so, too, do functions of the statistics. For complicated functions of the statistics, the bootstrap method might be the only viable technique for quantifying the uncertainty.

Graphical comparison of two methods for estimating confidence intervals of eigenvalues of a correlation matrix

This article shows how to obtain confidence intervals for the eigenvalues of a correlation matrix. The eigenvalues are complicated functions of the correlation estimates. The eigenvalues are used in a principal component analysis (PCA) to decide how many components to keep in a dimensionality reduction. There are two main methods to estimate confidence intervals for eigenvalues: an asymptotic (large sample) method, which assumes that the eigenvalues are multivariate normal, and a bootstrap method, which makes minimal distributional assumptions.

The following sections show how to compute each method. A graph of the results is shown to the right. For the data in this article, the bootstrap method generates confidence intervals that are more accurate than the asymptotic method.

This article was inspired by Larsen and Warne (2010), "Estimating confidence intervals for eigenvalues in exploratory factor analysis." Larsen and Warne discuss why confidence intervals can be useful when deciding how many principal components to keep.

Eigenvalues in principal component analysis

To demonstrate the techniques, let's perform a principal component analysis (PCA) on the four continuous variables in the Fisher Iris data. In SAS, you can use PROC PRINCOMP to perform a PCA, as follows:

%let DSName = Sashelp.Iris;
%let VarList = SepalLength SepalWidth PetalLength PetalWidth;
 
/* 1. compute value of the statistic on original data */
proc princomp data=&DSName STD plots=none;     /* stdize PC scores to unit variance */
   var &VarList;
   ods select ScreePlot Eigenvalues NObsNVar;
   ods output Eigenvalues=EV0 NObsNVar=NObs(where=(Description="Observations"));
run;
 
proc sql noprint;                              
 select nValue1 into :N from NObs;   /* put the number of obs into macro variable, N */
quit;
%put &=N;
Output from PROC PRINCOMP that shows eigenvalues of the correlation matrix

The first table shows that there are 150 observations. The second table displays the eigenvalues for the sample, which are 2.9, 0.9, 0.15, and 0.02. If you want a graph of the eigenvalues, you can use the PLOTS(ONLY)=SCREE option on the PROC PRINCOMP statement. The ODS output statement creates SAS data sets from the tables. The PROC SQL call creates a macro variable, N, that contains the number of observations.

Asymptotic confidence intervals

If a sample size, n, is large enough, the sampling distribution of the eigenvalues is approximately multivariate normal (Larsen and Ware (2010, p. 873)). If g is an eigenvalue for a correlation matrix, then an asymptotic confidence interval is
     g ± z* sqrt( 2 g² / n )
where z* is the standard normal quantile, as computed in the following program:

/* Asymptotic CIs for eigenvalues (Larsen and Warne (2010, p. 873) */
data AsympCI;
set EV0;
alpha = 0.05;
z = quantile("Normal", 1 - alpha/2);  /* = 1.96 */
SE = sqrt(2*Eigenvalue**2 / &N);
Normal_LCL = Eigenvalue - z*SE;       /* g +/- z*sqrt(2 g^2 / n) */
Normal_UCL = Eigenvalue + z*SE;
drop alpha z SE;
run;
 
proc print data=AsympCI noobs;
   var Number Eigenvalue Normal_LCL Normal_UCL;
run;
Asymptotic (normal) confidence intervals for eigenvalues

The lower and upper confidence limits are shown for each eigenvalue. The advantage of this method is its simplicity. The intervals assume that the distribution of the eigenvalues is multivariate normal, which will occur when the sample size is very large. Since N=150 does not seem "very large," it is not clear whether these confidence intervals are valid. Therefore, let's estimate the confidence intervals by using the bootstrap method and compare the bootstrap intervals to the asymptotic intervals.

Bootstrap confidence intervals for eigenvalues

The bootstrap computations in this section follow the strategy outlined in the article "Compute a bootstrap confidence interval in SAS." (For additional bootstrap tips, see "The essential guide to bootstrapping in SAS.") The main steps are:

  1. Compute a statistic for the original data. This was done in the previous section.
  2. Use PROC SURVEYSELECT to resample (with replacement) B times from the data. For efficiency, put all B random bootstrap samples into a single data set.
  3. Use BY-group processing to compute the statistic of interest on each bootstrap sample. The union of the statistics is the bootstrap distribution, which approximates the sampling distribution of the statistic. Remember to turn off ODS when you run BY-group processing!
  4. Use the bootstrap distribution to obtain confidence intervals for the parameters.

The steps are implemented in the following SAS program:

/* 2. Generate many bootstrap samples */
%let NumSamples = 5000;       /* number of bootstrap resamples */
proc surveyselect data=&DSName NOPRINT seed=12345
     out=BootSamp(rename=(Replicate=SampleID))
     method=urs              /* resample with replacement */
     samprate=1              /* each bootstrap sample has N observations */
     /* OUTHITS                 option to suppress the frequency var */
     reps=&NumSamples;       /* generate NumSamples bootstrap resamples */
run;
 
/* 3. Compute the statistic for each bootstrap sample */
/* Suppress output during this step:
   https://blogs.sas.com/content/iml/2013/05/24/turn-off-ods-for-simulations.html
*/
%macro ODSOff();   ods graphics off; ods exclude all;  ods noresults;   %mend;
%macro ODSOn();    ods graphics on;  ods exclude none; ods results;     %mend;
 
%ODSOff;
proc princomp data=BootSamp STD plots=none;
   by SampleID;
   freq NumberHits;
   var &VarList;
   ods output Eigenvalues=EV(keep=SampleID Number Eigenvalue);
run;
%ODSOn;
 
/* 4. Estimate 95% confidence interval as the 2.5th through 97.5th percentiles of bootstrap distribution */
proc univariate data=EV noprint;
   class Number;
   var EigenValue;
   output out=BootCI pctlpre=Boot_ pctlpts=2.5  97.5  pctlname=LCL UCL;
run;
 
/* merge the bootstrap CIs with the normal CIs for comparison */
data AllCI;
   merge AsympCI(keep=Number Eigenvalue Normal:) BootCI(keep=Number Boot:);
   by Number;
run;
 
proc print data=AllCI noobs;
   format Eigenvalue Normal: Boot: 5.3;
run;
Comparison of two methods for estimating confidence intervals of eigenvalues of a correlation matrix

The table displays the bootstrap confidence intervals (columns 5 and 6) next to the asymptotic confidence intervals (columns 3 and 4). It is easier to compare the intervals if you visualize them graphically, as follows:

/* convert from wide to long */
data CIPlot;
   set AllCI;
   Method = "Normal   "; LCL = Normal_LCL; UCL = Normal_UCL; output;
   Method = "Bootstrap"; LCL = Boot_LCL; UCL = Boot_UCL; output;
   keep Method Eigenvalue Number LCL UCL;
run;
 
title "Comparison of Normal and Bootstrap Confidence Intervals";
title2 "Eigenvalues of the Correlation Matrix for the Iris Data";
ods graphics / width=480px height=360px;
proc sgplot data=CIPlot;
   scatter x=Eigenvalue y=Number / group=Method clusterwidth=0.4
           xerrorlower=LCL xerrorupper=UCL groupdisplay=cluster;
   yaxis grid type=discrete colorbands=even colorbandsattrs=(color=gray transparency=0.9);
   xaxis grid;
run;

The graph is shown at the top of this article. The graph nicely summarizes the comparison. For the first (largest) eigenvalue, the bootstrap confidence interval is about half as wide as the normal confidence interval. Thus, the asymptotic result seems too wide for these data. For the other eigenvalues, the normal confidence intervals appear to be too narrow.

If you graph the bootstrap distribution, you can see that the bootstrap distribution does not appear to be multivariate normal. This presumably explains why the asymptotic intervals are so different from the bootstrap intervals. For completeness, the following graph shows a matrix of scatter plots and marginal histograms for the bootstrap distribution. The histograms indicate skewness in the bootstrap distribution.
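The graph can be created by reshaping the bootstrap distribution from long to wide form and calling PROC SGSCATTER. A minimal sketch (the data set name EVWide is my own) follows:

/* Sketch: reshape the bootstrap eigenvalues to wide form and create a
   scatter plot matrix with marginal histograms */
proc transpose data=EV out=EVWide(drop=_NAME_) prefix=Eigen;
   by SampleID;
   id Number;
   var Eigenvalue;
run;
 
title "Bootstrap Distribution of Eigenvalues";
proc sgscatter data=EVWide;
   matrix Eigen1-Eigen4 / diagonal=(histogram);
run;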

Bootstrap distribution for eigenvalues (5000 samples)

Summary

This article shows how to compute confidence intervals for the eigenvalues of an estimated correlation matrix. The first method uses a formula that is valid when the sampling distribution of the eigenvalues is multivariate normal. The second method uses bootstrapping to approximate the distribution of the eigenvalues, then uses percentiles of the distribution to estimate the confidence intervals. For the Iris data, the bootstrap confidence intervals are substantially different from the asymptotic formula.

The post Confidence intervals for eigenvalues of a correlation matrix appeared first on The DO Loop.

October 7, 2020
 

A previous article shows how to use a recursive formula to compute exact probabilities for the Poisson-binomial distribution. The recursive formula is an O(N²) computation, where N is the number of parameters for the Poisson-binomial (PB) distribution. If you have a distribution that has hundreds (or even thousands) of parameters, it might be more efficient to use a different method to approximate the PDF, CDF, and quantiles for the Poisson-binomial distribution. A fast method that is easy to implement is the Refined Normal Approximation (RNA) method, which is presented in Hong (2013, Eqn 14, p. 9-10; attributed to Volkova, 1996). This article discusses the RNA method, when to use it, and a program that implements the method in SAS.

See this introductory article for an overview of the Poisson-binomial distribution.

The normal approximation to the Poisson-binomial distribution

Before talking about the normal approximation, let's plot the exact PDF for a Poisson-binomial distribution that has 500 parameters, each a (random) value between 0 and 1. The PDF is computed by using the recursive-formula method from my previous article. Although the Poisson-binomial distribution is a discrete distribution, the PDF is shown by using a series plot.

Exact PDF for the Poisson-binomial Distribution (N=500)

The PDF looks to be approximately normally distributed. It turns out that this is often the case when the number of parameters is large. The mean and the standard deviation of the normal approximation are easy to compute in terms of the parameters. Let p = {p1, p2, ..., pN} be the parameters for the Poisson-binomial distribution. The mean (μ), standard deviation (σ), and skewness (γ) of the distribution are given by

  • μ = sum(p)
  • σ = sqrt(sum(p # (1-p))), where # is the elementwise multiplication operator
  • γ = sum(p # (1-p) # (1-2p)) / σ³

When N is large, the Poisson-binomial distribution is approximated by a normal distribution with mean μ and standard deviation σ. (This property also holds for the Poisson and the binomial distributions.) The PB distribution might not be symmetric, but the so-called refined normal approximation corrects for skewness in the PB distribution.

Let CDF(k; p) be the Poisson-binomial CDF. That is, CDF(k; p) is the probability of k or fewer successes among the N independent Bernoulli trials, where the probability of success for the j_th trial is p[j]. Then the refined normal approximation (RNA) is
     CDF(k; p) ≈ G((k + 0.5 - μ)/σ), for k = 0, 1, ..., N
where G(x) = Φ(x) + γ (1 – x²) φ(x) / 6. Here Φ and φ are the CDF and PDF, respectively, of the standard normal distribution.

Hong (2013) presents a table (Table 2, p. 15) that shows that the refined normal approximation to the CDF is very close to the exact CDF when N ≥ 500. However, he does not mention that the mean cannot be too small or too large. As a rule of thumb, the interval [μ – 5 σ, μ + 5 σ] should be completely inside the interval [0, N]. In practical terms, you should not use the RNA when Σj pj is close to 0 or N. That is, don't use it when almost all the Bernoulli trials have very small or very large probabilities of success.

Although the computation is given for the CDF, the PDF is easier to visualize and understand. The previous graph shows the exact PDF for N=500 parameters, where each p[j] is randomly sampled from the uniform distribution on (0,1). The mean of the distribution is 255.2; the standard deviation is 9.0. For this set of parameters, the maximum difference between the exact distribution and the refined normal approximation is 1E-5. You can plot the difference between the exact and RNA distributions. The following plot shows that the agreement is very good. Notice that the vertical scale for the difference is tiny! The approximation is only slightly different from the exact distribution.

Difference between Exact and Approximate PDF for the Poisson-binomial Distribution (N=500)

The refined normal approximation in SAS

It is straightforward to use the refined normal approximation to approximate the CDF of the Poisson-binomial distribution in SAS:

  1. Compute the μ, σ, and γ moments from the vector of parameters, p.
  2. Evaluate the refined normal approximation for each value of k.
  3. Because the G function is not always in the interval [0,1], use a "trap and map" technique to force extreme values into the [0,1] interval.

If you know the CDF of a discrete distribution, you can compute the difference between consecutive values to obtain the PDF. Similarly, you can use the definition of quantiles to obtain the quantile function from the PDF.
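The following sketch (my own; not the downloadable program) shows one way to implement the RNA formula for the CDF in SAS/IML:

proc iml;
/* Sketch of the refined normal approximation (RNA) to the Poisson-binomial CDF.
   p = vector of N probabilities; k = vector of values at which to evaluate the CDF. */
start CDFPoisBinomRNA(k, _p);
   p = colvec(_p);
   mu    = sum(p);                                   /* mean */
   sigma = sqrt( sum(p # (1-p)) );                   /* standard deviation */
   skw   = sum( p # (1-p) # (1-2*p) ) / (sigma##3);  /* skewness */
   x = (colvec(k) + 0.5 - mu) / sigma;
   G = cdf("Normal", x) + skw # (1 - x##2) # pdf("Normal", x) / 6;
   return ( (G >< 1) <> 0 );                         /* "trap and map" into [0,1] */
finish;
 
/* example: evaluate the approximate CDF at a few values of k */
call randseed(54321);
p = randfun(500, "Uniform");     /* 500 random parameters in (0,1) */
k = {200, 255, 300};
approx = CDFPoisBinomRNA(k, p);
print k approx;
quit;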

You can download the SAS program that approximates the Poisson-binomial distribution. The program, which is implemented in the SAS/IML matrix language, generates all results and graphs in this article.

Improvements in speed

The RNA algorithm is a direct method that is much faster than the recursive formula:

  • For N=500, my computer computes the exact PDF in about 0.16 seconds, whereas the RNA algorithm only requires 2E-4 seconds. In other words, the RNA algorithm is about 800 times faster.
  • For N=1000, the exact PDF takes about 0.6 seconds, whereas the RNA algorithm only requires 3.4E-4 seconds. This is about 2000 times faster.

The RNA algorithm has a second advantage: you can directly evaluate the CDF (and PDF) functions at specific values of k. In contrast, the recursive formula, by definition, computes the distribution's value at k=j by evaluating all the previous values: k=0, 1, 2, ..., j.

Summary

In summary, when the Poisson-binomial distribution has many parameters, you can approximate the CDF and PDF by using a refined normal approximation. The normal approximation is very good when N ≥ 500 and the mean of the distribution is sufficiently far away from the values 0 and N. When those conditions are met, the RNA is a good approximation to the PB distribution. Not only is the RNA algorithm fast, but you can use it to directly compute the distribution at any value of k without computing all the earlier values.

You can download the SAS/IML program that uses the refined normal approximation to implement the CDF, PDF, and quantile functions for the Poisson-binomial distribution.

The post The Poisson-binomial distribution for hundreds of parameters appeared first on The DO Loop.

September 30, 2020
 

When working with a probability distribution, it is useful to know how to compute four essential quantities: a random sample, the density function, the cumulative distribution function (CDF), and quantiles. I recently discussed the Poisson-binomial distribution and showed how to generate a random sample. This article shows how to compute the PDF for the Poisson-binomial distribution. From the PDF function, you can quickly compute the cumulative distribution (CDF) and the quantile function. For a discrete distribution, the PDF is also known as the probability mass function (PMF).

In a 2013 paper, Y. Hong describes several ways to compute or approximate the PDF for the Poisson-binomial distribution. This article uses SAS/IML to implement one of the recurrence formulas (RF1, Eqn 9) in his paper. Hong attributes this formula to Barlow and Heidtmann (1984). A previous article discusses how to compute recurrence relations in SAS.

A recurrence relation for the Poisson-binomial PDF

You can read my previous article or the Hong (2013) paper to learn more about the Poisson-binomial (PB) distribution. The PB distribution is generated by running N independent Bernoulli trials, each with its own probability of success. These probabilities are the N parameters for the PB distribution: p1, p2, ..., pN.

A PB random variable is the number of successes among the N trials, which will be an integer in the range [0, N]. If X is a PB random variable, the PDF is the probability of observing k successes for k=0,1,...,N. That is, PDF(k) = Pr[X=k]. This probability depends on the N probabilities for the underlying Bernoulli trials.

If you are not interested in how to compute the PDF from a recurrence relation, you can skip to the next section.

Let's assume that we always perform the Bernoulli trials in the same order. (They are independent trials, so the order doesn't matter.) Define ξ[k,j] to be the probability that k successes are observed after performing the first j trials, 0 ≤ k,j ≤ N. Note that PDF(k) = ξ[k,N].

As mentioned earlier, we will use a recurrence relation to compute the PDF for each value of k=0,1,...,N. If you picture the ξ array as a matrix, you can put zeros for all the probabilities in the lower triangular portion of the matrix because the probability of observing j successes is zero if you have performed fewer than j trials.

You can also fill in the first row of the matrix, which is the probability of observing k=0 successes. The only way that you can observe zero successes after j trials is if no trial was a success. Because the trials are independent, the probability is the product Π (1-p[i]), where the product is over i=1..j. This enables us to fill in the top row of the ξ matrix. Consequently, PDF(0) is the product of 1-p[i] over all i=1..N. After completing the first row (the base case), the ξ matrix looks like the following:

Recurrence relation for computing the PDF of the Poisson-binomial distribution: Base Case

The recurrence relation for filling in the table is given by the formula
ξ[k, j] = (1-p[j])*ξ[k,j-1] + p[j]*ξ[k-1,j-1]
This formula enables you to compute the k_th row (starting at the left side) provided that you know the values of the (k-1)st row. A cell in the matrix can be computed if you know the value of the cell in the previous column for that row and for the previous row. The following diagram shows this idea. The arrows in the row for k=1 indicate that you can compute those cells from left to right. First, you compute ξ[1, 1], then ξ[1, 2], then ξ[1, 3], and so forth. When you reach the end of the row, you have computed PDF(1). You then fill in the next row (from left to right) and continue this process until all cells are filled.

Recurrence relation for computing the PDF of the Poisson-binomial distribution

You can understand the recurrence relation in terms of the underlying Bernoulli trials. Suppose you have already performed j-1 trials and are getting ready to perform the j_th trial. There are two situations for which the j_th trial will produce the k_th success:

  • The first j-1 trials produced k successes and the j_th trial is a failure. This happens with probability (1-p[j])*ξ[k,j-1], which is the first term in the recurrence relation.
  • The first j-1 trials produced k-1 successes and the j_th trial is a success. This happens with probability p[j]*ξ[k-1,j-1], which is the second term in the recurrence relation.

Although I have displayed the entire ξ matrix, I will point out that you do not need to store the entire matrix. You can compute each row by knowing only the previous row. This seems to have been overlooked by Hong, who remarked that the "RF1 method can be computer memory demanding when N is large" (p. 8) and "when N = 15,000, approximately 4GB memory is needed for computing the entire CDF." Neither of those statements is true. You only need two arrays of size N, and the required RAM for two arrays of size N = 15,000 is 0.22 MB (or 2.2E-4 GB).

Compute the Poisson-binomial PDF in SAS

You can use the SAS/IML language to implement the recurrence formula. The following program uses N=10 and the same vector of probabilities as in the previous article.

proc iml;
start PDFPoisBinom(_p);
   p = colvec(_p); N = nrow(p);
   R = j(1, N+1, 0);        /* R = previous row of probability matrix */
   S = j(1, N+1, 0);        /* S = current row of probability matrix */
   R[1] = 1;
   R[2:N+1] = cuprod(1-p);  /* first row (k=0) is cumulative product */
   pdf0 = R[N+1];           /* pdf(0) is last column of row */ 
   pdf = j(1, N);
   do k = 1 to N;           /* for each row k=1..N */
      S[,] = 0;             /* zero out the current row */
      do j = k to N;        /* apply recurrence relation from left to right */
         S[j+1] = (1-p[j])*S[j] + p[j]*R[j];  
      end;
      pdf[k] = S[N+1];      /* pdf(k) is last column of row */
      R = S;                /* store this row for the next iteration */
   end;
   return (pdf0 || pdf);    /* return PDF as a row vector */
finish;
 
p = {0.2, 0.2, 0.3, 0.3, 0.4, 0.6, 0.7, 0.8, 0.8, 0.9};  /* 10 trials */
pdf = PDFPoisBinom(p);
print pdf[c=('0':'10') F=Best5.];
PDF of the Poisson-binomial distribution

You can use PROC SGPLOT to create a bar chart that visualizes the PDF:
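A minimal sketch (assuming the PROC IML session above is still active; the data set name PBPDF is my own) writes the PDF values to a data set and calls PROC SGPLOT by using the SUBMIT/ENDSUBMIT statements:

/* Sketch: write the PDF to a SAS data set, then call PROC SGPLOT from
   inside the active IML session by using SUBMIT/ENDSUBMIT */
k = T( 0:(ncol(pdf)-1) );          /* k = 0, 1, ..., N */
Prob = T(pdf);
create PBPDF var {"k" "Prob"};  append;  close;
 
submit;
   title "PDF of the Poisson-Binomial Distribution";
   proc sgplot data=PBPDF;
      vbarparm category=k response=Prob;
      yaxis grid label="Probability";
   run;
endsubmit;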

PDF of the Poisson-binomial distribution

Compute the Poisson-binomial CDF

For a discrete distribution, you can easily compute the CDF and quantiles from the PDF. The CDF is merely the cumulative sum of the PDF values. You can use the CUSUM function in SAS/IML to compute a cumulative sum, so the SAS/IML function that computes the CDF is very short:

start CDFPoisBinom(p);
   pdf = PDFPoisBinom(p);
   return ( cusum(pdf) );   /* The CDF is the cumulative sum of the PDF */
finish;
 
cdf = CDFPoisBinom(p);
print cdf[c=('0':'10') F=Best5.];
CDF of the Poisson-binomial distribution

Instead of displaying the table of values, I have displayed the CDF as a graph. I could have used a bar chart or needle plot, but I chose to visualize the CDF as a step function because that representation is helpful for computing quantiles, as shown in the next section.

Compute the Poisson-binomial quantiles

Given a probability, ρ, the quantile for a discrete distribution is the smallest value for which the CDF is greater than or equal to ρ. The quantile function for the Poisson-binomial distribution is a value, q, in the range [0, N]. Geometrically, you can use the previous graph to compute the quantiles: Draw a horizontal line at height ρ and see where it crosses a vertical line on the CDF graph. That vertical line is located at the value of the quantile for ρ. The following SAS/IML module implements the quantile function:

/* The quantile is the smallest value for which the CDF is greater than or equal to 
   the given probability. */
start QuantilePoisBinom(p, _probs);        /* p = PB parameters; _prob = vector of probabilities for quantiles */
   cdf = CDFPoisBinom(p);                  /* index is one-based */
   probs = colvec(_probs);                 /* find quantiles for these probabilities */
   q = j(nrow(probs), 1, .);               /* allocate result vector */
   do i = 1 to nrow(probs);
      idx = loc( probs[i] <= CDF );        /* find all x such that p <= CDF(x) */ 
      q[i] = idx[1] - 1;                   /* smallest index. Subtract 1 b/c PDF is for k=0,1,...,N  */
   end;
   return ( q );
finish;
 
probs = {0.05, 0.25, 0.75, 0.95};
qntl = QuantilePoisBinom(p, probs);
print probs qntl;
Quantiles of the Poisson-binomial distribution

Summary

By using a recurrence relation, you can compute the entire probability density function (PDF) for the Poisson-binomial distribution. From those values, you can obtain the cumulative distribution (CDF). From the CDF, you can obtain the quantiles. This article implements SAS/IML functions that compute the PDF, CDF, and quantiles. A previous article shows how to generate a random sample from the Poisson-binomial distribution.

You can download the complete SAS program that computes the quantities and creates the graphs in this article.

References

Hong, Y., (2013) "On computing the distribution function for the Poisson binomial distribution," Computational Statistics & Data Analysis, 59:41–51.

September 28, 2020
 

The Poisson-binomial distribution is a generalization of the binomial distribution.

  • For the binomial distribution, you carry out N independent and identical Bernoulli trials. Each trial has a probability, p, of success. The total number of successes, which can be between 0 and N, is a binomial random variable. The distribution of that random variable is the binomial distribution.
  • The Poisson-binomial distribution is similar, but the probability of success can vary among the Bernoulli trials. That is, you carry out N independent but not identical Bernoulli trials. The j_th trial has a probability, pj, of success. The total number of successes, which can be between 0 and N, is a random variable. The distribution of that random variable is the Poisson-binomial distribution.

This article shows how to generate a random sample from the Poisson-binomial distribution in SAS. Simulating a random sample is a great way to begin exploring a new distribution because the empirical density and empirical cumulative distribution enable you to see the shape of the distribution and how it depends on parameters.

Generate a random binomial sample

Before looking at the Poisson-binomial distribution, let's review the more familiar binomial distribution. The binomial distribution has two parameters: the probability of success (p) and the number of Bernoulli trials (N). The output from a binomial distribution is a random variable, k. The random variable is an integer between 0 and N and represents the number of successes among the N Bernoulli trials. If you perform many draws from the binomial distribution, the sample will look similar to the underlying probability distribution, which has mean N*p and variance N*p*(1-p).

SAS supports sampling from the binomial distribution by using the RAND function. The following SAS DATA step generates a random sample of 1,000 observations from the binomial distribution and plots the distribution of the sample:

/* Generate a random sample from the binomial distribution */
/* The Easy Way: call rand("Binom", p, N) */
%let m = 1000;              /* number of obs in random sample */
data BinomSample;
call streaminit(123);
p = 0.8; N = 10;            /* p = prob of success; N = num trials */
label k="Number of Successes";
do i = 1 to &m;
   k = rand("Binom", p, N); /* k = number of successes in N trials */
   output;
end;
keep i k;
run;
 
title "Binomial Sample";
footnote J=L "Sample Size = &m";
proc sgplot data=BinomSample;
   vbar k;
   yaxis grid;
run;
A random sample from the binomial distribution

This is the standard way to generate a random sample from the binomial distribution. In fact, with appropriate modifications, this program shows the standard way to simulate a random sample of size m from ANY of the built-in probability distributions in SAS.

Another way to simulate binomial data

Let's pretend for a moment that SAS does not support the binomial distribution. You could still simulate binomial data by making N calls to the Bernoulli distribution and counting the number of successes. The following program generates a binomial random variable by summing the results of N Bernoulli random variables:

/* The Alternative Way: Make N calls to rand("Bernoulli", p) */
data BinomSample2(keep=i k);
call streaminit(123);
p = 0.8; N = 10;
label k="Number of Successes";
do i = 1 to &m;
   k = 0;                            /* initialize k=0 */
   do j = 1 to N;                    /* sum of N Bernoulli variables */
      k = k + rand("Bernoulli", p);  /* every trial has the same probability, p */
   end;
   output;
end;
run;

The output data set is also a valid sample from the Binom(p, N) distribution. The program shows that you can replace the single call to RAND("Binom",p,N) with N calls to RAND("Bernoulli",p). You then must add up all the binary (0 or 1) random variables from the Bernoulli distribution.

Simulate Poisson-binomial data

The program in the previous section can be modified to generate data from the Poisson-binomial distribution. Instead of using the same probability for all Bernoulli trials, you can define an array of probabilities and use them to generate the Bernoulli random variables. This is shown by the following program, which generates the number of successes in a sequence of 10 Bernoulli trials where the probability of success varies among the trials:

/* generate a random sample from the Poisson-binomial distribution 
   that has 10 parameters: p1, p2, p3, ..., p10 */
data PoisBinomSample(keep=i k);
/* p[j] = probability of success for the j_th trial, j=1,2,...,10 */
array p[10] (0.2 0.2 0.3 0.3 0.4 0.6 0.7 0.8 0.8 0.9);  /* N = dim(p) */
call streaminit(123);
label k="Number of Successes";
do i = 1 to &m;
   k = 0;                              /* initialize k=0 */
   do j = 1 to dim(p);                 /* sum of N Bernoulli variables */
      k = k + rand("Bernoulli", p[j]); /* j_th trial has probability p[j] */
   end;
   output;
end;
run;
 
title "Poisson-Binomial Sample";
proc sgplot data=PoisBinomSample;
   vbar k;
   yaxis grid;
run;
A random sample from the Poisson-binomial distribution

The graph shows the distribution of a Poisson-binomial random sample. Each observation in the sample is the result of running the 10 trials and recording the number of successes. You can see from the graph that many of the trials resulted in 5 successes, although 4 or 6 are also very likely. For these parameters, it is rare to see 0 or 1 success, although both occurred during the 1,000 sets of trials. Seeing 10 successes is mathematically possible but did not occur in this simulation.

In this example, the vector of probabilities has both high and low probabilities. The probability of success is 0.2 for one trial and 0.9 for another. If you change the parameters in the Poisson-binomial distribution, you can get distributions that have different shapes. For example, if all the probabilities are small (close to zero), then the distribution will be positively skewed and the probability of seeing 0, 1, or 2 successes is high. Similarly, if all the probabilities are large (close to one), then the distribution will be negatively skewed and there is a high probability of seeing 8, 9, or 10 successes. If all probabilities are equal, then you get a binomial distribution.

Simulate Poisson-binomial data by using SAS/IML

The SAS/IML language makes it easy to encapsulate the Poisson-binomial simulation into a function. The function has two input arguments: the number of observations to simulate (m) and the vector of probabilities (p). The output of the function is a vector of m integers. For example, the following SAS/IML function implements the simulation of Poisson-binomial data:

proc iml;
/* Simulate from the Poisson-Binomial distribution. 
   Input: m = number of observations in sample
          p = column vector of N probabilities. The probability 
              of success for the j_th Bernoulli trial is p_j.
   Output: row vector of m realizations of X ~ PoisBinom(p) 
*/
start RandPoisBinom(m, _p);
   p = colvec(_p);
   b = j(nrow(p), m);     /* each column is a binary indicator var */
   call randgen(b, "Bernoulli", p); 
   return( b[+, ] );      /* return numSuccesses = sum each column */
finish;
 
/* The Poisson-binomial has N parameters: p1, p2, ..., pN */
p = {0.2 0.2 0.3 0.3 0.4 0.6 0.7 0.8 0.8 0.9};  /* 10 trials */
call randseed(123);
S = RandPoisBinom(&m, p);
call bar(S) grid="y" label="Number of Successes";
A random sample from the Poisson-binomial distribution

The SAS/IML program uses the RANDGEN function to fill up an N x m matrix with values from the Bernoulli distribution. The RANDGEN function supports a vector of parameters, which means that you can easily specify that each column should have a different probability of success. After filling the matrix with binary values, you can use the summation subscript reduction operator to obtain the number of successes (1s) in each column. The result is a row vector that contains m integers, each of which is the number of successes from a set of N Bernoulli trials with the given probabilities.

The PROC IML program uses the same parameters as the DATA step simulation. Accordingly, the sample distributions are similar.

The expected value of the Poisson-binomial distribution is the sum of the vector of probabilities. The variance is the sum of the individual Bernoulli variances. You can compute the mean and variance of the distribution and compare them to the sample mean and variance:

/* Expected values: mean and variance of the Poisson-binomial distribution */
mu = sum(p);
var = sum( (1-p)#p );
/* sample estimates of mean and variance */
XBar = mean(S`);
s2 = var(S`);
Moments = (mu//XBar) || (var//s2);
print Moments[f=5.2 r={"Distribution" "Sample Estimate"} c={"Mean" "Variance"}];
Expected values and sample statistics for the Poisson-binomial distribution

The expected number of successes in the Poisson-binomial distribution with these parameters is 5.2. In the sample, the average number of successes is 5.17. The variance of the Poisson-binomial distribution is 1.84. The variance of this sample is 1.75. The sample statistics are close to their expected values, which is what you expect to happen for a large random sample.

Summary

This article shows how to simulate data from the Poisson-binomial distribution. A random variable that follows the Poisson-binomial distribution gives the total number of successes in N Bernoulli trials, where the j_th trial has probability p_j of success. The example in this article uses a 10-parameter vector of probabilities. If all parameter values equal a common value, p, then the Poisson-binomial distribution reduces to the standard Binom(p, 10) distribution. However, the Poisson-binomial distribution allows the probabilities to be different.

You can download the SAS program that computes the quantities and creates the graphs in this article.

September 23, 2020
 

Many textbooks and research papers present formulas that involve recurrence relations. Familiar examples include:

  • The factorial function: Set Fact(0)=1 and define Fact(n) = n*Fact(n-1) for n > 0.
  • The Fibonacci numbers: Set Fib(0)=1 and Fib(1)=1 and define Fib(n) = Fib(n-1) + Fib(n-2) for n > 1.
  • The binomial coefficients (combinations of "n choose k"): For a set that has n elements, set Comb(n,0)=1 and Comb(n,n)=1 and define Comb(n,k) = Comb(n-1, k-1) + Comb(n-1, k) for 0 < k < n.
  • Time series models: In many time series models (such as the ARMA model), the response and error terms depend on values at previous time points. This leads to recurrence relations.

Sometimes the indices begin at 0, other times they begin at 1. Working efficiently with sequences and recurrence relations can be a challenge. The SAS DATA step is designed to process one observation at a time, but a recurrence relation requires looking at past values. If you use the SAS DATA step to work with recurrence relations, you need to use tricks to store and refer to previous values. Another challenge is indexing: most SAS programmers are used to one-based indexing, whereas some formulas use zero-based indexing.

This article uses the Fibonacci numbers to illustrate how to deal with some of these issues in SAS. There are several definitions of the Fibonacci numbers, but we'll look at the definition that uses zero indexing: F0 = 1, F1 = 1, and Fn = Fn-1 + Fn-2 for n > 1. In this article, I will use n ≤ 7 to demonstrate the concepts.

Recurrence relations by using arrays

The SAS DATA step supports arrays, and you can specify the indices of the arrays. Therefore, you can specify that an array index starts with zero. The following DATA step generates "wide" data. There is one observation, and the variables are named F0-F7.

data FibonacciWide;
array F[0:7] F0-F7;  /* index starts at 0 */
F[0] = 1;            /* initialize first values */
F[1] = 1;
do i = 2 to 7;       /* apply recurrence relation */
   F[i] = F[i-1] + F[i-2];
end;
drop i;
run;
 
proc print noobs; run;

You can mentally check that each number is the sum of the two previous numbers. Using an array is convenient and easy to program. Unfortunately, this method is not very useful in practice. In practice, you usually want the data in long form rather than wide form. For example, if you simulate time series data, each observation represents a time point and the values depend on previous time points.
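
If you need the long form, a minimal sketch (not part of the original article) such as the following uses PROC TRANSPOSE to reshape the wide data set so that there is one row per term:

/* sketch: reshape the wide Fibonacci data into long form (one row per term) */
proc transpose data=FibonacciWide out=FibLong(rename=(Col1=F)) name=Term;
run;
 
proc print data=FibLong noobs; run;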

Recurrence relations by using extra variables

In the DATA step, you can use the concept of "lags" to implement a recurrence relation. A lag is any previous value. If we are currently computing the n_th value, the first lag is the (n-1)th value, which is the previous observation. The second lag is the (n-2)th value and so forth. One way to look at previous values is to save them in an extra variable. The following DATA step uses F1 to hold the first lag of F and F2 to hold the second lag of F. Recall that the SUM function will return the sum of the nonmissing values, so sum(F1, F2) is nonmissing if F1 is nonmissing.

data FibonacciLong;
F1 = 1; F2 = .;        /* initialize lags */
i=0; F = F1; output;   /* manually output F(0) */
do i = 1 to 7;         /* now iterate: F(i) = F(i-1) + F(i-2) */
   F = sum(F1, F2);
   output;
   F2 = F1;            /* update lags for the next iteration */
   F1 = F;
end;
run;
 
proc print noobs; 
   var i F F1 F2; 
run;

Recurrence relations by using the LAG function

The DATA step supports a LAGn function. The LAGn function maintains a queue of length n, which initially contains missing values. Every time you call the LAGn function, it pops the top of the queue, returns that value, and adds the current value of its argument to the end of the queue. The LAGn functions can be surprisingly difficult to use in a complicated program because the queue is only updated when the function is called. If you have conditional IF-THEN/ELSE logic, you need to be careful to keep the queues up to date. I confess that I am often frustrated when I try to use the LAG function with recurrence relations. I find it easier to use extra variables.
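
To see how the queue behaves, the following DATA step is a minimal sketch (not part of the original article) that calls the LAG2 function on a simple sequence. The first two calls return missing values; after that, each call returns the value from two calls earlier:

/* sketch: the LAG2 queue returns missing values until it has been called twice */
data LagDemo;
do x = 1 to 5;
   prev2 = lag2(x);    /* returns ., ., 1, 2, 3 */
   output;
end;
run;
 
proc print data=LagDemo noobs; run;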

But since the documentation for the LAGn function includes the Fibonacci numbers among its examples, I include it here. Notice the clever use of calling LAG1(F) and immediately assigning the new value of F, so that the program only needs to compute the first lag and does not use any extra variables:

/* modified from documentation for the LAG function */
data FibonacciLag;
i=0; F=1; output;       /* initialize and output F[0] */
/* lag1(F) is missing when _N_=1, but equals F[i-1] in later iters */
do i = 1 to 7;
   F = sum(F, lag(F));  /* iterate: F(i) = F(i-1) + F(i-2) */
   output;
end;
run;
 
proc print noobs; run;

The values are the same as for the previous program, but notice that this program does not use any extra variables to store the lags.

Recurrence relations in SAS/IML

Statistical programmers use the SAS/IML matrix language because of its power and flexibility, but also because you can vectorize computations. A vectorized computation is one in which an iterative loop is replaced by a matrix-vector computation. Unfortunately, for many time series computations, you cannot completely get rid of the loops because the series is defined in terms of a recurrence relation. You cannot compute the n_th value until you have computed the previous values.

The easiest way to implement a recurrence relation in the SAS/IML language is to mimic the DATA step program that generates the "wide form" sequence. The PROC IML code is almost identical to the DATA step code:

proc iml;
N = 8;
F = j(1, N, 1);            /* initialize F[1]=F[2]=1 */
do i = 3 to N;             
   F[i] = F[i-1] + F[i-2]; /* overwrite F[i] with the sum of previous terms, i > 2 */
end;
labls = 'F0':'F7';
print F[c=labls];

The program is efficient and straightforward. The output is not shown. The SAS/IML language does not support 0-based indexing, so for recurrence relations that start at 0, I usually define the indexes to match the recurrence relation and add 1 to the subscripts of the IML vectors. For example, if you index from 0 to 7 in the previous program, the body of the loop becomes F[i+1] = F[i]+ F[i-1].
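
For example, the following statements (a sketch that is not part of the original article; they continue the same PROC IML session) let the loop index match the 0-based recurrence and shift the subscripts by 1:

F = j(1, N, 1);             /* F[1] stores F_0, F[2] stores F_1, and so on */
do i = 2 to N-1;            /* i is the 0-based index of the term being computed */
   F[i+1] = F[i] + F[i-1];  /* F_i = F_{i-1} + F_{i-2}, shifted by 1 for IML subscripts */
end;
print F[c=labls];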

Linear recurrence relations and matrix iteration

The Fibonacci sequence is an example of a linear recurrence relation. For a linear recurrence relation, you can use matrices and vectors to generate values. You can define the Fibonacci matrix to be the 2 x 2 matrix with values {0 1, 1 1}. The Fibonacci matrix transforms a vector {x1, x2} into the vector {x2, x1+x2}. In other words, it moves the second element to the first element and replaces the second element with the sum of the elements. If you iterate this transformation, the iterates generate the Fibonacci numbers, as shown in the following statements:

M = {0 1,                  /* Fibonacci matrix */
     1 1};
v = {1, 1};                /* initial state */
F = j(1, N, 1);            /* initialize F[1]=1 */
do i = 2 to N;             
   v = M*v;                /* generate next Fibonacci number */
   F[i] = v[1];            /* save the number in an array */
end;
print F[c=labls];

You can use this method to generate the sequence for any linear recurrence relation, whereas the previous methods work for both linear and nonlinear recurrence relations. The SAS/IML language also supports a LAG function, although it works differently from the DATA step function of the same name.
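
For reference, the following statements (a sketch that is not part of the original article; they continue the PROC IML session) show that the SAS/IML LAG function lags an entire vector in a single call instead of maintaining a per-call queue:

x = {1, 2, 3, 4, 5};
xLag1 = lag(x, 1);          /* = {., 1, 2, 3, 4}: the whole lagged vector at once */
print x xLag1;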

Summary

This article shows a few ways to work with recurrence relations in SAS. You can use DATA step arrays to apply relations for "wide form" data. You can use extra variables or the LAG function for long form data. You can perform similar computations in PROC IML. If the recurrence relation is linear, you can also use matrix-vector computations to apply each step of the recurrence relation.

The post Working with recurrence relations in SAS appeared first on The DO Loop.

September 21, 2020
 

A previous article discussed how to solve regression problems in which the parameters are constrained to be a specified constant (such as B1 = 1) or are restricted to obey a linear equation such as B4 = –2*B2. In SAS, you can use the RESTRICT statement in PROC REG to solve restricted least squares problems. However, if a constraint is an INEQUALITY instead of an EQUALITY, then (in general) a least squares method does not solve the problem. Therefore, you cannot use PROC REG to solve problems that have inequality constraints.

This article shows how to use PROC NLIN to solve linear regression problems whose regression coefficients are constrained by a linear inequality. Examples are B1 ≥ 3 or B1 + B2 ≥ 6.
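
The examples in this article reuse the RegSim data set from the previous article. That DATA step is not repeated here; the following sketch generates data that has the required structure, but the coefficient values, error standard deviation, and sample size are assumptions for illustration, not the values from the original simulation:

/* hedged sketch: simulate data for the model
   Y = B0 + B1*X1 + B2*X2 + B3*X3 + B4*X4 + eps, where B3 = B1 and B4 = -2*B2.
   The coefficients (B0=2, B1=3, B2=1), error SD, and sample size are assumed. */
data RegSim;
call streaminit(54321);
do i = 1 to 100;
   x1 = rand("Normal");  x2 = rand("Normal");
   x3 = rand("Normal");  x4 = rand("Normal");
   Y = 2 + 3*x1 + 1*x2 + 3*x3 - 2*x4 + rand("Normal", 0, 0.5);
   output;
end;
drop i;
run;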

PROC NLIN and constrained regression problems

Before solving a problem that has inequality constraints, let's see how PROC NLIN solves a problem that has linear equality constraints. The following statements use the data and model from the previous article. The model is
      Y = B0 + B1*X1 + B2*X2 + B3*X3 + B4*X4 + ε
where B3 = B1 and B4 = –2*B2. In PROC NLIN, you can replace the B3 and B4 parameters, thus leaving only three parameters in the model. If desired, you can use the ESTIMATE statement to recover the value of B4, as follows:

/* You can solve the restricted problem by using PROC NLIN */
proc nlin data=RegSim noitprint;      /* use NLINMEASURES to estimate MSE */
   parameters b0=0 b1=3 b2=1;         /* specify names of parameters and initial guess */
   model Y = b0 + b1*x1 +   b2*x2 +
                  b1*x3 - 2*b2*x4;   /* replace B3 = B1 and B4 = -2*B2 */
   estimate 'b4' -2*b2;              /* estimate B4 */
run;

The parameter estimates are the same as those produced by PROC REG in the previous article.

The syntax for PROC NLIN is natural and straightforward. The PARAMETERS statement specifies the names of the parameters in the problem; the other symbols (X1-X4, Y) refer to data set variables. You must specify an initial guess for each parameter. For nonlinear regressions, this can be a challenge, but there are some tricks you can use to choose a good initial guess. For LINEAR regressions, the initial guess shouldn't matter: you can use all zeros or all ones, if you want.

Boundary constraints

Suppose that you want to force the regression coefficients to satisfy certain inequality constraints. You can use the BOUNDS statement in PROC NLIN to specify the constraints. The simplest type of constraint is to restrict a coefficient to a half-interval or interval. For example, the following call to PROC NLIN restricts B1 ≥ 3 and restricts B2 to the interval [0, 4].

/* INEQUALITY constraints on the regression parameters */
proc nlin data=RegSim;
   parameters b0=0 b1=3 b2=1;         /* initial guess */
   bounds b1 >= 3, 
          0<= b2 <= 4;
   model Y = b0 + b1*x1 + b2*x2 + b1*x3 - 2*b2*x4;
   ods select ParameterEstimates;
run;

The solution that minimizes the residual sum of squares (subject to the constraints) places the B1 parameter on the constraint B1=3. Therefore, a standard error is not available for that parameter estimate.

You can also use the BOUNDS statement to enforce simple relationships between parameters. For example, you can specify
     bounds b1 >= b2;
to require that the B1 coefficient is greater than or equal to the B2 coefficient.

More general linear constraints

The BOUNDS statement is for simple relationships. For more complicated relationships, you might need to reparameterize the model. For example, the BOUNDS statement does not accept the following syntax:
     bounds b1 >= 2*b2; /* INVALID SYNTAX! */
However, you can reparameterize the problem by introducing a new parameter C = 2*B2. You can then systematically substitute (C/2) everywhere that B2 appears. For example, the following statements show the constrained model in terms of the new parameter, C. You can use the ESTIMATE statement to find the estimates for the original parameter.

/* let c = 2*b2 be the new parameter. Then substitute c/2 for b2 in the MODEL statement: */
proc nlin data=RegSim;
   parameters b0=0 b1=3 c=2;         /* initial guess */
   bounds b1 >= c;
   model Y = b0 + b1*x1 + (c/2)*x2 + b1*x3 - c*x4;
   estimate 'b2' c/2;                /* estimate original parameter */
run;

The output from this procedure is not shown.

You can use a similar trick to handle linear constraints such as B1 + B2 ≥ 6. Move the B2 term to the right side of the inequality and define C = 6 – B2. The constraint becomes B1 ≥ C and on the MODEL statement you can substitute (6–C) everywhere that B2 appears, as follows:

/* Define c = 6 - b2 so that b2=6-c */
proc nlin data=RegSim;
   parameters b0=0 b1=5 c=5;
   bounds b1 >= c;
   model Y = b0 + b1*x1 + (6-c)*x2 + b1*x3 - 2*(6-c)*x4;
   estimate 'b2' 6-c;              /* estimate original parameter */
   ods select ParameterEstimates AdditionalEstimates;
run;

Summary

In summary, you can use the NLIN procedure to solve linear regression problems that have linear constraints among the coefficients. Each equality constraint enables you to eliminate one parameter in the MODEL statement. You can use the BOUNDS statement to specify simple inequality constraints. For more complicated constraints, you can reparameterize the model and again use the BOUNDS statement to specify the constraints. If desired, you can use the ESTIMATE statement to estimate the parameters in the original model.

The post Regression with inequality constraints on parameters appeared first on The DO Loop.

September 8, 2020
 

Matrix balancing is an interesting problem that has a long history. Matrix balancing refers to adjusting the cells of a frequency table to match known values of the row and column sums. One of the early algorithms for matrix balancing is known as the RAS algorithm, but it is also called the raking algorithm in some fields. The presentation in this article is inspired by a paper by Carol Alderman (1992).

The problem: Only marginal totals are known

Alderman shows a typical example (p. 83), but let me give a smaller example. Suppose a troop of six Girl Scouts sells seven different types of cookies. At the end of the campaign, each girl reports how many boxes she sold. Also, the troop leader knows how many boxes of each cookie type were sold. However, no one kept track of how many boxes of each type were sold by each girl.

The situation is summarized by the following 7 x 6 table. We know the row totals, which are the numbers of boxes that were sold for each cookie type. We also know the column totals, which are the numbers of boxes that were sold by each girl. But we do not know the values of the individual cells, which are the type-by-girl totals.

Guessing the cells of a matrix

As stated, the problem usually has infinitely many solutions. For an r x c table, there are r*c cells, but the marginal totals provide only r + c linear constraints. (For the 7 x 6 table in this example, there are 42 cells but only 13 marginal totals.) Except for 2 x 2 tables, there are more unknowns than equations, which leads to an underdetermined linear system that (typically) has infinitely many solutions.

Fortunately, we can often obtain a unique solution if we provide an informed guess for the cells in the table. Perhaps we can ask the girls to provide estimates of the number of boxes of each type that they sold. Or perhaps we have values for the cells from the previous year. In either case, if we can estimate the cell values, we can use a matrix balancing algorithm to adjust the cells so that they match the observed marginal totals.

Let's suppose that the troop leader asks the girls to estimate (from memory) how many boxes of each type they sold. The estimates are shown in the following SAS/IML program. The row and column sums of the estimates do not match the known marginal sums, but that's okay. The program uses subscript reduction operators to compute the marginal sums of the girls' estimates. These are displayed next to the actual totals, along with the differences.

proc iml;
/* Known marginal totals: */
u = {260, 214, 178, 148, 75, 67, 59}; /* total boxes sold, by type */
v = {272 180 152 163 134 100};        /* total boxes sold, by girl */
 
/* We don't know the cell values that produce these marginal totals,
   but we can ask each girl to estimate how many boxes of each type she sold */
/*   G1 G2 G3 G4 G5 G6          */
A = {75 45 40 40 40 30 ,  /* C1 */
     40 35 45 35 30 30 ,  /* C2 */
     40 25 30 40 30 20 ,  /* C3 */
     40 25 25 20 20 20 ,  /* C4 */
     30 25  0 10 10  0 ,  /* C5 */
     20 10 10 10 10  0 ,  /* C6 */
     20 10  0 10  0  0 }; /* C7 */
 
/* notice that the row/col totals for A do not match the truth: */
rowTotEst = A[ ,+];        /* for each row, sum across all cols */
colTotEst = A[+, ];        /* for each col, sum down all rows   */
Items  = 'Cookie1':'Cookie7';                  /* row labels    */
Person = 'Girl1':'Girl6';                      /* column labels */
 
/* print known totals, estimated totals, and difference */
rowSums = (u || rowTotEst || (u-rowTotEst));
print rowSums[r=Items c={'rowTot (u)' 'rowTotEst' 'Diff'}];
colSums = (v // colTotEst // (v-colTotEst));
print colSums[r={'colTot (v)' 'colTotEst' 'Diff'} c=Person];

You can see that the marginal totals do not match the known row and column totals. However, we can use the guesses as an initial estimate for the RAS algorithm. The RAS algorithm will adjust the estimates to obtain a new matrix that DOES satisfy the marginal totals.

Adjusting a matrix to match marginal totals

The RAS algorithm can iteratively adjust the girls' estimates until the row sums and column sums of the matrix match the known row sums and column sums. The resulting matrix will probably not be an integer matrix. It will, however, reflect the structure of the girls' estimates in three ways:

  1. It will approximately preserve each girl's recollection of the relative proportions of cookie types that she sold. If a girl claims that she sold many of one type of cookie and few of another, that relationship will be reflected in the balanced matrix.
  2. It will approximately preserve the relative proportions of cookie types among girls. If one girl claims that she sold many more of one type of cookie than another girl, that relationship will be reflected in the balanced matrix.
  3. It will preserve zeros. If a girl claims that she did not sell any of one type of cookie, the balanced matrix will also contain a zero for that cell.

The RAS algorithm is explained in Alderman (1992). The algorithm starts with a nonnegative r x c matrix, A. Let u be the observed r x 1 vector of row sums. Let v be the observed 1 x c vector of column sums. The main steps of the RAS algorithm are:

  1. Initialize X = A.
  2. Scale the rows. Let R = u / RowSum(X), where 'RowSum' is a function that returns the vector of row sums. Update X ← R # X. Here R#X indicates elementwise multiplication of each column of X by the corresponding element of R.
  3. Scale the columns. Let S = v / ColSum(X), where 'ColSum' is a function that returns the vector of column sums. Update X ← X # S. Here X#S indicates elementwise multiplication of each row of X by the corresponding element of S.
  4. Check for convergence. If RowSum(X) is close to u and ColSum(X) is close to v, then the algorithm has converged, and X is the balanced matrix. Otherwise, return to Step 2.

The RAS algorithm is implemented in the following SAS/IML program:

/* Use the RAS matrix scaling algorithm to find a matrix X that 
   approximately satisfies 
   X[ ,+] = u  (marginals constraints on rows)
   X[+, ] = v  (marginals constraints on cols)
   and X is obtained by scaling rows and columns of A */
start RAS(A, u, v);
   X = A;       /* Step 1: Initialize X */
   converged = 0;
   do k = 1 to 100 while(^Converged);
      /* Step 2: Row scaling: X = u # (X / X[ ,+]); */
      R = u / X[ ,+];
      X = R # X;
 
      /* Step 3: Column scaling: X = (X / X[+, ]) # v; */
      S = v / X[+, ];
      X = X # S;
 
      /* Step 4: check for convergence: u=rowTot and v=colTot */
      rowTot = X[ ,+];   /* for each row, sum across all cols */
      colTot = X[+, ];   /* for each col, sum down all rows */
      maxDiff = max( abs(u-rowTot), abs(v-colTot) );
      Converged = (maxDiff < 1e-8);
   end;
   return( X );
finish;
 
X = RAS(A, u, v);
print X[r=Items c=Person format=7.1];

The matrix X has row and column sums that are close to the specified values. The matrix also preserves relationships among individual columns and rows, based on the girls' recollections of how many boxes they sold. The algorithm preserves zeros for the girls who claim they sold zero boxes of a certain type. It does not introduce any new zeros for cells.
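
As a quick check, the following statements (a sketch that is not part of the original article; they continue the PROC IML session) verify that X is zero exactly where A is zero and that the row and column sums of X match the known totals:

zerosPreserved = all( (A=0) = (X=0) );   /* 1 if X is zero exactly where A is zero */
maxRowDiff = max( abs(u - X[ ,+]) );     /* row sums of X vs. known row totals */
maxColDiff = max( abs(v - X[+, ]) );     /* column sums of X vs. known column totals */
print zerosPreserved maxRowDiff maxColDiff;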

You can't determine how close the X matrix is to "the truth," because we don't know the true number of boxes that each girl sold. The X matrix depends on the girls' recollections and estimates. If the recollections are close to the truth, then X will be close to the truth as well.

Summary

This article looks at the RAS algorithm, which is one way to perform matrix balancing on a table that has nonnegative cells. "Balancing" refers to adjusting the interior cells of a matrix to match observed marginal totals. The RAS algorithm starts with an estimate, A. The estimate does not need to satisfy the marginal totals, but it should reflect relationships between the rows and columns that you want to preserve, such as relative proportions and zero cells. The RAS algorithm iteratively scales the rows and columns of A to create a new matrix that matches the observed row and column sums.

It is worth mentioning that the row scaling can be accumulated into a diagonal matrix, R, and the column scaling into a diagonal matrix, S. Thus, the balanced matrix is the product of these diagonal matrices and the original estimate, A. As a matrix equation, you can write X = RAS. It is this matrix factorization that gives the RAS algorithm its name.
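
The following statements sketch this idea (they are not part of the original article and continue the PROC IML session). The loop accumulates the row and column scaling factors into vectors r and s and then verifies that diag(r)*A*diag(s) reproduces the balanced matrix:

Xf = A;
r = j(nrow(A), 1, 1);                  /* accumulated row scaling factors */
s = j(1, ncol(A), 1);                  /* accumulated column scaling factors */
do k = 1 to 100 until( max(abs(u - Xf[ ,+]), abs(v - Xf[+, ])) < 1e-8 );
   Rk = u / Xf[ ,+];   Xf = Rk # Xf;   r = r # Rk;   /* row scaling */
   Sk = v / Xf[+, ];   Xf = Xf # Sk;   s = s # Sk;   /* column scaling */
end;
maxDiff = max( abs(Xf - diag(r)*A*diag(s)) );        /* should be essentially zero */
print maxDiff;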

The RAS algorithm is not the only method for matrix balancing, but it is the simplest. Another algorithm, called iterative proportional fitting (IPF), will be discussed in a second article.

The post Matrix balancing: Update matrix cells to match row and column sums appeared first on The DO Loop.