
November 23, 2020
 

I previously showed how to create a decile calibration plot for a logistic regression model in SAS. A decile calibration plot (or "decile plot," for short) is used in some fields to visualize agreement between the data and a regression model. It can be used to diagnose an incorrectly specified model.

In principle, the decile plot can be applied to any regression model that has a continuous regressor. In this article, I show how to create two decile plots for an ordinary least squares regression model. The first overlays a decile plot on a fit plot, as shown to the right. The second decile plot shows the relationship between the observed and predicted responses.

Although this article shows how to create decile plots in SAS, I do not necessarily recommend them. In many cases, it is better and simpler to use a loess fit, as I discuss at the end of this article.

A misspecified model

A decile plot can be used to identify an incorrectly specified model. Suppose, for example, that a response variable (Y) depends on an explanatory variable (X) in a nonlinear manner. If you model Y as a linear function of X, then you have specified an incorrect model. The following SAS DATA step simulates a response variable (Y) that depends strongly on a linear effect and weakly on quadratic and cubic effects for X. Because the nonlinear dependence is weak, a traditional fit plot does not indicate that the model is misspecified:

/* Y = b0 + b1*x + b2*x**2 + b3*x**3 + error, where b2 and b3 are small */
%let N = 200;                      /* sample size */ 
data Have(keep= Y x);
call streaminit(1234);             /* set the random number seed */
do i = 1 to &N;                    /* for each observation in the sample  */
   x = rand("lognormal", 0, 0.5);  /* x = explanatory variable */
   y = 20 + 5*x - 0.5*(x-3)**3 +   /* include weak cubic effect */
            rand("Normal", 0, 4); 
   output;
end;
run;
 
proc reg data=Have;               /* Optional: plots=residuals(smooth); */
   model y = x;
quit;

The usual p-values and diagnostic plots (not shown) do not reveal that the model is misspecified. In particular, the fit plot, which shows the observed and predicted response versus the explanatory variable, does not indicate that the model is misspecified. At the end of this article, I discuss how to get PROC REG to overlay additional information about the fit on its graphs.

Overlay a decile plot on a fit plot

This section shows how to overlay a decile plot on the fit plot. The fit plot shows the observed values of the response (Y) for each value of the explanatory variable (X). You can create a decile fit plot as follows:

  1. Bin the X values into 10 bins by using the deciles of the X variable as cut points. In SAS, you can use PROC RANK for this step.
  2. For each bin, compute the mean Y value and a confidence interval for the mean. At the same time, compute the mean of the X values in each bin, which will serve as the horizontal plotting position for the bin. In SAS, you can use PROC MEANS for this step.
  3. Overlay the fit plot with the 10 confidence intervals and mean values for Y. You can use PROC SGPLOT for this step.
/* 1. Assign each obs to a decile for X */
proc rank data=Have out=Decile groups=10 ties=high;
   var x;           /* variable on which to group */
   ranks Decile;    /* name of variable to contain groups 0,1,...,k-1 */
run;
 
/* 2. For each decile, compute the mean of X and the mean of Y (with a CI for the mean of Y) */
proc means data=Decile noprint;
   class Decile;
   types Decile;
   var y x;
   output out=DecileOut mean=YMean XMean 
          lclm=yLower uclm=yUpper;  /* only compute limits for Y */
run;
 
/* 3. Create a fit plot and overlay the deciles. */
data All;
   set Have DecileOut; 
run; 
 
title "Decile Plot";
title2 "Observed and Predicted Response versus Explanatory";
proc sgplot data=All noautolegend;
   reg x=x y=y / nomarkers CLM alpha=0.1 name="reg";
   scatter x=xMean y=YMean / yerrorlower=yLower yerrorupper=yUpper 
           name="obs" legendlabel="Mean Y and 90% CI" markerattrs=GraphData2;
   fringe x / transparency=0.5;
   xaxis label="x" values=(0 to 4.5 by 0.5) valueshint;
   yaxis label="y" offsetmin=0.05;
   keylegend "reg" "obs" / position=NW location=inside across=1;
run;

The graph is shown at the top of this article. I use the FRINGE statement to display the locations of the X values at the bottom of the plot. The graph indicates that the model might be misspecified. The mean Y values for the first three deciles are all above the predicted value from the model. For the next six deciles, the mean Y values are all below the model's predictions. This kind of systematic deviation from the model can indicate that the model is missing an important component. In this case, we know that the true response is cubic whereas the model is linear.

A decile plot of observed versus predicted response

You can create a fit plot when the model contains a single continuous explanatory variable. If you have multiple explanatory variables, you can assess the model by plotting the observed responses against the predicted responses. You can overlay a decile plot by computing the average observed response for each decile of the predicted response. When you hear someone use the term "decile plot," they are probably referring to this second plot, which is applicable to most regression models.

The SAS program is almost identical to the program in the previous section. The main difference is that you need to create an output data set (FIT) that contains the predicted values (Pred). You then use the Pred variable instead of the X variable throughout the program, as follows:

/* 1. Compute predicted values */
proc reg data=Have noprint;
   model y = x;
   output out=Fit Pred=Pred;
quit;
 
/* 2. Assign each obs to a decile for Pred */
proc rank data=Fit out=Decile groups=10 ties=high;
   var Pred;        /* variable on which to group */
   ranks Decile;    /* name of variable to contain groups 0,1,...,k-1 */
run;
 
/* 3. For each decile, compute the mean of Pred and the mean of Y (with a CI for the mean of Y) */
proc means data=Decile noprint;
   class Decile;
   types Decile;
   var y Pred;
   output out=DecileOut mean=YMean PredMean 
          lclm=yLower uclm=yUpper;  /* only compute limits for Y */
run;
 
/* 4. Create an Observed vs Predicted plot and overlay the deciles. */
data All;
   set Fit DecileOut; 
run; 
 
title "Decile Plot";
title2 "Observed Response verus Predicted Response";
proc sgplot data=All noautolegend;
   lineparm x=0 y=0 slope=1 / lineattrs=(color=grey pattern=dash) clip; /* add identity line */
   scatter x=PredMean y=YMean / yerrorlower=yLower yerrorupper=yUpper 
           name="obs" legendlabel="Mean Y and 95% CI" markerattrs=GraphData2;
   fringe Pred;
   xaxis label="Predicted Response";
   yaxis label="Observed Response" offsetmin=0.05;
   keylegend "obs" / position=NW location=inside across=1;
run;

The decile plot includes a diagonal line, which is the line of perfect agreement between the model and the data. In a well-specified model, the 10 empirical means of the deciles should be close to the line, and they should vary randomly above and below the line. The decile plot for the misspecified model shows a systematic trend rather than random variation. For the lower 30% of the predicted values, the observed response is higher (on average) than the predictions. When the predicted values are in the range [29, 32] (the fourth through ninth deciles), the observed response is lower (on average) than the predictions. Thus, the decile plot indicates that the linear model might be missing an important nonlinear effect.

The loess alternative

A general principle in statistics is that you should not discretize a continuous variable unless absolutely necessary. You can often get better insight by analyzing the continuous variable. In accordance with that principle, this section shows how to identify misspecified models by using a loess smoother rather than by discretizing a continuous variable into deciles.

In addition to the "general principle," a decile plot requires three steps: compute the deciles, compute the means for the deciles, and merge the decile information with the original data so that you can plot it. As I discuss at the end of my previous article, it is often easier to overlay a loess smoother.

The following statements use the LOESS statement in PROC SGPLOT to add a loess smoother to a fit plot. If the loess fit is not approximately linear, it could indicate that the linear model is not appropriate.

title "Linear Regression Fit";
proc sgplot data=Have;
   reg x=x y=y / clm;              /* fit linear model y = b0 + b1*x */
   loess x=x y=y / nomarkers;      /* use loess fit instead of decile plot */
run;

This plot is the continuous analog to the first decile plot. It tells you the same information: For very small and very large values of X, the predicted values are too low. The loess curve suggests that the linear model is lacking an effect.

The previous plot is useful for single-variable regressions. But even if you have multiple regressors, you can add a loess smoother to the "observed versus predicted" plot. If the FIT data set contains the observed and predicted response, you can create the graph as follows:

title "Observed versus Predicted";
proc sgplot data=Fit;
   loess x=Pred y=y;
   lineparm x=0 y=0 slope=1 / lineattrs=(color=grey pattern=dash) clip legendlabel="Identity Line";
run;

This plot is the continuous analog to the second decile plot. The loess curve indicates that the observed response is higher than predicted for small and large predicted values. When the model predicts values of Y in the range [29.5, 32], the predicted values tend to be smaller than the observed values. This is the same information that the decile plot provides, but the loess curve is easier to use and gives more information because it does not discretize a continuous variable.

Finally, you can add a loess smoother to a residual plot by using an option in PROC REG. In a well-specified model, the residual plot should not show any systematic trends. You can use the PLOTS=RESIDUALS(SMOOTH) option on the PROC REG statement to add a loess smoother to the residual plot, as follows:

proc reg data=Have plots=residuals(smooth);
   model y = x;
   ods select ResidualPlot;
quit;

The loess smoother on the residual plot shows that the residuals seem to have a systematic component that is not captured by the linear model. When you see the loess fit, you might be motivated to fit models that are more complex. You can use this option even for models that have multiple explanatory variables.

Summary

In summary, this article shows how to create two decile plots for least-squares regression models in SAS. The first decile plot adds 10 empirical means to a fit plot. You can create this plot when you have a single continuous regressor. The second decile plot is applicable to models that have multiple continuous regressors. It adds 10 empirical means to the "observed versus predicted" plot. Lastly, although you can create decile plots in SAS, in many cases it is easier and more informative to use a loess curve to assess the model.

The post Decile plots in SAS appeared first on The DO Loop.

November 18, 2020
 
Predicted probabilities for a logistic regression model

To help visualize regression models, SAS provides the EFFECTPLOT statement in several regression procedures and in PROC PLM, which is a general-purpose procedure for post-fitting analysis of linear models. When scoring and visualizing a model, it is important to use reasonable combinations of the explanatory variables for the visualization. When the explanatory variables are correlated, the standard visualizations might evaluate the model at unreasonable (or even impossible!) combinations of the explanatory variables. This article describes how to generate points in an ellipsoid and score (evaluate) a regression model at those points. The ellipsoid contains typical combinations of the explanatory variables when the explanatory variables are multivariate normal. An example is shown in the graph to the right, which shows the predicted probabilities for a logistic regression model.

This technique can help prevent nonsensical predicted values. For example, in a previous article, I show an example of a quadratic regression surface that appears to predict negative values for a quantity that can never be negative (such as length, mass, or volume). The reason for the negative prediction is that the model is being evaluated at atypical points that are far from the explanatory values in the experiment.

Example: Visualizing the predicted probabilities for a logistic model

Let's look at an example of a regression model for which the regressors are correlated. Suppose you want to predict the gender of the 19 students in the Sashelp.Class data based on their height and weight. (Apologies to my international readers, but the measurements are in pounds and inches.) The following call to PROC LOGISTIC fits a logistic model to the data and uses the EFFECTPLOT statement to create a contour plot of the predicted probability that a student is female:

proc logistic data=Sashelp.Class;
   model Sex = Height Weight;
   effectplot contour / ilink;      /* contour plot of pred prob */
   store out=LogiModel;             /* store model for later */
run;

The contour plot shows the predicted probability of being female according to the model. The red regions have a high probability of being female and the blue regions have a low probability. For a given height, lighter students are more likely to be female.

This graph is a great summary of the model. However, because height and weight are correlated, the data values do not fill the rectangular region in the graph. There are no students in the upper-left corner near (height, weight)=(51, 150). Such a student would be severely obese. Similarly, there are no students in the lower-right corner near (height, weight)=(72, 50). Such a student would be severely underweight.

You can graph just the explanatory variables and overlay a 95% prediction ellipse for bivariate normal data, as shown below:

title "Weight versus Height";
proc sgplot data=Sashelp.Class;
   scatter x=Height y=Weight / jitter transparency=0.25 group=Sex markerattrs=(size=10 symbol=CircleFilled);
   ellipse x=Height y=Weight;    /* 95% prediction ellipse for bivariate normal data */
run;

The heights and weights of most students will fall into the prediction ellipse, assuming that the variables are multivariate normal with a mean vector that is close to (62.3, 100), which is the mean of the data.
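
If you want to verify the sample mean vector and covariance matrix that define this ellipse, one quick way (a check that is not part of the original analysis) is to use the COV option in PROC CORR:

/* optional check: display the sample mean vector and covariance matrix for Height and Weight */
proc corr data=Sashelp.Class cov;
   var Height Weight;
run;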

If you want to score the model at "representative" values of the data, it makes sense to generate points inside the ellipse. The following sections show how to generate a regular grid of values inside the ellipse and how to score the logistic model on those points.

Generate data in an ellipsoid

I can think of several ways to generate a grid of values inside the ellipse, but the easiest way is to use the Mahalanobis distance (MD). The Mahalanobis distance is a metric that accounts for the correlation in data. Accordingly, you can generate a regular grid of points and then keep only the points whose Mahalanobis distance to the mean vector is less than a certain cutoff value. In this section, I choose the cutoff value to be the maximum Mahalanobis distance in the original data. Alternatively, you can use probability theory and the fact that the distribution of the squared MD values is asymptotically chi-squared.
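
For example, if you prefer the probability-based cutoff, a minimal sketch (assuming bivariate data, so two degrees of freedom) is shown below. This is not the cutoff used in the rest of this article, which uses the maximum MD in the data.

/* sketch: choose the MD cutoff from the chi-square distribution (2 DF for bivariate data) */
proc iml;
prob = 0.95;                                  /* keep points inside an approximate 95% ellipse */
cutoff = sqrt( quantile("ChiSq", prob, 2) );  /* MD cutoff because squared MD is asymptotically chi-square(2) */
print cutoff;                                 /* about 2.45 */
quit;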

The following SAS/IML program reads the explanatory variables and computes the sample mean and covariance matrix. You can use the MAHALANOBIS function to compute the Mahalanobis distance between each observation and the sample mean. The maximum MD for these data is about 2.23. You can use the ExpandGrid function to generate a regular grid of points, then keep only those points whose MD is less than 2.23. The resulting points are inside an ellipse and represent "typical" values for the variables.

proc iml;
varNames = {'Height' 'Weight'};
use Sashelp.Class; read all var varNames into X;  close;
 
S = cov(X);
m = mean(X);
MD_Data = mahalanobis(X, m, S);
MaxMD = max(MD_Data);
print MaxMD;
 
minX = 50;  maxX = 72;           /* or use min(X[,1]) and max(X[,1]) */
minY = 50;  maxY = 150;          /* or use min(X[,2]) and max(X[,2]) */
xRng = minX:maxX;                /* step by 1 in X */
yRng = do(minY, maxY, 5);        /* step by 5 in Y */
 
u = ExpandGrid( xRng, yRng );    /* regular grid of points */
MD = mahalanobis(u, m, S);       /* MD to center for each point */
u = u[ loc(MD < MaxMD), ];       /* keep only points where MD < cutoff */
ods graphics / width=300px height=300px;
call scatter(u[,1], u[,2]);
 
create ScoreIn from u[c=varNames];   /* write score data set */
   append from u;
close;
QUIT;

Score data in an ellipsoid

You can use PROC PLM to score the logistic model at the points in the ellipsoid. In the earlier call to PROC LOGISTIC, I used the STORE statement to create an item store named LogiModel. The following call to PROC PLM scores the model and uses the ILINK option to output the predicted probabilities. You can use the HEATMAPPARM statement in PROC SGPLOT to visualize the predicted probabilities. So that the color ramp always visualizes probabilities on the interval [0,1], I add two fake observations (0 and 1) to the predicted values.

proc PLM restore=LogiModel noprint;
   score data=ScoreIn out=ScoreOut / ilink;      /* predicted probability */
run;
 
data Fake;           /* Optional trick: Add 0 and 1 so colorramp shows range [0,1] */
Predicted=0; output;
Predicted=1; output;
run;
 
data Score;          /* concatenate the real and fake data */
   set ScoreOut Fake;
run;
 
title "Predicted Values of Sex='F' for Logistic Model";
proc sgplot data=Score;
   styleattrs wallcolor=CxF8F8F8;
   heatmapparm  x=Height y=Weight colorresponse=Predicted;
   xaxis grid; yaxis grid;
run;

The graph is shown at the top of this article. It shows the predicted probabilities for representative combinations of the height and weight variables. Students whose measurements appear in the lower portion of the ellipse (red) are more likely to be female. Students in the upper portion of the ellipse (blue) are more likely to be male. It is not likely that students have measurements outside the ellipse, so the model is not evaluated outside the ellipse.

Summary

This article shows how you can use the Mahalanobis distance to generate scoring data that are in an ellipsoid that is centered at the sample mean vector. Such data are appropriate for scoring "typical" combinations of variables in data that are approximately multivariate normal. Although the example used a logistic regression model, the idea applies to any model that you want to score. This technique can help to prevent scoring the model at unrealistic values of the explanatory variables.

The post Create scoring data when regressors are correlated appeared first on The DO Loop.

November 9, 2020
 

Intuitively, the skewness of a unimodal distribution indicates whether a distribution is symmetric or not. If the right tail has more mass than the left tail, the distribution is "right skewed." If the left tail has more mass, the distribution is "left skewed." Thus, estimating skewness requires some estimates about the relative size of the left and right tails of a distribution.

The standard definition of skewness is called the moment coefficient of skewness because it is based on the third central moment. The moment coefficient of skewness is a biased estimator and is also not robust to outliers in the data. This article discusses an estimator proposed by Hogg (1974) that is robust and less biased. The ideas in this article are based on Bono, et al. (2020). Bono et al. provide a SAS macro that uses SAS/IML to compute the estimates. This article provides SAS/IML functions that compute the estimates more efficiently and more succinctly.

Early robust estimators of skewness

Hogg's estimate of skewness is merely one of many robust measures of skewness that have been proposed since the 1920s. I have written about the Bowley-Galton skewness, which is based on quantiles. The Bowley-Galton estimator defines the length of the right tails as Q3-Q2 and the length of the left tail as Q2-Q1. (Q1 is the first quartile, Q2 is the median, and Q3 is the third quartile.) It then defines the quantile skewness as half the relative difference between the lengths, which is BG = ((Q3-Q2)-(Q2-Q1)) / (Q3-Q1).

Although "the tail" does not have a universally accepted definition, some people would argue that a tail should include portions of the distribution that are more extreme than the Q1 and Q3 quartiles. You could generalize the Bowley-Galton estimator to use more extreme quantiles. For example, you could use the upper and lower 20th percentiles. If P20 is the 20th percentile and P80 is the 80th percentile, you can define an estimator to be ((P80-Q2)-(Q2-P20)) / (P80-P20). The choice of percentiles to use is somewhat arbitrary.

Hogg's estimator of skewness

Hogg's measure of skewness (1974, p. 918) is a ratio of the right tail length to the left tail length. The right tail length is estimated as U(0.05) – M25, where U(0.05) is the average of the largest 5% of the data and M25 is the 25% trimmed mean of the data. The left tail length is estimated as M25 – L(0.05), where L(0.05) is the average of the smallest 5% of the data. The Hogg estimator is
SkewH = (U(0.05) – M25) / (M25 - L(0.05))

Note that the 25% trimmed mean is the mean of the middle 50% of the data. So the estimator is based on estimating the means of various subsets of the data that are based on quantiles of the data.

Compute the mean of the lower and upper tails

I have written about how to visualize and compute the average value in a tail. However, I did not discuss how to make sense of phrases like "5% of the data" in a sample of size N when 0.05*N is not an integer. There are many ways to estimate quantiles, but Bono et al. (2020) use an interpolation method (QNTLDEF=1 in SAS) to compute a weighted mean of the data based on the estimated quantiles.

In Bono et al., the "mean of the lower p_th quantile" is computed from k=floor(p*N), which is the largest integer that does not exceed p*N, and the fractional remainder, r = p*N – k. After sorting the data from lowest to highest, you sum the first k values and a portion of the next value. You then compute the average as (sum(x[1:k]) + r*x[k+1]) / (k+r), as shown in the following program. The mean of the upper quantile is similar.

proc iml;
/* ASSUME x is sorted in ascending order.
   Compute the mean of the lower p_th percentile of the data.
   Use linear interpolation if p*N is not an integer. Assume 0<= p <= 1. 
*/
start meanLowerPctl(p, x);
   N = nrow(x);  
   k = floor(p*N);                 /* largest integer <= p*N */
   r = p*N - k;                    /* r >= 0 */
   if k<1 then m = x[1];           /* use x[1] as the mean */
   else if k>=N then m = mean(x);  /* use mean of all data */
   else do;                        /* use interpolation */
      sum = sum(x[1:k]) + r*x[k+1];
      m = sum / (k+r);
   end;
   return( m );
finish;
 
/* ASSUME x is sorted in ascending order.
   Compute the mean of the upper p_th percentile of the data.
*/
start meanUpperPctl(p, x);
   q = 1-p; 
   N = nrow(x); 
   k = ceil(q*N);                  /* smallest integer >= q*N */
   r = k - q*N;                    /* r >= 0 */
   if k<=1 then m = mean(x);       /* use mean of all data */
   else if k>=N then m = x[N];     /* use x[N] as the mean */
   else do;                        /* use interpolation */
      sum = sum(x[k+1:N]) + r*x[k];
      m = sum / (N-k+r);
   end;
   return( m );
finish;

Let's run a small example so that we can check that the functions work. The following statements define an ordered sample of size N=10 and compute the mean of the lower and upper p_th quantiles for p=0.1, 0.2, ..., 1.

x = {2, 4, 5, 7, 8, 8, 9, 9, 12, 16};  /* NOTE: sorted in increasing order */
obsNum = T(1:nrow(x));
Pctl = obsNum / nrow(x);
meanLow = j(nrow(Pctl), 1, .);
meanUp = j(nrow(Pctl), 1, .);
do i = 1 to nrow(Pctl);
   meanLow[i] = meanLowerPctl(Pctl[i], x);
   meanUp[i] = meanUpperPctl(Pctl[i], x);
end;
print Pctl meanLow meanUp;

Because there are 10 observations and the program uses deciles to define the tails, each row gives the mean of the smallest and largest k observations for k=1, 2, ..., 10. For example, the second row indicates that the mean of the two smallest values is 3, and the mean of the two largest values is 14.

If you plot the estimated means as a function of the percentile, you can see how the functions interpolate between the data values. The following statements compute and graph the mean of the lower p_th quantile as a function of p. The graph of the mean of the upper tail is computed in an analogous fashion:

/* plot the mean of the lower tail as a function of p */
p = do(0, 1, 0.01);
mLow = j(1, ncol(p), .);
do i = 1 to ncol(p);
   mLow[i] = meanLowerPctl(p[i], x);
end;
 
title "Mean of Lower p Percentile of Data";
call series(p, mLow) grid={x y} xvalues=do(0,1,0.1) label={"Percentile" "Mean"};

Compute Hogg's skewness in SAS

With these functions, you can compute Hogg's estimate of skewness. The following SAS/IML function implements the formula. The function is then tested on a large random sample from the exponential distribution.

/* estimate of Hogg skewness */
start HoggSkew(_x);
   x = _x;
   call sort(x);
   L05 = meanLowerPctl(0.05, x);
   U05 = meanUpperPctl(0.05, x);
   m25 = mean(x, 'trim', 0.25);
   Skew_Hogg = (U05 - m25) / (m25 - L05); /* Right Skewness: Hogg (1974, p. 918) */
   return Skew_Hogg;
finish;
 
/* generate random sample from the exponential distribution */
call randseed(12345, 1);
N = 1E5;
x = randfun(N, "Expon");
Skew_Hogg = HoggSkew(x);  /* find the Hogg skewness for exponential data */
print Skew_Hogg;

The sample estimate of Hogg's skewness for these data is 4.558. The value of Hogg's skewness for the exponential distribution is 4.569, so there is close agreement between the parameter and its estimate from the sample.

Compute Hogg's kurtosis in SAS

In the same paper, Hogg (1974) proposed a robust measure of kurtosis, which is
KurtH = (U(0.2) – L(0.2)) / (U(0.5) – L(0.5))
where U(0.2) and L(0.2) are the means of the upper and lower 20% of the data (respectively) and U(0.5) and L(0.5) are the means of the upper and lower 50% of the data (respectively). You can use the same SAS/IML functions to compute Hogg's kurtosis for the data:

/* estimate of Hogg kurtosis */
start HoggKurt(_x);
   x = _x;
   call sort(x);
   L20 = meanLowerPctl(0.20, x);
   U20 = meanUpperPctl(0.20, x);
   L50 = meanLowerPctl(0.50, x);
   U50 = meanUpperPctl(0.50, x);
   Kurt_Hogg = (U20 - L20) / (U50 - L50); /* Kurtosis: Hogg (1974, p. 913) */
   return Kurt_Hogg;
finish;
 
Kurt_Hogg = HoggKurt(x);  /* find the Hogg kurtosis for exponential data */
print Kurt_Hogg;

For comparison, the exact value of the Hogg kurtosis for the exponential distribution is 1.805.

Summary

This article shows how to compute Hogg's robust measures of skewness and kurtosis. The article was inspired by Bono et al. (2020), who explore the bias and accuracy of Hogg's measures of skewness and kurtosis as compared to the usual moment-based skewness and kurtosis. They conclude that Hogg’s estimators are less biased and more accurate. Bono et al. provide a SAS macro that computes Hogg's estimators. In this article, I rewrote the computations as SAS/IML functions to make them more efficient and more understandable. These functions enable researchers to compute Hogg's skewness and kurtosis for data analysis and in simulation studies.

You can download the complete SAS program that defines the Hogg measures of skewness and kurtosis. The program includes computations that show how to compute these values for a distribution.

References

The post Robust statistics for skewness and kurtosis appeared first on The DO Loop.

November 4, 2020
 

The expected value of a random variable is essentially a weighted mean over all possible values. You can compute it by summing (or integrating) a probability-weighted quantity over all possible values of the random variable. The expected value is a measure of the "center" of a probability distribution.
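
In symbols, for a continuous random variable with density f, this weighted mean is
\( E[X] = \int_{-\infty}^{\infty} x \, f(x) \,dx \)
For example, for the standard exponential distribution, \( E[X] = \int_0^{\infty} x e^{-x} \,dx = 1 \).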

Expected value for the tail of a distribution

You can generalize this idea. Instead of using the entire distribution, suppose you want to find the "center" of only a portion of the distribution. For the central portion of a distribution, this process is similar to the trimmed mean. (The trimmed mean discards data in the tails of a distribution and averages the remaining values.) For the tails of a distribution, a natural way to compute the expected value is to sum (or integrate) the weighted quantity x*pdf(x) over the tail of the distribution. The graph to the right illustrates this idea for the exponential distribution. The left and right tails (defined by the 20th and 80th percentiles, respectively) are shaded and the expected value in each tail is shown by using a vertical reference line.

This article shows how to compute the expected value of the tail of a probability distribution. It also shows how to estimate this quantity for a data sample.

Why is this important? Well, this idea appears in some papers about robust and unbiased estimates of the skewness and kurtosis of a distribution (Hogg, 1974; Bono, et al., 2020). In estimating skewness and kurtosis, it is important to estimate the length of the tails of a distribution. Because the tails of many distributions are infinite in length, you need some alternative definition of "tail length" that leads to finite quantities. One approach is to truncate the tails, such as at the 5th percentile on the left and the 95th percentile on the right. An alternative approach is to use the expected value of the tails, as shown in this article.

The expected value in a tail

Suppose you have any distribution with density function f. You can define the tail distribution as a truncated distribution on the interval (a,b), where possibly a = -∞ or b = ∞. To get a proper density, you need to divide by the area of the tail, as follows:
\( g(x) = f(x) / \int_a^b f(x) \,dx \)
If F(x) is the cumulative distribution, the denominator is simply the expression F(b) – F(a). Therefore, the expected value for the truncated distribution on (a,b) is
\( EV = \int_a^b x g(x) \,dx = (\int_a^b x f(x) \,dx) / (F(b) - F(a)) \)

There is no standard definition for the "tail" of a distribution, but one definition is to use symmetric quantiles of the distribution to define the tails. For a probability, p, you can define the left tail to be the portion of the distribution below the pth quantile and the right tail to be the portion above the (1-p)th quantile. If you let qL be the pth quantile and qU be the (1-p)th quantile, then the expected value of the left tail is
\( E_L = (\int_{-\infty}^{q_L} x f(x) \,dx) / (F(q_L) - 0) \)
and the expected value of the right tail is
\( E_R = (\int_{q_U}^{\infty} x f(x) \,dx) / (1 - F(q_U)) \)

The expected value in the tail of the exponential distribution

For an example, let's look at the exponential distribution. The exponential distribution is defined only for x ≥ 0, so the left tail starts at 0. The choice of the probability, p, is arbitrary, but I will use p=0.2 because that value is used in Bono, et al. (2020). The 20th percentile of the exponential distribution is q20 = 0.22. The 80th percentile is q80 = 1.61. You can use the QUAD function in the SAS/IML language to compute the integrals, as follows:

proc iml;
/* find the expected value of the truncated exponential distribution on the interval [a,b] 
*/
start Integrand(x);
   return x*pdf("Expon",x);
finish;
 
/* if f(x) is the PDF and F(x) is the CDF of a distribution, the expected value on [a,b] is
   (\int_a^b  x*f(x) dx) / (CDF(b) - CDF(a))
*/
start Expo1Moment(a,b);
   call quad(numer, "Integrand", a||b );
   /* treat a missing lower limit as -infinity (CDF=0) and a missing upper limit as +infinity (CDF=1) */
   cdf_a = choose(missing(a), 0, cdf("Expon", a));  /* CDF(a) */
   cdf_b = choose(missing(b), 1, cdf("Expon", b));  /* CDF(b) */
   ExpectedValue = numer / (cdf_b - cdf_a);
   return ExpectedValue;
finish;
 
/* expected value of lower 20th percentile of Expon distribution */
p = 0.2;
qLow = quantile("Expon", p);
ExpValLow20 = Expo1Moment(0, qLow);
print qLow ExpValLow20;
 
/* expected value of upper 20th percentile */
qHi = quantile("Expon", 1-p);
ExpValUp20 = Expo1Moment(qHi, .I); /* .I = infinity */
print qHi ExpValUp20;

In this program, the left tail is the portion of the distribution to the left of the 20th percentile. The right tail is to the right of the 80th percentile. The first table says that the expected value in the left tail of the exponential distribution is 0.107. Intuitively, that is the weighted average of the left tail or the location of its center of mass. The second table says that the expected value in the right tail is 0.209. These results are visualized in the graph at the top of this article.

Estimate the expected value in the tail of a distribution

You can perform a similar computation for a data sample. Instead of an integral, you merely take the average of the lower and upper p*100% of the data. For example, the following SAS/IML statements simulate 100,000 random variates from the exponential distribution. You can use the QNTL function to estimate the quantiles of the data. You can then use the LOC function to find the elements that are in the tail and use the MEAN function to compute the arithmetic average of those elements:

/* generate data from the exponential distribution */
call randseed(12345);
N = 1e5;
x = randfun(N, "Expon");
 
/* assuming no duplicates and a large sample, you can 
   use quantiles and means to estimate the expected values */
/* estimate the expected value for the lower 20% tail */
call qntl(qEst, x, p);      /* p_th quantile */
idx = loc(x<=qEst);         /* rows for which x[i]<= quantile */
meanLow20 = mean(x[idx]);   /* mean of lower tail */
print qEst meanLow20;
 
/* estimate the expected value for the upper 20% tail */
call qntl(qEst, x, 1-p);    /* (1-p)_th quantile */
idx = loc(x>=qEst);         /* rows for which x[i]>= quantile */
meanUp20 = mean(x[idx]);    /* mean of upper tail */
print qEst meanUp20;

The estimates are shown for the lower and upper 20th-percentile tails of the data. Because the sample is large, the sample estimates are close to the quantiles of the exponential distribution. Also, the means of the lower and upper tails of the data are close to the expected values for the tails of the distribution.

It is worth mentioning that there are many different formulas for estimating quantiles. Each estimator gives a slightly different estimate of the quantile, and therefore you can get different estimates for the mean. This becomes important for small samples, for long-tailed distributions, and for samples that have duplicate values.
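
For example, the following sketch (which uses a small made-up sample) shows how the PCTLDEF= option in PROC UNIVARIATE, which corresponds to the QNTLDEF= option mentioned earlier, changes the estimates of the 20th and 80th percentiles:

/* sketch: different percentile definitions (PCTLDEF=1-5) give slightly different estimates */
data Small;
   input x @@;
   datalines;
2 4 5 7 8 8 9 9 12 16
;
 
proc univariate data=Small pctldef=1 noprint;   /* weighted-average (interpolation) definition */
   var x;
   output out=Pctl1 pctlpre=P_ pctlpts=20 80;
run;
proc univariate data=Small pctldef=5 noprint;   /* default definition */
   var x;
   output out=Pctl5 pctlpre=P_ pctlpts=20 80;
run;
 
proc print data=Pctl1 noobs; run;
proc print data=Pctl5 noobs; run;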

Summary

The expected value is a measure of the "center" of a probability distribution on some domain. This article shows how to solve an integral to find the expected value for the left tail or right tail of a distribution. For a data distribution, the expected value is the mean of the observations in the tail. In a future article, I'll show how to use these ideas to create robust and unbiased estimates of skewness and kurtosis.

The post The expected value of the tail of a distribution appeared first on The DO Loop.

October 28, 2020
 

The skewness of a distribution indicates whether a distribution is symmetric or not. The Wikipedia article about skewness discusses two common definitions for the sample skewness, including the definition used by SAS. In the middle of the article, you will discover the following sentence: "In general, the [estimators] are both biased estimators of the population skewness." The article goes on to say that the estimators are not biased for symmetric distributions. Similar statements are true for the sample kurtosis.

This statement might initially surprise you. After all, the statistics that we use to estimate the mean and standard deviation are unbiased. Although biased estimates are not inherently "bad," it is useful to get an intuitive feel for how biased an estimator might be.

Let's demonstrate the bias in the skewness statistic by running a Monte Carlo simulation. Choose an unsymmetric univariate distribution for which the population skewness is known. For example, the exponential distribution has skewness equal to 2. Then do the following:

  1. Choose a sample size N. Generate B random samples from the chosen distribution.
  2. Compute the sample skewness for each sample.
  3. Compute the average of the skewness values, which is the Monte Carlo estimate of the sample skewness. The difference between the Monte Carlo estimate and the parameter value is an estimate of the bias of the statistic.

In this article, I will generate B=10,000 random samples of size N=100 from the exponential distribution. The simulation shows that the expected value of the skewness is NOT close to the population parameter. Hence, the skewness statistic is biased.

A Monte Carlo simulation of skewness

The following DATA step simulates B random samples of size N from the exponential distribution. The call to PROC MEANS computes the sample skewness for each sample. The call to PROC SGPLOT displays the approximate sampling distribution of the skewness. The graph overlays a vertical reference line at 2, which is the skewness parameter for the exponential distribution, and also overlays a reference line at the Monte Carlo estimate of the expected value.

%let NumSamples = 10000;
%let N = 100;
/* 1. Simulate B random samples from exponential distribution */
data Exp;
call streaminit(1);
do SampleID = 1 to &NumSamples;
   do i = 1 to &N;
      x = rand("Expo");
      output;
   end;
end;
run;
 
/* 2. Estimate skewness (and other stats) for each sample */
proc means data=Exp noprint;
   by SampleID;
   var x;
   output out=MCEst mean=Mean stddev=StdDev skew=Skewness kurt=Kurtosis;
run;
 
/* 3. Graph the sampling distribution and overlay parameter value */
title "Monte Carlo Distribution of Sample Skewness";
title2 "N = &N; B = &NumSamples";
proc sgplot data=MCEst;
   histogram Skewness;
   refline 2 / axis=x lineattrs=(thickness=3 color=DarkRed) 
             labelattrs=(color=DarkRed) label="Parameter";
   refline 1.818 / axis=x lineattrs=(thickness=3 color=DarkBlue) label="Monte Carlo Estimate"
             labelattrs=(color=DarkBlue) labelloc=inside ;
run;
 
/* 4. Display the Monte Carlo estimate of the statistics */ 
proc means data=MCEst ndec=3 mean stddev;
   var Mean StdDev Skewness Kurtosis;
run;
Monte Carlo distribution of skewness statistic (B=10000, N=100)

For the exponential distribution, the skewness parameter has the value 2. However, according to the Monte Carlo simulation, the expected value of the sample skewness is about 1.82 for these samples of size 100. Thus, the bias is approximately 0.18, which is about 9% of the true value.

The kurtosis statistic is also biased. The output from PROC MEANS includes the Monte Carlo estimates for the expected value of the sample mean, standard deviation, skewness, and (excess) kurtosis. For the exponential distribution, the parameter values are 1, 1, 2, and 6, respectively. The Monte Carlo estimates for the sample mean and standard deviation are close to the parameter values because these are unbiased estimators. However, the estimates for the skewness and kurtosis are biased towards zero.

Summary

This article uses Monte Carlo simulation to demonstrate bias in the commonly used definitions of skewness and kurtosis. For skewed distributions, the expected value of the sample skewness is biased towards zero. The bias is greater for highly skewed distributions. The skewness statistic for a symmetric distribution is unbiased.

The post The sample skewness is a biased statistic appeared first on The DO Loop.

October 26, 2020
 

A fundamental principle of data analysis is that a statistic is an estimate of a parameter for the population. A statistic is calculated from a random sample. This leads to uncertainty in the estimate: a different random sample would have produced a different statistic. To quantify the uncertainty, SAS procedures often support options that estimate standard errors for statistics and confidence intervals for parameters.

Of course, if statistics have uncertainty, so, too, do functions of the statistics. For complicated functions of the statistics, the bootstrap method might be the only viable technique for quantifying the uncertainty.

Graphical comparison of two methods for estimating confidence intervals of eigenvalues of a correlation matrix

This article shows how to obtain confidence intervals for the eigenvalues of a correlation matrix. The eigenvalues are complicated functions of the correlation estimates. The eigenvalues are used in a principal component analysis (PCA) to decide how many components to keep in a dimensionality reduction. There are two main methods to estimate confidence intervals for eigenvalues: an asymptotic (large sample) method, which assumes that the eigenvalues are multivariate normal, and a bootstrap method, which makes minimal distributional assumptions.

The following sections show how to compute each method. A graph of the results is shown to the right. For the data in this article, the bootstrap method generates confidence intervals that are more accurate than the asymptotic method.

This article was inspired by Larsen and Warne (2010), "Estimating confidence intervals for eigenvalues in exploratory factor analysis." Larsen and Warne discuss why confidence intervals can be useful when deciding how many principal components to keep.

Eigenvalues in principal component analysis

To demonstrate the techniques, let's perform a principal component analysis (PCA) on the four continuous variables in the Fisher Iris data. In SAS, you can use PROC PRINCOMP to perform a PCA, as follows:

%let DSName = Sashelp.Iris;
%let VarList = SepalLength SepalWidth PetalLength PetalWidth;
 
/* 1. compute value of the statistic on original data */
proc princomp data=&DSName STD plots=none;     /* stdize PC scores to unit variance */
   var &VarList;
   ods select ScreePlot Eigenvalues NObsNVar;
   ods output Eigenvalues=EV0 NObsNVar=NObs(where=(Description="Observations"));
run;
 
proc sql noprint;                              
 select nValue1 into :N from NObs;   /* put the number of obs into macro variable, N */
quit;
%put &=N;
Output from PROC PRINCOMP that shows eigenvalues of the correlation matrix

The first table shows that there are 150 observations. The second table displays the eigenvalues for the sample, which are 2.9, 0.9, 0.15, and 0.02. If you want a graph of the eigenvalues, you can use the PLOTS(ONLY)=SCREE option on the PROC PRINCOMP statement. The ODS output statement creates SAS data set from the tables. The PROC SQL call creates a macro variable, N, that contains the number of observations.
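
As an aside, a minimal sketch of the scree-plot request mentioned above looks like this:

/* sketch: request only the scree plot of the eigenvalues */
proc princomp data=&DSName plots(only)=scree;
   var &VarList;
run;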

Asymptotic confidence intervals

If a sample size, n, is large enough, the sampling distribution of the eigenvalues is approximately multivariate normal (Larsen and Warne (2010, p. 873)). If g is an eigenvalue for a correlation matrix, then an asymptotic confidence interval is
     g ± z* sqrt( 2 g^2 / n )
where z* is the standard normal quantile, as computed in the following program:

/* Asymptotic CIs for eigenvalues (Larsen and Warne (2010, p. 873) */
data AsympCI;
set EV0;
alpha = 0.05;
z = quantile("Normal", 1 - alpha/2);  /* = 1.96 */
SE = sqrt(2*Eigenvalue**2 / &N);
Normal_LCL = Eigenvalue - z*SE;       /* g +/- z*sqrt(2 g^2 / n) */
Normal_UCL = Eigenvalue + z*SE;
drop alpha z SE;
run;
 
proc print data=AsympCI noobs;
   var Number Eigenvalue Normal_LCL Normal_UCL;
run;
Asymptotic (normal) confidence intervals for the eigenvalues of the correlation matrix

The lower and upper confidence limits are shown for each eigenvalue. The advantage of this method is its simplicity. The intervals assume that the distribution of the eigenvalues is multivariate normal, which will occur when the sample size is very large. Since N=150 does not seem "very large," it is not clear whether these confidence intervals are valid. Therefore, let's estimate the confidence intervals by using the bootstrap method and compare the bootstrap intervals to the asymptotic intervals.

Bootstrap confidence intervals for eigenvalues

The bootstrap computations in this section follow the strategy outlined in the article "Compute a bootstrap confidence interval in SAS." (For additional bootstrap tips, see "The essential guide to bootstrapping in SAS.") The main steps are:

  1. Compute a statistic for the original data. This was done in the previous section.
  2. Use PROC SURVEYSELECT to resample (with replacement) B times from the data. For efficiency, put all B random bootstrap samples into a single data set.
  3. Use BY-group processing to compute the statistic of interest on each bootstrap sample. The union of the statistics is the bootstrap distribution, which approximates the sampling distribution of the statistic. Remember to turn off ODS when you run BY-group processing!
  4. Use the bootstrap distribution to obtain confidence intervals for the parameters.

The steps are implemented in the following SAS program:

/* 2. Generate many bootstrap samples */
%let NumSamples = 5000;       /* number of bootstrap resamples */
proc surveyselect data=&DSName NOPRINT seed=12345
     out=BootSamp(rename=(Replicate=SampleID))
     method=urs              /* resample with replacement */
     samprate=1              /* each bootstrap sample has N observations */
     /* OUTHITS                 option to suppress the frequency var */
     reps=&NumSamples;       /* generate NumSamples bootstrap resamples */
run;
 
/* 3. Compute the statistic for each bootstrap sample */
/* Suppress output during this step:
   https://blogs.sas.com/content/iml/2013/05/24/turn-off-ods-for-simulations.html
*/
%macro ODSOff();   ods graphics off; ods exclude all;  ods noresults;   %mend;
%macro ODSOn();    ods graphics on;  ods exclude none; ods results;     %mend;
 
%ODSOff;
proc princomp data=BootSamp STD plots=none;
   by SampleID;
   freq NumberHits;
   var &VarList;
   ods output Eigenvalues=EV(keep=SampleID Number Eigenvalue);
run;
%ODSOn;
 
/* 4. Estimate 95% confidence interval as the 2.5th through 97.5th percentiles of the bootstrap distribution */
proc univariate data=EV noprint;
   class Number;
   var EigenValue;
   output out=BootCI pctlpre=Boot_ pctlpts=2.5  97.5  pctlname=LCL UCL;
run;
 
/* merge the bootstrap CIs with the normal CIs for comparison */
data AllCI;
   merge AsympCI(keep=Number Eigenvalue Normal:) BootCI(keep=Number Boot:);
   by Number;
run;
 
proc print data=AllCI noobs;
   format Eigenvalue Normal: Boot: 5.3;
run;
Comparison of normal and bootstrap confidence intervals for the eigenvalues

The table displays the bootstrap confidence intervals (columns 5 and 6) next to the asymptotic confidence intervals (columns 3 and 4). It is easier to compare the intervals if you visualize them graphically, as follows:

/* convert from wide to long */
data CIPlot;
   set AllCI;
   Method = "Normal   "; LCL = Normal_LCL; UCL = Normal_UCL; output;
   Method = "Bootstrap"; LCL = Boot_LCL; UCL = Boot_UCL; output;
   keep Method Eigenvalue Number LCL UCL;
run;
 
title "Comparison of Normal and Bootstrap Confidence Intervals";
title2 "Eigenvalues of the Correlation Matrix for the Iris Data";
ods graphics / width=480px height=360px;
proc sgplot data=CIPlot;
   scatter x=Eigenvalue y=Number / group=Method clusterwidth=0.4
           xerrorlower=LCL xerrorupper=UCL groupdisplay=cluster;
   yaxis grid type=discrete colorbands=even colorbandsattrs=(color=gray transparency=0.9);
   xaxis grid;
run;

The graph is shown at the top of this article. The graph nicely summarizes the comparison. For the first (largest) eigenvalue, the bootstrap confidence interval is about half as wide as the normal confidence interval. Thus, the asymptotic result seems too wide for these data. For the other eigenvalues, the normal confidence intervals appear to be too narrow.

If you graph the bootstrap distribution, you can see that the bootstrap distribution does not appear to be multivariate normal. This presumably explains why the asymptotic intervals are so different from the bootstrap intervals. For completeness, the following graph shows a matrix of scatter plots and marginal histograms for the bootstrap distribution. The histograms indicate skewness in the bootstrap distribution.
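
A minimal sketch of one way to create such a graph follows. (This is not necessarily how the original figure was produced; it assumes the EV data set that was created in Step 3.)

/* sketch: reshape the bootstrap eigenvalues from long to wide, then plot the bootstrap distribution */
proc transpose data=EV out=EVWide(drop=_NAME_) prefix=EV;
   by SampleID;
   id Number;
   var Eigenvalue;
run;
 
proc sgscatter data=EVWide;
   matrix EV1-EV4 / diagonal=(histogram);
run;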

Bootstrap distribution for eigenvalues (5000 samples)

Summary

This article shows how to compute confidence intervals for the eigenvalues of an estimated correlation matrix. The first method uses a formula that is valid when the sampling distribution of the eigenvalues is multivariate normal. The second method uses bootstrapping to approximate the distribution of the eigenvalues, then uses percentiles of the distribution to estimate the confidence intervals. For the Iris data, the bootstrap confidence intervals are substantially different from the asymptotic formula.

The post Confidence intervals for eigenvalues of a correlation matrix appeared first on The DO Loop.

September 10, 2020
 

I previously wrote about the RAS algorithm, which is a simple algorithm that performs matrix balancing. Matrix balancing refers to adjusting the cells of a frequency table to match known values of the row and column sums. Ideally, the balanced matrix will reflect the structural relationships in the original matrix. An alternative method is the iterative proportional fitting (IPF) algorithm, which is implemented in the IPF subroutine in SAS/IML. The IPF method can balance n-way tables, n ≥ 2.

The IPF function is a statistical modeling method. It computes maximum likelihood estimates for a hierarchical log-linear model of the counts as a function of the categorical variables. It is more complicated to use than the RAS algorithm, but it is also more powerful.

This article shows how to use the IPF subroutine for the two-way frequency table from my previous article. The data are for six girl scouts who sold seven types of cookies. We know the row totals for the Type variable (the boxes sold for each cookie type) and the column totals for the Person variable (the boxes sold by each girl). But we do not know the values of the individual cells, which are the Type-by-Person totals. A matrix balancing algorithm takes any guess for the table and maps it into a table that has the observed row and column sums.

Construct a table that has the observed marginal totals

As mentioned in the previous article, there are an infinite number of matrices that have the observed row and column totals. Before we discuss the IPF algorithm, I want to point out that it is easy to compute one particular balanced matrix. The special matrix is the one that assumes no association (independence) between the Type and Person variables. That matrix is formed by the outer product of the marginal totals:
     B[i,j] = (sum of row i)(sum of col j) / (sum of all counts)

It is simple to compute the "no-association" matrix in the SAS/IML language:

proc iml;
/* Known quantities: row totals and column totals */
u = {260, 214, 178, 148, 75, 67, 59}; /* total boxes sold, by type */
v = {272 180 152 163 134 100};        /* total boxes sold, by girl */
 
/* Compute the INDEPENDENT table, which is formed by the outer product
   of the marginal totals:
   B[i,j] = (sum of row i)(sum of col j) / (sum of all counts) */
B = u*v / sum(u);
 
Type   = 'Cookie1':'Cookie7';                  /* row labels    */
Person = 'Girl1':'Girl6';                      /* column labels */
print B[r=Type c=Person format=7.1 L='Independent Table'];

For this balanced matrix, there is no association between the row and column variables (by design). This matrix is the "expected count" matrix that is used in a chi-square test for independence between the Type and Person variables.

Because the cookie types are listed in order from most popular to least popular, the values in the balanced matrix (B) decrease down each column. If you order the girls according to how many boxes they sold (you need to exchange the third and fourth columns), then the values in the balanced matrix would decrease across each row.
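
For example, a small sketch (continuing the same PROC IML session) that exchanges the third and fourth columns looks like this:

/* sketch: reorder the girls (columns) by total boxes sold */
idx = {1 2 4 3 5 6};                 /* exchange the third and fourth columns */
B2 = B[ , idx];
Person2 = Person[ , idx];
print B2[r=Type c=Person2 format=7.1 L='Reordered Independent Table'];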

Use the IPF routine to balance a matrix

The documentation for the IPF subroutine states that you need to specify the following input arguments. For clarity, I will only discuss two-way r x c tables.

  • dim: The vector of dimensions of the table: dim = {r c};
  • table: This r x c matrix is used to specify the row and column totals. You can specify any matrix; only the marginal totals of the matrix are used. For example, for a two-way table, you can use the "no association" matrix from the previous section.
  • config is an input matrix whose columns specify which marginal totals to use to fit the counts. Because the model is hierarchical, all subsets of specified marginals are included in the model. For a two-way table, the most common choices are the main effect models such as config={1} (model count by first variable only), config={2} (model count by second variable only), and config={1 2} (model count by using both variables). It is possible to fit the "fully saturated model" (config={1, 2}), but that model is overparameterized and results in a perfect fit, if the IPF algorithm converges.
  • initTable (optional): The elements of this matrix will be adjusted to match the observed marginal totals. In practice, this matrix is often based on a survey and you want to adjust the counts by using known marginal totals from a recent census of the population. Or it might be an educated guess, as in the Girl Scout cookie example. If you do not specify this matrix, a matrix of all 1s is used, which implies that you think all cell entries are approximately equal.

The following SAS/IML program defines the parameters for the IPF call.

/* Specify structure between rows and columns by asking each girl 
   to estimate how many boxes of each type she sold. This is the 
   matrix whose cells will be adjusted to match the marginal totals. */
/*   G1 G2 G3 G4 G5 G6          */
A = {75 45 40 40 40 30 ,  /* C1 */
     40 35 45 35 30 30 ,  /* C2 */
     40 25 30 40 30 20 ,  /* C3 */
     40 25 25 20 20 20 ,  /* C4 */
     30 25  0 10 10  0 ,  /* C5 */
     20 10 10 10 10  0 ,  /* C6 */
     20 10  0 10  0  0 }; /* C7 */
 
Dim = ncol(A) || nrow(A);  /* dimensions of r x c table */
Config = {1 2};            /* model main effects: counts depend on Person and Type */
call IPF(FitA, Status,     /* output values */
         Dim, B,           /* use the "no association" matrix to specify marginals */
         Config, A);       /* use best-guess matrix to specify structure */
print FitA[r=Type c=Person format=7.2],     /* the IPF call returns the FitA matrix and Status vector */
      Status[c={'rc' 'Max Diff' 'Iters'}];

The IPF call returns two quantities. The first output is the balanced matrix. For this example, the balanced matrix (FitA) is identical to the result from the RAS algorithm. It has a similar structure to the "best guess" matrix (A). For example, notice that the cells in the original matrix that are zero are also zero in the balanced matrix.

The second output is the Status vector. The status vector provides information about the convergence of the IPF algorithm:

  • If the first element is zero, the IPF algorithm converged.
  • The second element gives the maximum difference between the marginal totals of the balanced matrix (FitA) and the observed marginal totals.
  • The third element gives the number of iterations of the IPF algorithm.
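
For example, a minimal check of the Status vector (a sketch that continues the same session) might look like this:

/* sketch: check whether the IPF algorithm converged */
if Status[1] = 0 then
   print "IPF converged", (Status[3])[L="Iterations"];
else
   print "WARNING: IPF did not converge", Status;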

Other models

The IPF subroutine is more powerful than the RAS algorithm because you can fit n-way tables and because you can model the table in several natural ways. For example, the model in the previous section assumes that the counts depend on the salesgirl and on the cookie type. You can also model the situation where the counts depend only on one of the factors and not on the other:

/* counts depend only on Person, not on cookie Type */
Config = {1};   
call IPF(FitA1, Status, Dim, B, Config, A);
print FitA1[r=Type c=Person format=7.2], Status;
 
/* counts depend only on cookie Type, not on Person */
Config = {2};   
call IPF(FitA2, Status, Dim, B, Config, A);
print FitA2[r=Type c=Person format=7.2], Status;

Summary

In summary, the SAS/IML language provides a built-in function (CALL IPF) for matrix balancing of frequency tables. The routine incorporates statistical modeling, and you can specify which variables to use to model the cell counts. The documentation includes several examples that demonstrate using the IPF algorithm to balance n-way tables, n ≥ 2.

The post Iterative proportional fitting in SAS appeared first on The DO Loop.

8月 102020
 

A common operation in statistical data analysis is to center and scale a numerical variable. This operation is conceptually easy: you subtract the mean of the variable and divide by the variable's standard deviation. Recently, I wanted to perform a slight variation of the usual standardization:

  • Perform a different standardization for each level of a grouping variable.
  • Instead of using the sample mean and sample standard deviation, I wanted to center and scale the data by using values that I specify.

Although you can write SAS DATA step code to center and scale the data, PROC STDIZE in SAS contains options that handle both these cases.

Simulate data from two groups

Let's start by creating some example data. The following DATA step simulates samples drawn from two groups. The distribution of the first group is N(mu1, sigma) and the distribution of the second group is N(mu2, sigma). The first sample size is n1=20; the second is n2=30. Notice that the data are ordered by the Group variable.

%let mu1 = 15;
%let mu2 = 16;
%let sigma = 5;
/* Create two samples N(mu1, sigma) and N(mu2, sigma), with samples sizes n1 and n2, respectively */
data TwoSample;
call streaminit(54321);
n1 = 20; n2 = 30;
Group = 1;
do i = 1 to n1;
   x = round(rand('Normal', &mu1, &sigma), 0.01);  output;
end;
Group = 2;
do i = 1 to n2;
   x = round(rand('Normal', &mu2, &sigma), 0.01);  output;
end;
keep Group x;
run;
 
ods graphics / width=500px height=360px;
title "Normal Data";
title2 "N(&mu1, &sigma) and N(&mu2, &sigma)";
proc sgpanel data=TwoSample noautolegend;
   panelby Group / columns=1;
   histogram x;
   density x / type=normal;
   colaxis grid;
run;

The graph shows a comparative histogram of the distribution of each group and overlays the density of the normal curve that best fits the data. You can run PROC MEANS to calculate that the estimates for the first group are mean=14.9 and StdDev=5.64. For the second group, the estimates are mean=15.15 and StdDev=4.27.
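
If you want to reproduce those estimates, a call such as the following (a sketch; the data set and variable names are from the DATA step above) displays the sample mean and standard deviation for each group:

proc means data=TwoSample Mean StdDev ndec=2;
   class Group;
   var x;
run;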

The "standard" standardization

Typically, you standardize data by using the sample mean and the sample standard deviation. You can do this by using PROC STDIZE and specifying the METHOD=STD option (which is the default). You can use the BY statement to apply the standardization separately to each group. The following statements standardize the data in each group by using the sample statistics:

/* use PROC STDIZE to standardize by the sample means and sample std dev */
proc stdize data=TwoSample out=ScaleSample method=std;
   by Group;
   var x;
run;
 
proc means data=ScaleSample Mean StdDev Min Max ndec=2; 
   class Group;
   var x;
run;

As expected, each sample now has mean=0 and unit standard deviation. Many multivariate analyses center and scale data in order to adjust for the fact that different variables are measured on different scales. PROC STDIZE has many other options for standardizing data.
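
For example, the following call (a sketch of one of those other options; the output data set name is arbitrary) uses METHOD=RANGE to rescale the data in each group into the interval [0,1]:

/* rescale each group into [0,1] by using the minimum and range of the group */
proc stdize data=TwoSample out=RangeSample method=range;
   by Group;
   var x;
run;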

Center and scale by using the DATA step

A second way to standardize the data is to use the DATA step to center and scale each variable and each group. You can overwrite the contents of each column, or (as I've done below), you can create a new variable that contains the standardized values. You need to specify values for the location and scale parameters. For these simulated data, I know the population values of the location and scale parameters. The following DATA step uses the parameters to center and scale the data:

/* because we know the population parameters, we can use them to center and scale the data */
/* DATA step method */
data ScaleData;
array mu[2] (&mu1 &mu2);    /* put parameter values into an array */
set TwoSample;
y = (x-mu[Group]) / &sigma;
run;
 
proc means data=ScaleData Mean StdDev Min Max ndec=2;
   class Group;
   var y;
run;

Because I use the population parameters instead of the sample estimates, the transformed variable does not have exactly zero mean or unit variance.

Wide form: Use the LOCATION and SCALE statements in PROC STDIZE

You can perform the same calculations by using PROC STDIZE. If you look up the documentation for the METHOD= option on the PROC STDIZE statement, you will see that the procedure supports the METHOD=IN(ds) method, which enables you to read parameters from a data set. I will focus on the LOCATION and SCALE parameters; see the documentation for other options.

You can specify the parameters in "wide form" or in "long form". In wide form, each column specifies a parameter and you can include multiple rows if you want to perform a BY-group analysis. You can use the LOCATION and SCALE statements to specify the name of the variables that contain the parameters. The following DATA step creates a data set that has three variables: Group, Mu, and Sigma. Each row specifies the location and scale parameter for centering and scaling data in the levels of the Group variable.

data ScaleWide;
Group=1; Mu=&mu1; Sigma=&sigma; output;
Group=2; Mu=&mu2; Sigma=&sigma; output;
run;
 
proc stdize data=TwoSample out=StdWide method=in(ScaleWide);
   by Group;
   var x;
   location mu;    /* column name in the METHOD=IN data set */
   scale    sigma; /* column name in the METHOD=IN data set */
run;
 
proc means data=StdWide Mean StdDev Min Max ndec=2; 
   class Group;
   var x;
run;

The output is identical to the output from the previous section and is not shown.

Long form: Use the _TYPE_ variable

You can also specify the location and scale parameters in long form. For a long-form specification, you need to create a data set that contains a variable named _TYPE_. The parameters for centering and scaling are specified in rows where _TYPE_="LOCATION" and _TYPE_="SCALE", respectively. You can also include one or more BY-group variables (if you are using the BY statement) and one or more columns whose names are the same as the variables you intend to transform.

For example, the example data in this article has a BY-group variable named Group and an analysis variable named X. The following DATA step specifies the values to use to center (_TYPE_='LOCATION') and scale (_TYPE_='SCALE') the X variable for each group:

data ScaleParam;
length _TYPE_ $14;
input Group _TYPE_ x;
datalines;
1 LOCATION 15
1 SCALE     5
2 LOCATION 16
2 SCALE     5
;
 
proc stdize data=TwoSample out=Scale2 method=in(ScaleParam);
   by Group;
   var x;
run;
 
proc means data=Scale2; 
   class Group;
   var x;
run;

Again, the result (not shown) is the same as when the DATA step is used to center and scale the data.

Summary

This article gives examples of using PROC STDIZE in SAS/STAT to center and scale data. Most analysts want to standardize data in groups by using the sample means and sample standard deviations of each group. This is the default behavior of PROC STDIZE. However, you can also center and scale data by using values that are different from the group statistics. For example, you might want to use a grand mean and a pooled standard deviation for the groups. Or, perhaps you are using simulated data and you know the population mean and variance of each group. Regardless, there are three ways to center and scale the data when you know the location and scale parameters:

  • Manually: You can use the DATA step, PROC IML, or PROC SQL to write a formula that centers and scales the data (a PROC SQL sketch follows this list).
  • Wide Form: PROC STDIZE supports the LOCATION and SCALE statements. If you use the METHOD=IN(ds) option on the PROC STDIZE statement, you can read the parameters from a data set.
  • Long Form: If the input data set contains a variable named _TYPE_, PROC STDIZE will use the parameter values where _TYPE_='LOCATION' to center the data and where _TYPE_='SCALE' to scale the data.
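
For completeness, here is a sketch of the "manual" option in PROC SQL. It uses the known parameters from the simulation; the output table name is arbitrary:

/* sketch: center and scale by group with known parameters, using PROC SQL */
proc sql;
   create table ScaleSQL as
   select Group, x,
          case when Group=1 then (x - &mu1) / &sigma
               else              (x - &mu2) / &sigma
          end as y
   from TwoSample;
quit;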

The post 4 ways to standardize data in SAS appeared first on The DO Loop.

7月 202020
 

I recently showed how to compute within-group multivariate statistics by using the SAS/IML language. However, a principle of good software design is to encapsulate functionality and write self-contained functions that compute and return the results. What is the best way to return multiple statistics from a SAS/IML module?

A convenient way to return several statistics is to use lists. Lists were introduced in SAS/IML 14.2 and extended in SAS/IML 14.3. I have previously written about how to use lists to pass arguments into a SAS/IML module. For parallel processing in the iml action in SAS Viya, lists are an essential data structure.

This article shows two things. First, it shows how to use the UNIQUE-LOC technique to compute multivariate statistics for each group in the data. It then shows two ways to return those statistics by using SAS/IML lists. Depending on your application, one way might be more convenient to use than the other.

Returning matrices versus lists

Suppose you want to write a SAS/IML function that computes several statistics for each group in the data. If you are analyzing one variable (univariate data), the statistics are scalar. Therefore, the module can return a matrix in which the rows of the matrix represent the groups in the data and the columns represent the statistics, such as mean, median, variance, skewness, and so forth. This form of the output matches the output from PROC MEANS when you use a CLASS statement to specify the group variable.

For multivariate data, the situation becomes more complex. For example, suppose that for each group you want to compute the number of observations, the mean vector, and the covariance matrix. Although it is possible to pack all this information into a matrix, it requires a lot of bookkeeping to keep track of which row contains which statistics for which group. A simpler method is to pack the multivariate statistics into a list and return the list.
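
As a preview, here is a minimal sketch (with made-up values) of how a named list can bundle a scalar, a vector, and a matrix into a single object:

proc iml;
/* hypothetical statistics for one group */
n = 50;  m = {1.5 2.5};  S = {1 0.3, 0.3 2};
L = [#'n'=n, #'mean'=m, #'cov'=S];        /* pack the items into a named list */
print (L$'mean')[L="mean extracted from the list"];
quit;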

Multivariate statistics for the iris data

Fisher's iris data is often used to demonstrate ideas in multivariate statistics. The following call to PROC SGPLOT creates a scatter plot for two variables (PetalWidth and SepalWidth) and uses the GROUP= option to color the observations by the levels of the Species variable. The plot also overlays 95% prediction ellipses for each group.

title "Iris Data, Colored by Species";
title2 "95% Prediction Ellipse, assuming MVN";
proc sgplot data=Sashelp.Iris;
   ellipse x=PetalWidth y=SepalWidth / group=Species outline fill transparency=0.6;
   scatter x=PetalWidth y=SepalWidth / group=Species transparency=0.2
                       markerattrs=(symbol=CircleFilled size=10) jitter;
   xaxis grid; yaxis grid;
run;

Although PROC SGPLOT computes these ellipses automatically, you might be interested in the multivariate statistics behind the ellipses. The ellipses depend on the number of observations, the mean vector, and the covariance matrix for each group. The following SAS/IML module computes these quantities for each group. The module assembles the statistics into three matrices:

  • The ns vector contains the number of observations. The k_th row contains the number of observations in the k_th group.
  • The means matrix contains the group means. The k_th row contains the mean vector for the k_th group.
  • The covs matrix contains the group covariance matrices. The k_th row contains the covariance matrix for the k_th group. You can use the ROWVEC function to "flatten" a matrix into a row vector. When you want to reconstruct the matrix, you can use the SHAPE function, as discussed in the article "Create an array of matrices in SAS."

These three items are packed into a list and returned by the module.

proc iml;
/* GIVEN Y and a classification variable Group that indicate group membership,
   return a list that contains:
   1. Number of observations in each group
   2. ML estimates of within-group mean
   3. ML estimate of within-group covariance 
   The k_th row of each matrix contains the statistics for the k_th group.
*/
start MLEstMVN(Y, Group);
   d = ncol(Y);
   u = unique(Group);
   G = ncol(u);
   ns    = J(G,   1, 0);           /* k_th row is number of obs in k_th group */
   means = J(G,   d, 0);           /* k_th row is mean of k_th group */
   covs  = J(G, d*d, 0) ;          /* k_th row is cov of k_th group */
 
   do k = 1 to G;
      X = Y[loc(Group=u[k]), ];    /* obs in k_th group */
      n = nrow(X);                 /* number of obs in this group */
      C = (n-1)/n * cov(X);        /* ML estimate of COV does not use the Bessel correction */
      ns[k]      = n;             
      means[k, ] = mean(X);        /* ML estimate of mean */
      covs[k, ]  = rowvec( C );    /* store COV in k_th row */
   end;
   outList = [#'n'=ns, #'mean'=means, #'cov'=Covs, #'group'=u]; /* pack into list */
   return (outList);
finish;
 
/* Example: read the iris data  */
varNames = {'PetalWidth' 'SepalWidth'};
use Sashelp.Iris;
read all var varNames into X;
read all var {'Species'};
close;
 
L = MLEstMVN(X, Species);            /* call function; return named list of results */
ns = L$'n';                          /* extract obs numbers          */
means = L$'mean';                    /* extract mean vectors         */
Covs = L$'cov';                      /* extract covariance matrix    */
lbl = L$'group';
print ns[r=lbl c='n'], 
      means[r=lbl c={'m1' 'm2'}], 
      covs[r=lbl c={C11 C12 C21 C22}];

The multivariate statistics are shown. The k_th row of each matrix contains a statistic for the k_th group. The number of observations and the mean vectors are easy to understand and interpret. The covariance matrix for each group has been flattened into a row vector. You can use the SHAPE function to rearrange the row into a matrix. For example, the covariance matrix for the first group ('Setosa') can be constructed as Setosa_Cov = shape(covs[1,], 2).
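
For example, the following statements (a sketch that assumes the matrices from the previous program are still in memory) reconstruct and print the 2 x 2 covariance matrix for the first group:

/* sketch: reconstruct the covariance matrix for the first group ('Setosa') */
Setosa_Cov = shape(covs[1,], 2);          /* reshape the first row into a 2 x 2 matrix */
print Setosa_Cov[c=varNames r=varNames format=8.4];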

An alternative: A list of lists

It can be convenient to return a list of lists that contains the statistics. Suppose you want to compute m statistics for each of G groups. The function in the previous section returned a list that contains m items and each item had G rows. (I also included an additional list item, the group levels, but that is optional.) Alternatively, you could return a list that has G sublists, where each sublist contains the m statistics for the group. This is shown in the following function, which returns the same information but organizes it in a different way.

/* Alternative: Return a list of G sublists. Each sublist has only 
   the statistics relevant to the corresponding group.    */
start MLEst2(Y, Group);
   u = unique(Group);
   G = ncol(u);
   outList = ListCreate(G);
 
   do k = 1 to G;
      X = Y[loc(Group=u[k]), ];    /* obs in k_th group */
      n = nrow(X);                 /* number of obs in k_th group */
      SL = [#'n'=n, 
            #'mean'=mean(X),        /* ML estimate of mean */
            #'cov'=(n-1)/n*cov(X)]; /* ML estimate of COV; no need to flatten */
      call ListSetItem(outList, k, SL);   /* set sublist as the k_th item */
   end;
   call ListSetName(outList, 1:G, u);     /* give a name to each sublist */
   return (outList);
finish;
 
L2 = MLEst2(X, Species);
package load ListUtil;  /* load a package to print lists */
call Struct(L2);        /* or: call ListPrint(L2);       */

The output from the STRUCT call gives a hierarchical overview of the L2 list. The leftmost column indicates the hierarchy of the lists. The list has G=3 items, which are themselves lists. Each sublist has a name (the group level) and m=3 items: the number of observations ('n'), the mean vector ('mean'), and the within-group MLE covariance matrix ('cov'). In this output, notice that there is no need to flatten the covariance matrix into a row vector. For some applications, this list format might be easier to work with. If you want, for example, the mean vector for the 'Setosa' group, you can access it by using the notation m = L2$'Setosa'$'mean'. Or you could extract the 'Setosa' list and access the statistics as follows:

SetosaList = L2$'Setosa';    /* get the list of statistics for the 'Setosa' group */
m = SetosaList$'mean';       /* get the mean vector */
C = SetosaList$'cov';        /* get the MLE covariance matrix */
print m[c=varNames], C[c=varNames r=varNames];

Summary

This article shows how to do two things:

  • How to use the UNIQUE-LOC technique to compute multivariate statistics for each group in the data.
  • How to return those statistics by using SAS/IML lists. Two ways to organize the output are presented. The simplest method is to stack the statistics into matrices where the k_th row represents the k_th group. You can then return a list of the matrices. The second method is to return a list of sublists, where the k_th sublist contains the statistics for the k_th group. The first method requires that you "flatten" matrices into row vectors whereas the second method does not.

The post Compute within-group multivariate statistics and store them in a list appeared first on The DO Loop.

7月 012020
 

A previous article discusses the pooled variance for two or more groups of univariate data. The pooled variance is often used during a t test of two independent samples. For multivariate data, the analogous concept is the pooled covariance matrix, which is an average of the sample covariance matrices of the groups. If you assume that the covariances within the groups are equal, the pooled covariance matrix is an estimate of the common covariance. This article shows how to compute and visualize a pooled covariance matrix in SAS. It explains how the pooled covariance relates to the within-group covariance matrices. It discusses a related topic, called the between-group covariance matrix.

The within-group matrix is sometimes called the within-class covariance matrix because a classification variable is used to identify the groups. Similarly, the between-group matrix is sometimes called the between-class covariance matrix.

Visualize within-group covariances

Suppose you want to analyze the covariance in the groups in Fisher's iris data (the Sashelp.Iris data set in SAS). The data set contains four numeric variables, which measure the length and width of two flower parts, the sepal and the petal. Each observation is for a flower from an iris species: Setosa, Versicolor, or Virginica. The Species variable in the data identifies observations that belong to each group, and each group has 50 observations. The following call to PROC SGPLOT creates two scatter plots and overlays prediction ellipses for two pairs of variables:

title "68% Prediction Ellipses for Iris Data";
proc sgplot data=Sashelp.Iris;
   scatter x=SepalLength y=SepalWidth / group=Species transparency=0.5;
   ellipse x=SepalLength y=SepalWidth / group=Species alpha=0.32 lineattrs=(thickness=2);
run;
proc sgplot data=Sashelp.Iris;
   scatter x=PetalLength y=PetalWidth / group=Species transparency=0.5;
   ellipse x=PetalLength y=PetalWidth / group=Species alpha=0.32 lineattrs=(thickness=2);
run;

The ellipses enable you to visually investigate whether the variance of the data within the three groups appears to be the same. For these data, the answer is no because the ellipses have different shapes and sizes. Some of the prediction ellipses have major axes that are oriented more steeply than others. Some of the ellipses are small, others are relatively large.

You might wonder why the graph shows a 68% prediction ellipse for each group. Recall that prediction ellipses are a multivariate generalization of "units of standard deviation." If you assume that measurements in each group are normally distributed, 68% of random observations are within one standard deviation from the mean. So for multivariate normal data, a 68% prediction ellipse is analogous to +/-1 standard deviation from the mean.
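
If you want to check the "68%" analogy numerically, the following DATA _NULL_ step is a quick sketch. It computes the probability that a univariate normal observation is within one standard deviation of the mean and the corresponding chi-square quantile that defines the bivariate ellipse:

data _null_;
   p  = cdf("Normal", 1) - cdf("Normal", -1);  /* about 0.6827 */
   r2 = quantile("ChiSq", p, 2);               /* squared Mahalanobis radius of the 2-D ellipse */
   put p= r2=;
run;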

The pooled covariance is an average of within-group covariances

The pooled covariance is used in linear discriminant analysis and other multivariate analyses. It combines (or "pools") the covariance estimates within subgroups of data. The pooled covariance is one of the methods used by Friendly and Sigal (TAS, 2020) to visualize homogeneity tests for covariance matrices.

Suppose you collect multivariate data for \(k\) groups and \(S_i\) is the sample covariance matrix for the \(n_i\) observations within the \(i\)th group. If you believe that the groups have a common variance, you can estimate it by using the pooled covariance matrix, which is a weighted average of the within-group covariances:
\(S_p = \Sigma_{i=1}^k (n_i-1)S_i / \Sigma_{i=1}^k (n_i - 1)\)

If all groups have the same number of observations, then the formula simplifies to \(\Sigma_{i=1}^k S_i / k\), which is the simple average of the matrices. If the group sizes are different, then the pooled variance is a weighted average, where larger groups receive more weight than smaller groups.
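
You can verify the simplification with a small numeric sketch; the covariance matrices below are made up:

proc iml;
S1 = {2 1, 1 3};   S2 = {4 0, 0 2};   S3 = {3 1, 1 1};  /* hypothetical group covariances */
n = 50;  k = 3;  N = k*n;                                /* equal group sizes */
Sp_weighted = ( (n-1)*S1 + (n-1)*S2 + (n-1)*S3 ) / (N - k);
Sp_simple   = (S1 + S2 + S3) / k;
print Sp_weighted, Sp_simple;        /* the two matrices are identical */
quit;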

Compute the pooled covariance in SAS

In SAS, you can often compute something in two ways. The fast-and-easy way is to find a procedure that does the computation. A second way is to use the SAS/IML language to compute the answer yourself. When I compute something myself (and get the same answer as the procedure!), I increase my understanding.

Suppose you want to compute the pooled covariance matrix for the iris data. The fast-and-easy way to compute a pooled covariance matrix is to use PROC DISCRIM. The procedure supports the OUTSTAT= option, which writes many multivariate statistics to a data set, including the within-group covariance matrices, the pooled covariance matrix, and something called the between-group covariance. (It also writes analogous quantities for centered sum-of-squares and crossproduct (CSSCP) matrices and for correlation matrices.)

proc discrim data=sashelp.iris method=normal pool=yes outstat=Cov noprint;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
 
proc print data=Cov noobs; 
   where _TYPE_ = "PCOV";
   format _numeric_ 6.2;
   var _TYPE_ _NAME_ Sepal: Petal:;
run;

The table shows the "average" covariance matrix, where the average is across the three species of flowers.

Within-group covariance matrices

The same output data set contains the within-group and the between-group covariance matrices. The within-group matrices are easy to understand. They are the covariance matrices for the observations in each group. Accordingly, there are three such matrices for these data: one for the observations where Species="Setosa", one for Species="Versicolor", and one for Species="Virginica". The following call to PROC PRINT displays the three matrices:

proc print data=Cov noobs; 
   where _TYPE_ = "COV" and Species^=" ";
   format _numeric_ 6.2;
run;

The output is not particularly interesting, so it is not shown. The matrices are the within-group covariances that were visualized earlier by using prediction ellipses.

Visual comparison of the pooled covariance and the within-group covariance

Friendly and Sigal (2020, Figure 1) overlay the prediction ellipses for the pooled covariance on the prediction ellipses for the within-group covariances. A recreation of Figure 1 in SAS is shown below. You can use the SAS/IML language to draw prediction ellipses from covariance matrices.

The shaded region is the prediction ellipse for these two variables in the pooled covariance matrix. It is centered at the weighted average of the group means. You can see that the pooled ellipse looks like an average of the other ellipses. This graph shows only one pair of variables, but see Figure 2 of Friendly and Sigal (2020) for a complete scatter plot matrix that compares the pooled covariance to the within-group covariance for each pair of variables.
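
The following statements are a minimal sketch (not the program that created the figure) of how to compute points on a prediction ellipse from a mean vector and a 2 x 2 covariance matrix. The mean, covariance, and data set name are made up for illustration:

proc iml;
m = {5.0 3.4};                           /* hypothetical mean vector */
S = {0.12 0.10, 0.10 0.14};              /* hypothetical 2 x 2 covariance matrix */
prob = 0.68;                             /* prediction level */
r = sqrt( quantile("ChiSq", prob, 2) );  /* radius in standardized coordinates */
t = do(0, 2*constant("pi"), 0.05)`;      /* column vector of angles */
U = root(S);                             /* Cholesky root: S = U`*U */
E = r # (cos(t) || sin(t)) * U + m;      /* each row is an (x,y) point on the ellipse */
create Ellipse from E[colname={'ex' 'ey'}];  append from E;  close Ellipse;
quit;
/* plot the Ellipse data set, for example with a SERIES statement in PROC SGPLOT */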

Between-group covariance matrices

Another matrix in the PROC DISCRIM output is the so-called between-group covariance matrix. Intuitively, the between-group covariance matrix is related to the difference between the full covariance matrix of the data (where the subgroups are ignored) and the pooled covariance matrix (where the subgroups are averaged). The precise definition is given in the next section. For now, here is how to print the between-group covariance matrix from the output of PROC DISCRIM:

proc print data=Cov noobs; 
   where _TYPE_ = "BCOV";
   format _numeric_ 6.2;
run;

How to compute the pooled and between-group covariance

If I can compute a quantity "by hand," then I know that I truly understand it. Thus, I wrote a SAS/IML program that reproduces the computations made by PROC DISCRIM. The following steps are required to compute each of these matrices from first principles.

  1. For each group, compute the covariance matrix (S_i) of the observations in that group.
  2. Note that the quantity (n_i - 1)*S_i is the centered sum-of-squares and crossproducts (CSSCP) matrix for the group. Let M be the sum of the CSSCP matrices. The sum is the numerator for the pooled covariance.
  3. Form the pooled covariance matrix as S_p = M / (N-k).
  4. Let C be the CSSCP matrix for the full data, which is (N-1) times the covariance matrix of the full data. The between-group covariance matrix is BCOV = (C - M) * k / (N*(k-1)).

You can use the UNIQUE-LOC trick to iterate over the data for each group. The following SAS/IML program implements these computations:

/* Compute a pooled covariance matrix when observations
   belong to k groups with sizes n1, n2, ..., nk, where n1+n2+...+nk = N
*/
proc iml;
varNames = {'SepalLength' 'SepalWidth' 'PetalLength' 'PetalWidth'};
use Sashelp.iris;
   read all var varNames into Z;
   read all var "Species" into Group;
close;
 
/* assume complete cases, otherwise remove rows with missing values */
N   = nrow(Z);
 
/* compute the within-group covariance, which is the covariance for the observations in each group */
u = unique(Group);
k = ncol(u);                /* number of groups */
p   = ncol(varNames);       /* number of variables */
M = j(p, p, 0);             /* sum of within-group CSSCP matrices */
do i = 1 to k;
   idx = loc(Group = u[i]); /* find rows for this group */
   X = Z[idx,];             /* extract obs for i_th group */
   n_i = nrow(X);           /* n_i = size of i_th group */
   S   = cov(X);            /* within-group cov */
   /* accumulate the weighted sum of within-group covariances */
   M = M + (n_i-1) * S;     /* (n_i-1)*S is centered X`*X */
end;
 
/* The pooled covariance is an average of the within-class covariance matrices. */
Sp = M / (N-k);
print Sp[L="Pooled Cov" c=varNames r=VarNames format=6.2];
 
/* The between-class CSSCP is the difference between total CSSCP and the sum of the 
   within-group CSSCPs. The SAS doc for PROC DISCRIM defines the between-class 
   covariance matrix as the between-class SSCP matrix divided by N*(k-1)/k, 
   where N is the number of observations and k is the number of classes.
*/
/* the total covariance matrix ignores the groups */
C = (N-1)*cov(Z);    
BCSSCP = C - M;               /* between = Full - Sum(Within) */
BCov = BCSSCP * k/( N*(k-1) );
print BCov[L="Between Cov" c=varNames r=VarNames format=6.2];

Success! The SAS/IML program shows the computations that are needed to reproduce the pooled and between-group covariance matrices. The results are the same as are produced by PROC DISCRIM.

Summary

In multivariate ANOVA, you might assume that the within-group covariance is constant across different groups in the data. The pooled covariance is an estimate of the common covariance. It is a weighted average of the sample covariances for each group, where the larger groups are weighted more heavily than smaller groups. I show how to visualize the pooled covariance by using prediction ellipses.

You can use PROC DISCRIM to compute the pooled covariance matrix and other matrices that represent within-group and between-group covariance. I also show how to compute the matrices from first principles by using the SAS/IML language. You can download the SAS program that performs the computations and creates the graphs in this article.

The post Pooled, within-group, and between-group covariance matrices appeared first on The DO Loop.