
July 1, 2020
 

A previous article discusses the pooled variance for two or more groups of univariate data. The pooled variance is often used during a t test of two independent samples. For multivariate data, the analogous concept is the pooled covariance matrix, which is an average of the sample covariance matrices of the groups. If you assume that the covariances within the groups are equal, the pooled covariance matrix is an estimate of the common covariance. This article shows how to compute and visualize a pooled covariance matrix in SAS. It explains how the pooled covariance relates to the within-group covariance matrices and discusses a related topic, the between-group covariance matrix.

The within-group matrix is sometimes called the within-class covariance matrix because a classification variable is used to identify the groups. Similarly, the between-group matrix is sometimes called the between-class covariance matrix.

Visualize within-group covariances

Suppose you want to analyze the covariance in the groups in Fisher's iris data (the Sashelp.Iris data set in SAS). The data set contains four numeric variables, which measure the length and width of two flower parts, the sepal and the petal. Each observation is for a flower from an iris species: Setosa, Versicolor, or Virginica. The Species variable in the data identifies observations that belong to each group, and each group has 50 observations. The following call to PROC SGPLOT creates two scatter plots and overlays prediction ellipses for two pairs of variables:

title "68% Prediction Ellipses for Iris Data";
proc sgplot data=Sashelp.Iris;
   scatter x=SepalLength y=SepalWidth / group=Species transparency=0.5;
   ellipse x=SepalLength y=SepalWidth / group=Species alpha=0.32 lineattrs=(thickness=2);
run;
proc sgplot data=Sashelp.Iris;
   scatter x=PetalLength y=PetalWidth / group=Species transparency=0.5;
   ellipse x=PetalLength y=PetalWidth / group=Species alpha=0.32 lineattrs=(thickness=2);
run;

The ellipses enable you to visually investigate whether the variance of the data within the three groups appears to be the same. For these data, the answer is no because the ellipses have different shapes and sizes. Some of the prediction ellipses have major axes that are oriented more steeply than others. Some of the ellipses are small; others are relatively large.

You might wonder why the graph shows a 68% prediction ellipse for each group. Recall that prediction ellipses are a multivariate generalization of "units of standard deviation." If you assume that measurements in each group are normally distributed, 68% of random observations are within one standard deviation from the mean. So for multivariate normal data, a 68% prediction ellipse is analogous to +/-1 standard deviation from the mean.

The pooled covariance is an average of within-group covariances

The pooled covariance is used in linear discriminant analysis and other multivariate analyses. It combines (or "pools") the covariance estimates within subgroups of data. The pooled covariance is one of the methods used by Friendly and Sigal (TAS, 2020) to visualize homogeneity tests for covariance matrices.

Suppose you collect multivariate data for \(k\) groups and \(S_i\) is the sample covariance matrix for the \(n_i\) observations within the \(i\)th group. If you believe that the groups have a common variance, you can estimate it by using the pooled covariance matrix, which is a weighted average of the within-group covariances:
\(S_p = \Sigma_{i=1}^k (n_i-1)S_i / \Sigma_{i=1}^k (n_i - 1)\)

If all groups have the same number of observations, then the formula simplifies to \(\Sigma_{i=1}^k S_i / k\), which is the simple average of the matrices. If the group sizes are different, then the pooled variance is a weighted average, where larger groups receive more weight than smaller groups.

Compute the pooled covariance in SAS

In SAS, you can often compute something in two ways. The fast-and-easy way is to find a procedure that does the computation. A second way is to use the SAS/IML language to compute the answer yourself. When I compute something myself (and get the same answer as the procedure!), I increase my understanding.

Suppose you want to compute the pooled covariance matrix for the iris data. The fast-and-easy way to compute a pooled covariance matrix is to use PROC DISCRIM. The procedure supports the OUTSTAT= option, which writes many multivariate statistics to a data set, including the within-group covariance matrices, the pooled covariance matrix, and something called the between-group covariance. (It also writes analogous quantities for centered sum-of-squares and crossproduct (CSSCP) matrices and for correlation matrices.)

proc discrim data=sashelp.iris method=normal pool=yes outstat=Cov noprint;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
 
proc print data=Cov noobs; 
   where _TYPE_ = "PCOV";
   format _numeric_ 6.2;
   var _TYPE_ _NAME_ Sepal: Petal:;
run;

The table shows the "average" covariance matrix, where the average is across the three species of flowers.

Within-group covariance matrices

The same output data set contains the within-group and the between-group covariance matrices. The within-group matrices are easy to understand. They are the covariance matrices for the observations in each group. Accordingly, there are three such matrices for these data: one for the observations where Species="Setosa", one for Species="Versicolor", and one for Species="Virginica". The following call to PROC PRINT displays the three matrices:

proc print data=Cov noobs; 
   where _TYPE_ = "COV" and Species^=" ";
   format _numeric_ 6.2;
run;

The output is not particularly interesting, so it is not shown. The matrices are the within-group covariances that were visualized earlier by using prediction ellipses.

Visual comparison of the pooled covariance and the within-group covariance

Friendly and Sigal (2020, Figure 1) overlay the prediction ellipses for the pooled covariance on the prediction ellipses for the within-group covariances. A recreation of Figure 1 in SAS is shown below. You can use the SAS/IML language to draw prediction ellipses from covariance matrices.
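For reference, the following SAS/IML statements sketch one way to trace the boundary of a prediction ellipse from a bivariate mean vector and covariance matrix. This is a minimal sketch rather than the program that produced the figure: it uses a chi-square quantile to scale the ellipse, and the numeric values for the mean and covariance matrix are illustrative placeholders, not the actual within-group estimates.

proc iml;
start PredEllipse(m, S, alpha);
   pi = constant("pi");
   t = do(0, 2*pi, 0.05)`;                    /* angles around the ellipse */
   call eigen(lambda, E, S);                  /* eigenvalues and eigenvectors of S */
   c = sqrt( quantile("ChiSq", 1-alpha, 2) ); /* scale factor for a (1-alpha) ellipse */
   xy = c # ( cos(t)*sqrt(lambda[1])*E[,1]` + sin(t)*sqrt(lambda[2])*E[,2]` );
   return( xy + repeat(m, nrow(xy), 1) );     /* translate the ellipse to the mean */
finish;
 
m = {5.0 3.4};                          /* illustrative mean vector */
S = {0.12 0.10, 0.10 0.14};             /* illustrative 2x2 covariance matrix */
xy = PredEllipse(m, S, 0.32);           /* 68% prediction ellipse */
call series(xy[,1], xy[,2]);
quit;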

The shaded region is the prediction ellipse for these two variables in the pooled covariance matrix. It is centered at the weighted average of the group means. You can see that the pooled ellipse looks like an average of the other ellipses. This graph shows only one pair of variables, but see Figure 2 of Friendly and Sigal (2020) for a complete scatter plot matrix that compares the pooled covariance to the within-group covariance for each pair of variables.

Between-group covariance matrices

Another matrix in the PROC DISCRIM output is the so-called between-group covariance matrix. Intuitively, the between-group covariance matrix is related to the difference between the full covariance matrix of the data (where the subgroups are ignored) and the pooled covariance matrix (where the subgroups are averaged). The precise definition is given in the next section. For now, here is how to print the between-group covariance matrix from the output of PROC DISCRIM:

proc print data=Cov noobs; 
   where _TYPE_ = "BCOV";
   format _numeric_ 6.2;
run;

How to compute the pooled and between-group covariance

If I can compute a quantity "by hand," then I know that I truly understand it. Thus, I wrote a SAS/IML program that reproduces the computations made by PROC DISCRIM. The following steps are required to compute each of these matrices from first principles.

  1. For each group, compute the covariance matrix (S_i) of the observations in that group.
  2. Note that the quantity (n_i - 1)*S_i is the centered sum-of-squares and crossproducts (CSSCP) matrix for the group. Let M be the sum of the CSSCP matrices. The sum is the numerator for the pooled covariance.
  3. Form the pooled covariance matrix as S_p = M / (N-k).
  4. Let C be the CSSCP matrix for the full data (which is (N-1)*(Full Covariance)). The between-group covariance matrix is BCOV = (C - M) * k / (N*(k-1)).

You can use the UNIQUE-LOC trick to iterate over the data for each group. The following SAS/IML program implements these computations:

/* Compute a pooled covariance matrix when observations
   belong to k groups with sizes n1, n2, ..., nk, where n1+n2+...+nk = N
*/
proc iml;
varNames = {'SepalLength' 'SepalWidth' 'PetalLength' 'PetalWidth'};
use Sashelp.iris;
   read all var varNames into Z;
   read all var "Species" into Group;
close;
 
/* assume complete cases, otherwise remove rows with missing values */
N   = nrow(Z);
 
/* compute the within-group covariance, which is the covariance for the observations in each group */
u = unique(Group);
k = ncol(u);                /* number of groups */
p   = ncol(varNames);       /* number of variables */
M = j(p, p, 0);             /* sum of within-group CSSCP matrices */
do i = 1 to k;
   idx = loc(Group = u[i]); /* find rows for this group */
   X = Z[idx,];             /* extract obs for i_th group */
   n_i = nrow(X);           /* n_i = size of i_th group */
   S   = cov(X);            /* within-group cov */
   /* accumulate the weighted sum of within-group covariances */
   M = M + (n_i-1) * S;     /* (n_i-1)*S is centered X`*X */
end;
 
/* The pooled covariance is an average of the within-class covariance matrices. */
Sp = M / (N-k);
print Sp[L="Pooled Cov" c=varNames r=VarNames format=6.2];
 
/* The between-class CSSCP is the difference between total CSSCP and the sum of the 
   within-group CSSCPs. The SAS doc for PROC DISCRIM defines the between-class 
   covariance matrix as the between-class SSCP matrix divided by N*(k-1)/k, 
   where N is the number of observations and k is the number of classes.
*/
/* the total covariance matrix ignores the groups */
C = (N-1)*cov(Z);    
BCSSCP = C - M;               /* between = Full - Sum(Within) */
BCov = BCSSCP * k/( N*(k-1) );
print BCov[L="Between Cov" c=varNames r=VarNames format=6.2];

Success! The SAS/IML program shows the computations that are needed to reproduce the pooled and between-group covariance matrices. The results are the same as those produced by PROC DISCRIM.

Summary

In multivariate ANOVA, you might assume that the within-group covariance is constant across different groups in the data. The pooled covariance is an estimate of the common covariance. It is a weighted average of the sample covariances for each group, where the larger groups are weighted more heavily than smaller groups. I show how to visualize the pooled covariance by using prediction ellipses.

You can use PROC DISCRIM to compute the pooled covariance matrix and other matrices that represent within-group and between-group covariance. I also show how to compute the matrices from first principles by using the SAS/IML language. You can download the SAS program that performs the computations and creates the graphs in this article.

The post Pooled, within-group, and between-group covariance matrices appeared first on The DO Loop.

June 29, 2020
 

The first time I saw a formula for the pooled variance, I was quite confused. It looked like Frankenstein's monster, assembled from bits and pieces of other quantities and brought to life by a madman. However, the pooled variance does not have to be a confusing monstrosity. The verb "to pool" means to combine resources from several sources, as in, "if the siblings pool their money, they can buy a nice gift for their mother." Thus a "pooled variance" combines information from several subgroups of the data in order to obtain a better estimate. This article shows how to understand the pooled variance, which is an average of the sample variances for subgroups of the data.

The following graph visualizes the pooled variance and the variances within groups. In the graph, there are three groups. The sample variance within each group is plotted as a blue marker. The pooled variance is indicated by a horizontal line. The pooled variance appears to be an average of the three sample variances. The exact formulas and the data for this graph are explained in subsequent sections.

The pooled variance is an average of group variances

Most students first encounter the pooled variance in a statistics course when learning about two-sample t tests. In a two-sample t test, you have data in two groups and you want to test whether the means of the two groups are different. In order to run a two-sample t test, you need to decide whether you think the variances of the two groups are equal. If you think the group variances are equal, you compute the pooled variance, which estimates the common variance. You use the pooled variance estimate to compute the t statistic.

The pooled variance combines (or "pools") the variance estimates within the individual groups. The pooled variance is a better estimate of the (unknown) common group variance than either of the individual group variances. If each group has the same number of observations, then the pooled variance is a simple average. If the group sizes are different, then the pooled variance is a weighted average, where larger groups receive more weight than smaller groups.

For the two-sample case, let \(s_1^2\) and \(s_2^2\) be the sample variances within the groups A and B, respectively. Let \(n_1\) and \(n_2\), respectively, be the number of observations in the groups. If you assume that the data in each group have the same variance, how can you estimate that common variance? One way is to "pool" the variances of each group and compute a weighted average of the variances. For two groups, the pooled variance is
\(s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}\)
The numerator is the weighted sum of the group variances. Dividing by the sum of the weights means that the pooled variance is the weighted average of the two quantities.

Notice that if \(n_1 = n_2\), then the formula simplifies. When the group sizes are equal, the pooled variance reduces to \(s_p^2 = (s_1^2 + s_2^2)/2\), which is the average of the two variances.

The idea behind the pooled variance generalizes to multiple groups. Suppose you collect data for \(k\) groups and \(s_i^2\) is the sample variance within the \(i\)th group, which has \(n_i\) observations. If you believe that the groups have a common variance, how might you estimate it? One way is to compute the pooled variance as a weighted average of the group variances:
\(s_p^2 = \Sigma_{i=1}^k (n_i-1)s_i^2 / \Sigma_{i=1}^k (n_i - 1)\)
Again, if all groups have the same number of observations, then the formula simplifies to \(\Sigma_{i=1}^k s_i^2 / k\), which is the average of the \(k\) variances.

Visualize the pooled variance

Formulas are necessary for computation, but I like to think in pictures. You can draw a graph that shows how the pooled variance relates to the variances of the groups.

Let's use an example that has three groups: A, B, and C. The following data (adapted from an example in the SAS documentation) are the ozone measurements (Y, in parts per billion) at three different collection sites (A, B, and C):

Site  n   Ozone (ppb)
==== === =============================================
A    22   4 6 3 4 7 8 2 3 4 1 3 8 9 5 6 4 6 3 4 7 3 7
B    15   5 3 6 2 1 2 4 3 2 4 6 4 6 3 6
C    18   8 9 7 8 6 7 6 7 9 8 9 8 7 8 5 8 9 7

The following graph shows a box plot of the ozone levels at each site. The height of each box gives a rough estimate of the variance in each group. The mean and variance for sites A and B appear to be similar to each other. However, the mean at site C appears to be larger and the variance seems smaller, compared to the other sites.

You can use your favorite statistical software to estimate the variance of the data for each site. The following table is produced in SAS by using the UNIVARIATE procedure and the CIBASIC option:
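A call along the following lines produces that table. This is a sketch: the data set name (Ozone) and the variable names (Y and Site) are assumptions based on the description of the data.

/* sketch: confidence intervals for the variance within each site */
proc univariate data=Ozone cibasic;
   class Site;
   var Y;
   ods select BasicIntervals;   /* show only the basic confidence intervals */
run;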

The table shows an estimate for the variance of the data within each group. Although the smallest sample variance (Group C: 1.32) seems much smaller than the largest sample variance (Group A: 4.69), notice that the 95% confidence intervals overlap. A parameter value such as 2.8 or 2.9 would simultaneously be in all three confidence intervals. If we assume that the variances of the groups are equal, the pooled variance formula provides a way to estimate the common variance. Applying the formula, the estimate for the pooled variance for these data is
\(s_p^2 = (21*4.69 + 14*2.89 + 17*1.32) / (21+14+17) = 3.10\)
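If you want to verify the arithmetic, the following SAS/IML statements compute the same weighted average from the group sizes and the sample variances reported in the table. This is a small sketch for illustration; the variance values are the rounded estimates quoted above.

proc iml;
n  = {22, 15, 18};                 /* group sizes for sites A, B, and C */
s2 = {4.69, 2.89, 1.32};           /* sample variances for the groups */
w  = n - 1;                        /* weights = degrees of freedom */
sp2 = sum(w # s2) / sum(w);        /* pooled variance = weighted average */
print sp2;                         /* approximately 3.10 */
quit;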

The graph at the top of this article visualizes the information in the table and uses a reference line to indicate the pooled variance. The blue markers indicate the sample variances of each group. The confidence intervals for the population variances are shown by the vertical lines. The pooled variance is indicated by the horizontal reference line. It is the weighted average of the sample variances. If you think that all groups have the same variance, the pooled variance estimates that common variance.

Summary

In two-sample t tests and ANOVA tests, you often assume that the variance is constant across different groups in the data. The pooled variance is an estimate of the common variance. It is a weighted average of the sample variances for each group, where the larger groups are weighted more heavily than smaller groups.

You can download a SAS program that computes the pooled variance and creates the graphs in this article. Although the pooled variance provides an estimate for a common variance among groups, it is not always clear when the assumption of a common variance is valid. The visualization in this article can help, or you can perform a formal "homogeneity of variance" test as part of an ANOVA analysis.

The post What is a pooled variance? appeared first on The DO Loop.

June 8, 2020
 

A SAS customer asked how to specify interaction effects between a classification variable and a spline effect in a SAS regression procedure. There are at least two ways to do this. If the SAS procedure supports the EFFECT statement, you can build the interaction term in the MODEL statement. For procedures that do not support the EFFECT statement, you can generate the interaction effects yourself. Because the customer's question was specifically about the GAMPL procedure, which fits generalized additive models, I also show a third technique, which enables you to model interactions in the GAMPL procedure.

To illustrate the problem, consider the following SAS DATA step, which simulates data from two different response functions:

data Have;
call streaminit(1);
Group="A"; 
do x = 1 to 10 by .1;
   y =    2*x + sin(x) + rand("Normal"); output;
end;
Group="B";
do x = 1 to 10 by .1;
   y = 10 - x - sin(x) + rand("Normal"); output;
end;
run;

The graph shows the data and overlays a regression model. The model predicts the response for each level of the Group variable. Notice that the two curves have different shapes, which means that there is an interaction between the Group variable and the x variable. The goal of this article is to show how to use a spline effect to predict the response for each group.

Use the grammar of the procedure to form interactions

Many SAS regression procedures support the EFFECT statement, the CLASS statement, and enable you to specify interactions on the MODEL statement. Examples include the GLMMIX, GLMSELECT, LOGISTIC, QUANTREG, and ROBUSTREG procedures. For example, the following call to PROC GLMSELECT specifies several model effects by using the "stars and bars" syntax:

  • The syntax Group | x includes the classification effect (Group), a linear effect (x), and an interaction effect (Group*x).
  • The syntax Group * spl includes an interaction effect between the classification variable and the spline.
title "GLMSELECT Model with Spline and Interaction Effects";
ods select none;
proc glmselect data=Have
        outdesign(addinputvars fullmodel)=SplineBasis; /* optional: write design matrix to data set */
   class Group;
   effect spl = spline(x);
   model y = Group|x  Group*spl / selection=none;
   output out=SplineOut predicted=p;            /* output predicted values */
quit;
ods select all;
 
%macro PlotInteraction(dsname);
   proc sgplot data=&dsname;
      scatter x=x y=y / group=Group;
      series x=x y=p / group=Group curvelabel;
      keylegend / location=inside position=E;
   run;
%mend;
%PlotInteraction(SplineOut);

The predicted curves are shown in the previous section. I wrapped the call to PROC SGPLOT in a macro so that I can create the same plot for other models.

Output the design matrix for the interactions

Some SAS procedures do not support the EFFECT statement. For these procedures, you cannot form effects like Group*spline(x) within the procedure. However, the GLMSELECT procedure supports the OUTDESIGN= option, which writes the design matrix to a SAS data set. I used the option in the previous section to create the SplineBasis data set. The data set includes variables for the spline basis and for the interaction between the Group variable and the spline basis. Therefore, you can use the splines and interactions with splines in other SAS procedures.

For example, the GLM procedure does not support the EFFECT statement, so you cannot generate a spline basis within the procedure. However, you can read the design matrix from the previous call to PROC GLMSELECT. The interaction effects are named spl_Group_1_A, spl_Group_1_B, spl_Group_2_A, ...., where the suffix "_i_j" indicates the interaction effect between the i_th spline basis and the j_th level of the Group variable. You can type the names of the design variables, or you can use the "colon notation" to match all variables that begin with the prefix "spl_Group". The following call to PROC GLM computes the same model as in the previous section:

proc glm data=SplineBasis;
   class Group;
   model y = Group|x  spl_Group: ;           /* spl_Group: => spl_Group_1_A,  spl_Group_1_B, ... */
   output out=GLMOut predicted=p;            /* output predicted values */
quit;
 
/* show that the predicted values are the same */
data Comp;
   merge SplineOut GLMOut(keep=p rename=(p=p2));
   diff = p - p2;
run;
proc means data=comp Mean Min Max;  var diff;  run;

The output of PROC MEANS shows that the two models are the same. There is no difference between the predicted values from PROC GLM (which reads the design matrix) and the values from PROC GLMSELECT (which reads the raw data).

The splines of the interactions versus the interactions of the splines

Some nonparametric regression procedures, such as the GAMPL procedure, have their own syntax to generate spline effects. In fact, PROC GAMPL uses thin-plate splines, which are different from the splines that are supported by the EFFECT statement. Recently, a SAS customer asked how he could model interaction terms in PROC GAMPL.

The GAMPL procedure supports semiparametric models. In the parametric portion of the model, it is easy to specify interactions by using the standard "stars and bars" notation in SAS. However, the SPLINE() transformation in PROC GAMPL does not support interactions between a classification variable and a continuous variable. To be specific, consider the following semiparametric model:

title "No Interaction Between Classification and Spline Effects";
proc gampl data=Have; *plots=components;
   class Group;
   model Y = param(Group | x)     /* parametric terms */
             spline(x);           /* nonparametric, but no interaction */
   output out=GamPLOut pred=p r=r;
   id Y X Group;
run;
 
%PlotInteraction(GamPLOut);

The output shows that the model does not capture the nonlinear features of the data. This is because the spline term is computed for all values of x, not separately for each group. When the groups are combined, the spline effect is essentially linear, which is probably not the best model for these data.

Unfortunately, in SAS/STAT 15.1, PROC GAMPL does not support a syntax such as "spline(Group*x)", which would enable the shape of the nonparametric curve to vary between levels of the Group variable. In fact, the 15.1 documentation states, "only continuous variables (not classification variables) can be specified in spline-effects."

However, if your goal is to use a flexible model to fit the data, there is a model that might work. Instead of the interaction between the Group variable and spline(x), you could use the spline effects for the interaction between the Group variable and x. These are not the same models: the interaction with the splines is not equal to the spline of the interactions! However, let's see how to fit this model to these data.

The idea is to form the interaction effect Group*x in a separate step. You can use the design matrix (SplineBasis) that was created in the first section, but to make the process as transparent as possible, I am going to manually generate the dummy variables for the interaction effect Group*x:

data Interact; 
set Have; 
x_Group_A = x*(Group='A'); 
x_Group_B = x*(Group='B');
run;

You can use spline effects for the variables x_Group_A and x_Group_B in regression models. The following model is not the same as the previous models, but it is a flexible model that seems to fit these data:

title "Interaction Between Classification and Spline Effects";
proc gampl data=Interact plots=components;
   class Group;
   model Y = param(Group | x) 
             spline(x_Group_A) spline(x_Group_B);  /* splines of interactions */
   output out=GamPLOut pred=p;
   id Y X Group;
run;
 
title "GAMPL with Interaction Effects";
%PlotInteraction(GamPLOut);

The graph shows that (visually) the model seems to capture the undulations in the data as well as the earlier model. Although PROC GAMPL does not currently support interactions between classification variables and spline effects, this trick provides a way to model the data.

The post Interactions with spline effects in regression models appeared first on The DO Loop.

June 3, 2020
 

I recently read an article that describes ways to compute confidence intervals for the difference in a percentile between two groups. In Eaton, Moore, and MacKenzie (2019), the authors describe a problem in hydrology. The data are the sizes of pebbles (grains) in rivers at two different sites. The authors ask whether the p_th percentile of size is significantly different between the two sites. They also want a confidence interval for the difference. The median (50th percentile) and 84th percentile were the main focus of their paper.

For those of us who are not hydrologists, consider a sample from group 'A' and a sample from group 'B'. The problem is to test whether the p_th percentile is different in the two groups and to construct a confidence interval for the difference.

The authors show two methods: a binomial-based probability model (Section 2) and a bootstrap confidence interval (Section 3). However, there is another option: use quantile regression to estimate the difference in the quantiles and to construct a confidence interval. In this article, I show the computation for the 50th and 84th percentiles and compare the results to a bootstrap estimate. You can download the SAS program that creates the analyses and graphs in this article.

If you are not familiar with quantile regression, see my previous article about how to use quantile regression to estimate the difference between two medians.

A data distribution of particle sizes

For convenience, I simulated samples for the sizes of pebbles in two stream beds:

  • For Site 'A', the sizes are lognormally distributed with μ=0.64 and σ=1. For the LN(0.64, 1) distribution, the median pebble size is 1.9 mm and the size for the 84th percentile is 5.1 mm.
  • For Site 'B', the sizes are lognormally distributed with μ=0.53 and σ=1. For the LN(0.53, 1) distribution, the median pebble size is 1.7 mm and the size for the 84th percentile is 4.6 mm.

I assume that sizes are measured to the nearest 0.1 mm. I simulated 500 pebbles from Site 'A' and 300 pebbles from Site 'B'. You can use PROC UNIVARIATE in SAS to compute a comparative histogram and to compute the percentiles. The sample distributions are shown below:
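For reference, the following DATA step sketches one way to simulate data of this form. It is not necessarily the exact program used for this article: the random-number seed is arbitrary, and the data set and variable names (Grains, Diameter, and Site) are taken from the PROC QUANTREG call shown later.

data Grains;
call streaminit(54321);                     /* arbitrary seed */
length Site $1;
Site = 'A';                                 /* 500 pebbles from Site A ~ LN(0.64, 1) */
do i = 1 to 500;
   Diameter = round( exp(rand("Normal", 0.64, 1)), 0.1 );
   output;
end;
Site = 'B';                                 /* 300 pebbles from Site B ~ LN(0.53, 1) */
do i = 1 to 300;
   Diameter = round( exp(rand("Normal", 0.53, 1)), 0.1 );
   output;
end;
drop i;
run;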

For Site 'A', the estimates for the sample median and 84th percentile are 2.05 mm and 5.1 mm, respectively. For Site 'B', the estimates are 1.60 mm and 4.7 mm, respectively. These estimates are close to their respective parameter values.

Estimate difference in percentiles

The QUANTREG procedure in SAS makes it easy to estimate the difference in the 50th and 84th percentiles between the two groups. The syntax for the QUANTREG procedure is similar to other SAS regression procedures such as PROC GLM:

proc quantreg data=Grains;
   class Site;
   model Diameter = Site / quantile=0.5 0.84;
   estimate 'Diff in Pctl' Site 1 -1 / CL;
run;

The output from PROC QUANTREG shows estimates for the difference between the percentiles of the two groups. For the 50th percentile, the estimate of the difference is 0.4 with a 95% confidence interval of [0.09, 0.71]. Because 0 is not in the confidence interval, you can conclude that the median pebble size at Site 'A' is significantly different from the median pebble size at Site 'B'. For the 84th percentile, the estimate of the difference is 0.3 with a 95% confidence interval of [-0.57, 1.17]. Because 0 is in the interval, the difference between the 84th percentiles is not significantly different from 0.

Methods for estimating confidence intervals

The QUANTREG procedure supports several different methods for estimating a confidence interval: sparsity, rank, and resampling. The estimates in the previous section were computed by using the RANK method, which is the default for smaller data sets. You can use the CI= option on the PROC QUANTREG statement to choose among these methods and to specify options for them. The following graph summarizes the results for four combinations of methods and options. The results of the analysis do not change: The medians are significantly different but the 84th percentiles are not.
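For reference, a resampling-based analysis can be requested as follows. This is a sketch: the number of resampling replicates (NREP=) is arbitrary, and the other statements are unchanged from the earlier call.

proc quantreg data=Grains ci=resampling(nrep=5000);
   class Site;
   model Diameter = Site / quantile=0.5 0.84;
   estimate 'Diff in Pctl' Site 1 -1 / CL;
run;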

A comparison with bootstrap estimates

When you use a resampling-based estimate for the confidence interval, the interval depends on the random number seed, the algorithm used to generate random numbers, and the number of bootstrap iterations. This can make it hard to compare Monte Carlo results to the results from deterministic statistical methods. Nevertheless, I will present one result from a bootstrap analysis of the simulated data. The following table shows the bootstrap percentile estimates (used by Eaton, Moore, and MacKenzie (2019)) for the difference between percentiles.

The confidence intervals (the Lower and Upper columns of the table) are comparable to the intervals produced by PROC QUANTREG. The confidence interval for the difference between medians seems equivalent. The bootstrap interval for the 84th percentile is shifted to the right relative to the QUANTREG intervals. However, the inferences are the same: the medians are different but there is no significant difference between the 84th percentiles.

The following histogram shows the difference between the 84th percentiles for 5,000 bootstrap samples. The confidence interval is determined by the 2.5th and 97.5th percentiles of the bootstrap distribution. The computation requires only a short SAS/IML program, which runs very quickly.
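The program below is a sketch of that kind of bootstrap computation; it is not necessarily the program used for the analysis. It resamples each site with replacement, computes the 84th percentile of each resample, and uses the 2.5th and 97.5th percentiles of the bootstrap distribution of differences as the confidence interval. The random-number seed and the number of bootstrap samples are arbitrary choices.

proc iml;
use Grains;
   read all var {"Diameter"} into dA where(Site="A");   /* pebbles from Site A */
   read all var {"Diameter"} into dB where(Site="B");   /* pebbles from Site B */
close;
 
call randseed(12345);                 /* arbitrary seed */
B = 5000;                             /* number of bootstrap samples */
diff = j(B, 1, .);
do i = 1 to B;
   xA = colvec( sample(dA, nrow(dA)) );   /* resample Site A with replacement */
   xB = colvec( sample(dB, nrow(dB)) );   /* resample Site B with replacement */
   call qntl(qA, xA, 0.84);               /* 84th percentile of each resample */
   call qntl(qB, xB, 0.84);
   diff[i] = qA - qB;                     /* bootstrap difference in percentiles */
end;
call qntl(CI, diff, {0.025, 0.975});      /* percentile confidence interval */
print CI;
quit;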

Summary

Data analysts might struggle to find an easy way to compute the difference between percentiles for two (or more) groups. A recent paper by Eaton, Moore, and MacKenzie (2019) proposes one solution, which is to use resampling methods. An alternative is to use quantile regression to estimate the difference between percentiles, as shown in this article. I demonstrate the method by using simulated data of the form studied by Eaton, Moore, and MacKenzie.

You can download the SAS program that creates the analyses and graphs in this article.

The post How to estimate the difference between percentiles appeared first on The DO Loop.

May 28, 2020
 

The Kullback–Leibler divergence is a measure of dissimilarity between two probability distributions. An application in machine learning is to measure how distributions in a parametric family differ from a data distribution. This article shows that if you minimize the Kullback–Leibler divergence over a set of parameters, you can find a distribution that is similar to the data distribution. This article focuses on discrete distributions.

The Kullback–Leibler divergence between two discrete distributions

As explained in a previous article, the Kullback–Leibler (K-L) divergence between two discrete probability distributions is the sum
\(KL(f, g) = \Sigma_{x} f(x) \log( f(x) / g(x) )\)
where the sum is over the set of x values for which f(x) > 0. (The set {x | f(x) > 0} is called the support of f.) For this sum to be well defined, the distribution g must be strictly positive on the support of f.

One application of the K-L divergence is to measure the similarity between a hypothetical model distribution defined by g and an empirical distribution defined by f.
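The program in a later section loads a KLDiv module that was defined in the previous article. For completeness, a minimal sketch of such a function is shown below. It assumes that f and g are discrete densities evaluated on a common support and that g is strictly positive wherever f is positive; it is not necessarily identical to the module from the earlier post.

proc iml;
/* K-L divergence between discrete densities f (reference) and g (model),
   both evaluated on the same support */
start KLDiv(f, g);
   idx = loc(f > 0);                       /* sum only over the support of f */
   return( sum( f[idx] # log(f[idx] / g[idx]) ) );
finish;
store module=KLDiv;                        /* store so that LOAD MODULE=KLDiv works later */
quit;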

Example data for the Kullback–Leibler divergence

As an example, suppose a call center averages about 10 calls per hour. An analyst wants to investigate whether the number of calls per hour can be modeled by using a Poisson(λ=10) distribution. To test the hypothesis, the analyst records the number of calls for each hour during 100 hours of operation. The following SAS DATA step reads the data. The call to PROC SGPLOT creates a histogram that shows the distribution of the 100 counts:

data Calls;
input N @@;
label N = "Calls per hour";
datalines;
11 19 11 13 13 8 11 9 9 14 
10 13 8 15 7 9 6 12 7 13 
12 19 6 12 11 12 11 9 15 4 
7 12 12 10 10 16 18 13 13 8 
13 10 9 9 12 13 12 8 13 9 
7 9 10 9 4 10 12 5 4 12 
8 12 14 16 11 7 18 8 10 13 
12 5 11 12 16 9 11 8 11 7 
11 15 8 7 12 16 9 18 9 8 
10 7 11 12 13 15 6 10 10 7 
;
 
title "Number of Calls per Hour";
title2 "Data for 100 Hours";
proc sgplot data=Calls;
   histogram N / scale=proportion binwidth=1;
   xaxis values=(4 to 19) valueshint;
run;

The graph should really be a bar chart, but I used a histogram with BINWIDTH=1 so that the graph reveals that the value 17 does not appear in the data. Furthermore, the values 0, 1, 2, and 3 do not appear in the data. I used the SCALE=PROPORTION option to plot the data distribution on the density scale.

The call center wants to model these data by using a Poisson distribution. The traditional statistical approach is to use maximum likelihood estimation (MLE) to find the parameter, λ, in the Poisson family so that the Poisson(λ) distribution is the best fit to the data. However, let's see how using the Kullback–Leibler divergence leads to a similar result.

The Kullback–Leibler divergence between data and a Poisson distribution

Let's compute the K-L divergence between the empirical frequency distribution and a Poisson(10) distribution. The empirical distribution is the reference distribution; the Poisson(10) distribution is the model. The Poisson distribution has a nonzero probability for all x ≥ 0, but recall that the K-L divergence is computed by summing over the observed values of the empirical distribution, which is the set {4, 5, ..., 19}, excluding the value 17.

proc iml;
/* read the data, which is the reference distribution, f */
use Calls;  read all var "N" into Obs;  close;
call Tabulate(Levels, Freq, Obs);   /* find unique values and frequencies */
Proportion = Freq / nrow(Obs);      /* empirical density of frequency of calls (f) */
 
/* create the model distribution: Poisson(10) */
lambda = 10;   
poisPDF = pdf("Poisson", Levels, lambda); /* Poisson model on support(f) */
 
/* load K-L divergence module or include the definition from: 
 https://blogs.sas.com/content/iml/2020/05/26/kullback-leibler-divergence-discrete.html
*/
load module=KLDiv;
KL = KLDiv(Proportion, poisPDF); 
print KL[format=best5.];

Notice that although the Poisson distribution has infinite support, you only need to evaluate the Poisson density on the (finite) support of the empirical density.

Minimize the Kullback–Leibler divergence between data and a Poisson distribution

The previous section shows how to compute the Kullback–Leibler divergence between an empirical density and a Poisson(10) distribution. You can repeat that computation for a whole range of λ values and plot the divergence versus the Poisson parameter. The following statements compute the K-L divergence for λ on [4, 16] and plots the result. The minimum value of the K-L divergence is achieved near λ = 10.7. At that value of λ, the K-L divergence between the data and the Poisson(10.7) distribution is 0.105.

/* Plot the K-L div versus lambda for a sequence of Poisson(lambda) models */
lambda = do(4, 16, 0.1);
KL = j(1, ncol(lambda), .);
do i = 1 to ncol(lambda);
   poisPDF = pdf("Poisson", Levels, lambda[i]);
   KL[i] = KLDiv(Proportion, poisPDF); 
end;
 
title "K-L Divergence from Poisson(lambda)";
call series(lambda, KL) grid={x y} xvalues=4:16 label={'x' 'K-L Divergence'};

The graph shows the K-L divergence for a sequence of Poisson(λ) models. The Poisson(10.7) model has the smallest divergence from the data distribution, so it is the most similar to the data among the Poisson(λ) distributions that were considered. You can use a numerical optimization technique in SAS/IML if you want to find a more accurate value of the parameter that minimizes the K-L divergence.
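For example, the following statements sketch how you might use the NLPNRA subroutine to minimize the K-L divergence over λ. This is a sketch that assumes the Levels and Proportion vectors and the KLDiv module from the earlier program are still defined in the current PROC IML session.

/* sketch: minimize the K-L divergence over lambda by using Newton-Raphson */
start KLObj(lambda) global(Levels, Proportion);
   poisPDF = pdf("Poisson", Levels, lambda);
   return( KLDiv(Proportion, poisPDF) );
finish;
 
optn = {0 0};                 /* find a minimum; no iteration history */
lambda0 = 10;                 /* initial guess */
call nlpnra(rc, lambdaOpt, "KLObj", lambda0, optn);
print lambdaOpt;              /* should be close to 10.7 */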

The following graph overlays the PMF for the Poisson(10.7) distribution on the empirical distribution for the number of calls.

The minimal Kullback–Leibler divergence and the maximum likelihood estimate

You might wonder how minimizing the K-L divergence relates to the traditional MLE method for fitting a Poisson model to the data. The following call to PROC GENMOD shows that the MLE estimate is λ = 10.71:

proc genmod data=Calls;
   model N = / dist=poisson;
   output out=PoissonFit p=lambda;
run;
 
proc print data=PoissonFit(obs=1) noobs;
   var lambda;
run;

Is this a coincidence? No. It turns out that there is a connection between the K-L divergence and the negative log-likelihood. Minimizing the K-L divergence is equivalent to minimizing the negative log-likelihood, which is equivalent to maximizing the likelihood of the data under the Poisson model.
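To see why, write the empirical density as \(f(x_j) = c_j / N\), where \(c_j\) is the number of hours for which \(x_j\) calls were observed and \(N = \Sigma_j c_j\) is the total number of hours. If \(g_\lambda\) denotes the Poisson(λ) density, then
\(KL(f, g_\lambda) = \Sigma_j f(x_j) \log f(x_j) - \frac{1}{N} \Sigma_j c_j \log g_\lambda(x_j)\)
The first sum does not depend on λ. The second sum is \(1/N\) times the log-likelihood of the data under the Poisson(λ) model. Consequently, the λ that minimizes the K-L divergence is the λ that maximizes the log-likelihood, which is the maximum likelihood estimate.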

Summary

This article shows how to compute the Kullback–Leibler divergence between an empirical distribution and a Poisson distribution. The empirical distribution was the observed number of calls per hour for 100 hours in a call center. You can compute the K-L divergence for many parameter values (or use numerical optimization) to find the parameter that minimizes the K-L divergence. This parameter value corresponds to the Poisson distribution that is most similar to the data. It turns out that minimizing the K-L divergence is equivalent to maximizing the likelihood function. Although the parameter estimates are the same, the traditional MLE estimate comes with additional tools for statistical inference, such as estimates for confidence intervals and standard errors.

The post Minimizing the Kullback–Leibler divergence appeared first on The DO Loop.

May 6, 2020
 

During this coronavirus pandemic, there are many COVID-related graphs and curves in the news and on social media. The public, politicians, and pundits scrutinize each day's graphs to determine which communities are winning the fight against coronavirus.

Interspersed among these many graphs is the oft-repeated mantra, "Flatten the curve!" As people debate whether to reopen businesses and schools, you might hear some people claim, "we have flattened the curve," whereas others argue, "we have not yet flattened the curve." But what is THE curve to which people are referring? And when does a flat curve indicate success against the coronavirus pandemic?

This article discusses "flattening the curve" in the context of three graphs:

  1. A "what if" epidemiological curve.
  2. The curve of cumulative cases versus time.
  3. A smoothing curve added to a graph of new cases versus time. Spoiler alert: This is the WRONG curve to flatten! You want this curve to be decreasing, not flat.

Before you read further, I strongly encourage you to watch the two-minute video "What does 'flattening the curve' actually mean?" by the Australian Academy of Science. It might be the best-spent two minutes of your day.

Flattening the "what if" epidemiological curve

In 2007, the US Centers for Disease Control and Prevention (CDC) published a report on pre-pandemic planning guidance, which discussed how "nonpharmaceutical interventions" (NPIs) can mitigate the spread of viral infections. Nonpharmaceutical interventions include isolating infected persons, social distancing, hand hygiene, and more. The report states (p. 28), "NPIs can delay and flatten the epidemic peak.... [Emphasis added.] Delay of the epidemic peak is critically important because it allows additional time for vaccine development and antiviral production."

This is the original meaning of "flatten the curve." It has to do with the graph of new cases versus time. A graph appeared on p. 18, but the following graph is Figure 1 in the updated 2017 guidelines:

FIGURE 1. Goals of community mitigation for pandemic influenza (CDC, 2007 and 2017)

This image represents a "what if" scenario. The purple curve on the left is a hypothetical curve that represents rapid spread. It shows what might happen if the public does NOT adopt public-health measures to slow the spread of the virus. This is "the curve" that we wish to flatten. The smaller curve on the right represents a slower spread. It shows what can happen if society adopts public-health interventions. The peak of the second curve has been delayed (moved to a later date) and reduced (fewer cases). This second curve is the "flattened curve."

Because the curve that we are trying to flatten (the rapid-spread curve) is unobservable, how can we measure success? One measure of success is if the number of daily cases remains less than the capacity of the healthcare system. Another is that the total number of affected persons is less than predicted under the no-intervention model.

If the public adopts measures that mitigate the spread, the observed new-cases-versus-time curve should look more like the slower-spread curve on the right and less like the tall rapid-spread curve on the left. Interestingly, I have heard people complain that the actual numbers of infections, hospitalizations, and deaths due to COVID-19 are lower than initially projected. They argue that these low numbers are evidence that the mathematical models were wrong. The correct conclusion is that the lower numbers are evidence that social distancing and other NPIs helped to slow the spread.

Flattening the curve of cumulative cases versus time

The previous section discusses the new-cases-by-time graph. Another common graph in the media is the cumulative number of confirmed cases. I have written about the relationship between the new-cases-by-time graph and the cumulative-cases-by-time graph. In brief, the slope of the cumulative-cases-by-time graph equals the height of the new-cases-by-time graph.

For the rapid-spread curve, the associated cumulative curve is very steep and then levels out at a large number (the total number of cases). For the slower-spread curve, the associated cumulative curve is not very steep. It climbs gradually and eventually levels out at a smaller number of total cases.

When you read that some country (such as South Korea, Australia, or New Zealand) has flattened the curve, the curve that usually accompanies the article is the cumulative-cases-by-time graph. A cumulative curve that increases slowly means that the rate of new cases is small. A cumulative curve that has zero slope means that no new cases are being confirmed. This is definitely a good thing!

A hypothetical scenario is shown below. The upper graph shows the cumulative cases. The lower graph shows the new cases on the same time scale. After Day 60, there are very few new cases. Accordingly, the curve of cumulative cases is flat.

Thus, a cumulative curve that increases slowly and then levels out is analogous to the slower-the-spread epidemiological curve. When the cumulative curve becomes flat, it indicates that new cases are zero or nearly zero.

The wrong curve: Flattening new cases versus time

Unfortunately, not every curve that is flat indicates that conditions are improving. The primary example is the graph of new cases versus time. A constant rate of new cases is not good—although it is certainly better than an increasing rate. To defeat the pandemic, the number of new cases each day should be decreasing. For a new-cases-by-time curve, a downward-sloping curve is the goal, not a flat curve.

The following graph shows new cases by day in the US (downloaded from the New York Times data on 03May2020). I have added a seven-day rolling average to show a smoothed version of the data:

The seven-day average is no longer increasing. It has leveled out. However, this is NOT an example of a "flattened" curve. This graph shows that the average number of new cases was approximately constant (or perhaps slightly declining) for the last three weeks of April. The fact that the curve is approximately horizontal indicates that approximately 28,000 new cases of coronavirus are being confirmed every day.

The cumulative graph for the same US data is shown below. The slope of the cumulative curve in April 2020 is approximately 28,000 cases per day. This cumulative curve is not flattening, but healthcare workers and public-health officials are all working to flatten it.

Sometimes graphs do not have well-labeled axes, so be sure to identify which graph you are viewing. For a curve of cumulative cases, flatness is "good": there are very few new cases. For a curve of new cases, a downward-sloping curve is desirable.

Summary

In summary, this article discusses "flattening the curve," which is often used without specifying what "the curve" actually is. In the classic epidemiological model, "the curve" is the hypothetical number of new cases versus time under the assumption that society does not adopt public-health measures to slow the spread. A flattened curve refers to the effect of interventions that delay and reduce the peak number of cases.

You cannot observe a hypothetical curve, but you can use the curve of cumulative cases to assess the spread of the disease. A cumulative curve that increases slowly and flattens out indicates a community that has slowed the spread. So, conveniently, you can apply the phrase "flatten the curve" to the curve of cumulative cases.

Note that you do not want the graph of new cases versus time to be flat. You want that curve to decrease towards zero.

You can download the SAS program used to create graphs in this article.



The post What does 'flatten the curve' mean? To which curve does it apply? appeared first on The DO Loop.


May 4, 2020
 

SAS programmers sometimes ask about ways to perform one-dimensional linear interpolation in SAS. This article shows three ways to perform linear interpolation in SAS: PROC IML (in SAS/IML software), PROC EXPAND (in SAS/ETS software), and PROC TRANSREG (in SAS/STAT software). Of these, PROC IML is the simplest to use and has the most flexibility. This article shows how to implement an efficient 1-D linear interpolation algorithm in SAS. You can download the SAS program that creates the analyses and graphs in this article.

Linear interpolation assumptions

For one-dimensional linear interpolation of points in the plane, you need two sets of numbers:

  1. Data: Let (x1, y1), (x2, y2), ..., (xn, yn) be a set of n data points. The data should not contain any missing values. The data must be ordered so that x1 < x2 < ... < xn. These values uniquely define the linear interpolation function on [x1, xn]. I call this the "sample data" or "fitting data" because it is used to create the linear interpolation model.
  2. Values to score: Let {t1, t2, ..., tk} be a set of k new values for the X variable. For interpolation, all values must be within the range of the data: x1 ≤ ti ≤ xn for all i. The goal of interpolation is to produce a new Y value for each value of ti. The scoring data is also called the "query data."

Interpolation requires a model. For linear interpolation, the model is the unique piecewise linear function that passes through each sample point and is linear on each interval [xi, xi+1]. The model is usually undefined outside of the range of the data, although there are various (nonunique) ways to extrapolate the model beyond the range of the data. You fit the model by using the data. You then score the model on the set of new values.

The following SAS data sets define example data for linear interpolation. The POINTS data set contains fitting data that define the linear model. The SCORE data set contains the new query points at which we want to interpolate. The linear interpolation is shown to the right. The sample data are shown as blue markers. The model is shown as blue lines. The query values to score are shown as a fringe plot along the X axis. The interpolated values are shown as red markers.

The data used for this example are:

/* Example data for 1-D interpolation */
data Points;  /* these points define the model */
input x y;
datalines;
0  1
1  3
4  5
5  4
7  6
8  3
10 3
;
 
data Score; /* these points are to be interpolated */
input t @@;
datalines;
2 -1 4.8 0 0.5 1 9 5.3 7.1 10.5 9
;

For convenience, the fitting data are already sorted by the X variable, which is in the range [0, 10]. The scoring data set does not need to be sorted.

The scoring data for this example contains five special values:

  • Two scoring values (-1 and 10.5) are outside of the range of the data. An interpolation algorithm should return a missing value for these values. (Otherwise, it is extrapolation.)
  • Two scoring values (0 and 1) are duplicates of X values in the data. Ideally, this should not present a problem for the interpolation algorithm.
  • The value 9 appears twice in the scoring data.

Linear interpolation in SAS by using PROC IML

As is often the case, PROC IML enables you to implement a custom algorithm in only a few lines of code. For simplicity, suppose you have a single value, t, that you want to interpolate, based on the data (x1, y1), (x2, y2), ..., (xn, yn). The main steps for linear interpolation are:

  1. Check that the X values are nonmissing and in increasing order: x1 < x2 < ... < xn. Check that t is in the range [x1, xn]. If not, return a missing value.
  2. Find the first interval that contains t. You can use the BIN function in SAS/IML to find the first value i for which x_i <= t <= x_{i+1}.
  3. Define the left and right endpoint of the interval: xL = x_i and xR = x_{i+1}. Define the corresponding response values: yL = y_i and yR = y_{i+1}.
  4. Let f = (t - xL) / (xR - xL) be the proportion of the interval to the left of t. Then p = (1 - f)*yL + f*yR is the linear interpolation at t. (A small worked example follows this list.)
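For the sample data above, consider the query value t = 2. It lies in the interval [1, 4], so xL = 1, yL = 3, xR = 4, and yR = 5. The fraction is f = (2 - 1) / (4 - 1) = 1/3, so the interpolated value is p = (2/3)*3 + (1/3)*5 = 11/3 ≈ 3.67.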

The steps are implemented in the following SAS/IML function. The function accepts a vector of scoring values, t. Notice that the program does not contain any loops over the elements of t. All statements and operations are vectorized, which is very efficient.

/* Linear interpolation based on the values (x1,y1), (x2,y2),....
   The X  values must be nonmissing and in increasing order: x1 < x2 < ... < xn
   The values of the t vector are linearly interpolated.
*/
proc iml;
start LinInterp(x, y, _t);
   d = dif(x, 1, 1);                     /* check that x[i+1] > x[i] */
   if any(d<=0) then stop "ERROR: x values must be nonmissing and strictly increasing.";
   idx = loc(_t>=min(x) && _t<=max(x));  /* check for valid scoring values */
   if ncol(idx)=0 then stop "ERROR: No values of t are inside the range of x.";
 
   p = j(nrow(_t)*ncol(_t), 1, .);     /* allocate output (prediction) vector */
   t = _t[idx];                        /* subset t values inside range(x) */
   k = bin(t, x);                      /* find interval [x_i, x_{i+1}] that contains t */
   xL = x[k];   yL = y[k];             /* find (xL, yL) and (xR, yR) */
   xR = x[k+1]; yR = y[k+1];
   f = (t - xL) / (xR - xL);           /* f = fraction of interval [xL, xR] */
   p[idx] = (1 - f)#yL + f#yR;        /* interpolate between yL and yR */
   return( p );
finish;
 
/* example of linear interpolation in SAS */
use Points; read all var {'x' 'y'}; close;
use Score; read all var 't'; close;
 
pred = LinInterp(x, y, t);
create PRED var {'t' 'pred'}; append; close;
QUIT;

Visualize a linear interpolation in SAS

The previous program writes the interpolated values to the PRED data set. You can concatenate the original data and the interpolated values to visualize the linear interpolation:

/* Visualize: concatenate data and predicted (interpolated) values */
data All;
set Points Pred;
run;
 
title "Linear Interpolation";
title2 "No Extrapolation";
proc sgplot data=All noautolegend;
   series x=x y=y;
   scatter x=x y=y / markerattrs=(symbol=CircleFilled size=12)
                    name="data" legendlabel="Data";
   scatter x=t y=Pred / markerattrs=(symbol=asterisk size=12 color=red)
                    name="interp" legendlabel="Interpolated Values";
   fringe t / lineattrs=(color=red thickness=2)
                    name="score" legendlabel="Values to Score";
   xaxis grid values=(0 to 10) valueshint label="X";
   yaxis grid label="Y" offsetmin=0.05;
   keylegend "data" "score" "interp";
run;

The graph is shown at the top of this article. A few noteworthy items:

  • The values -1 and 10.5 are not scored because they are outside the range of the data.
  • The values 0 and 1 each coincide with an X value in the data, and the interpolated value equals the corresponding data value.
  • The other values are interpolated onto the straight line segments that connect the data.

Performance of the IML algorithm

The IML algorithm is very fast. On my Windows PC (Pentium i7), the interpolation takes only 0.2 seconds for 1,000 data points and one million scoring values. For 10,000 data points and one million scoring values, the interpolation takes about 0.25 seconds. The SAS program that accompanies this article includes timing code.
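The following statements are only a sketch of how such a timing test might be set up; they are not the author's timing code. The sizes and the random data are assumptions, and the statements assume that the LinInterp module is defined earlier in the same PROC IML session:

/* sketch of a possible timing test (assumes LinInterp is defined in this session) */
call randseed(1);
n = 1000;                                  /* number of fitting points */
k = 1e6;                                   /* number of scoring values */
x = T(1:n);                                /* strictly increasing X values */
y = j(n, 1);   call randgen(y, "Normal");  /* arbitrary responses */
t = j(k, 1);   call randgen(t, "Uniform"); /* uniform values in (0,1) */
t = 1 + (n-1)*t;                           /* rescale scoring values into [1, n] */
t0 = time();
pred = LinInterp(x, y, t);
print "Elapsed seconds:" (time() - t0);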

Other SAS procedures that can perform linear interpolation

According to a SAS Usage Note, you can perform linear interpolation in SAS by using PROC EXPAND in SAS/ETS software or PROC TRANSREG in SAS/STAT software. Each has some limitations that I don't like:

  • Both procedures use the "missing value trick" to perform the fitting and scoring in a single call: you append the query points (with missing Y values) to the sample data. That means you must concatenate the sample data (which is often small) and the query data (which can be large).
  • PROC EXPAND requires that the combined data be sorted by X. That can be easily accomplished by calling PROC SORT after you concatenate the sample and query data. However, PROC EXPAND does not support duplicate X values! For me, this makes PROC EXPAND unusable. It means that you cannot score the model at points that are in the original data, nor can you have repeated values in the scoring data.
  • If you use PROC TRANSREG for linear interpolation, you must know the number of sample data points, n. You must specify n – 2 in the NKNOTS= option of a SPLINE transformation. Usually, this means that you must perform an extra step (DATA step or PROC MEANS) and store n – 2 in a macro variable.
  • For scoring values outside the range of the data, PROC EXPAND returns a missing value. However, PROC TRANSREG extrapolates. If t < x1, then the extrapolated value at t is y1. Similarly, if t > xn, then the extrapolated value at t is yn.

I've included examples of using PROC EXPAND and PROC TRANSREG in the SAS program that accompanies this article. You can use these procedures for linear interpolation, but neither is as convenient as PROC IML.
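For reference, the following statements are a minimal sketch of the PROC EXPAND approach. This is not necessarily the code in the accompanying program. The sketch assumes that the scoring variable is renamed to X so that the two data sets can be concatenated, and it works only if the combined data contain no duplicate X values (which is not true for the example SCORE data).

/* Sketch only: linear interpolation with PROC EXPAND (SAS/ETS) via the missing value trick */
data Combined;
   set Points Score(rename=(t=x));     /* Y is missing for the query rows */
run;
proc sort data=Combined;  by x;  run;  /* PROC EXPAND requires sorted (and distinct) X values */
 
proc expand data=Combined out=ExpandOut;
   id x;
   convert y = InterpY / method=join;  /* METHOD=JOIN requests piecewise linear interpolation */
run;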

With effort, you can use Base SAS routines such as the DATA step to implement a linear interpolation algorithm. An example is provided by KSharp in a SAS Support Community thread.
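If you prefer Base SAS, here is one possible sketch (it is not KSharp's program). It assumes that the POINTS data are sorted by X, contain no missing values, and have at most 100 observations; query values outside the range of the data receive a missing result.

/* Sketch only: DATA step linear interpolation. Assumes POINTS is sorted by X,
   has no missing values, and has at most 100 observations. */
data PredDS(keep=t pred);
   array xx[100] _temporary_;          /* temporary arrays are retained automatically */
   array yy[100] _temporary_;
   retain n 0;
   if _N_ = 1 then do until(eof);      /* load the fitting points once */
      set Points end=eof;
      n = n + 1;
      xx[n] = x;  yy[n] = y;
   end;
   set Score;                          /* one query value per observation */
   pred = .;
   if xx[1] <= t <= xx[n] then do;
      i = 1;
      do while(t > xx[i+1]);           /* find interval [xx[i], xx[i+1]] that contains t */
         i = i + 1;
      end;
      f = (t - xx[i]) / (xx[i+1] - xx[i]);
      pred = (1-f)*yy[i] + f*yy[i+1];
   end;
run;

Like the SAS/IML function, this step handles duplicate query values and returns missing values for the out-of-range values -1 and 10.5.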

Summary

If you want to perform linear interpolation in SAS, the easiest and most efficient method is PROC IML. I have provided a SAS/IML function that implements linear interpolation by using two input vectors (x and y) to define the model and one input vector (t) to specify the points at which to interpolate. The function returns the interpolated values for t.

The post Linear interpolation in SAS appeared first on The DO Loop.

4月 222020
 

A previous article describes the funnel plot (Spiegelhalter, 2005), which can identify samples whose rates or proportions are much different from what you would expect. The funnel plot is a scatter plot that plots the sample proportion of some quantity against the size of the sample. The variance of the sample proportion is inversely proportional to the sample size, so the limits that indicate the expected variability narrow as the sample size increases, which gives the plot its funnel shape. An example of a funnel plot is shown to the right.

A funnel plot can visualize any rate, but this article visualizes the case fatality rate for COVID-19 deaths for counties in the US. The case fatality rate is the number of deaths due to COVID-19 divided by the number of confirmed cases. (For more information about case fatality rates, see "Understanding COVID-19 data: Case fatality rate vs. mortality rate vs. risk of dying.")

The graphs are sobering: Each marker represents dozens, hundreds, or even thousands of people who have died because of the coronavirus pandemic. A goal of this visualization is to identify counties that have extreme case-fatality rates so that public health officials can react swiftly and save lives.

The range of coronavirus cases for US counties

The following table shows the distribution of the number of confirmed coronavirus cases among the 2,717 counties that have reported at least one case by 15Apr2020.

The table shows the percentiles for the number of cases in each county that has at least one case:

  • 10% of the counties report at most 1 case.
  • 25% of the counties report at most 4 cases.
  • Half of the counties report at most 14 cases.
  • 75% of the counties report at most 55 cases.

The previous table shows that most (75%) of the US counties have only a few dozen confirmed cases. However, remember that testing is not widespread and many people who contract the virus and recover at home are never counted in the official statistics.
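Percentile summaries like the one above can be computed with PROC MEANS. The following statements are only a sketch; the data set and variable names (CountyCases and Cases) are hypothetical placeholders for a data set that has one row per county.

/* Sketch only: percentiles of cases per county. CountyCases and Cases are hypothetical names. */
proc means data=CountyCases N P10 P25 MEDIAN P75 P90 MAX;
   var Cases;
run;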

Many news stories focus on the hardest-hit communities. The following table shows the US counties that have reported more than 10,000 cases by 15Apr2020:

As expected, most cases occur in the most populous communities. In addition to New York City and its neighboring counties (such as Bergen, NJ, across the Hudson River from Manhattan), the list includes counties that contain the cities of Chicago, Detroit, and Los Angeles. The following sections create two different funnel plots. The first includes the most populous counties. The second highlights the majority of US counties, which have fewer cases per county.

A funnel plot for all US counties

In general, it is difficult to visualize data that range over many orders of magnitude. If you use a linear axis, the largest data tend to stretch out the axes on a plot, pushing the smaller values into a tiny slice of the graph. This is seen in the following funnel plot, which plots the case fatality rate versus the number of confirmed cases for US counties on 15Apr2020. The reference line represents the overall case fatality rate in the US, which is 4.5%. The curves indicate the usual range of variability for an estimate as a function of the sample size, assuming that each county is experiencing deaths at the overall rate.
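The previous article about funnel plots gives the details of the curves, but a common construction uses the normal approximation to a binomial proportion: for a county with \(n\) confirmed cases, the limits have the form \(p_0 \pm z \sqrt{p_0 (1 - p_0)/n}\), where \(p_0 = 0.045\) is the overall rate and \(z\) determines the coverage. For example, with \(z \approx 1.96\) (95% limits), a county that has 1,000 cases has limits of roughly 3.2% to 5.8%.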

The top counties are labeled, but New York dominates the graph, and most of the markers are squashed into the left side of the plot. You can see that New York not only has the most cases, but also a case fatality rate that is very high relative to the national average. Because of the extreme number of cases in New York, it is difficult to see the rest of the data or the funnel-shaped region.

There are two ways to modify the graph. The first is to use a nonlinear scale on the horizontal axis (typically a logarithmic scale, but a square root transformation is another option). The second is to truncate the graph at some value such as 5,000 cases. Essentially, this is "zooming in" on part of the previous graph.

A logarithmic-scale funnel plot

The following plot shows exactly the same data, but the horizontal axis now uses a logarithmic scale.
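In PROC SGPLOT, you can request a logarithmic axis with the TYPE=LOG option on the XAXIS statement. The following statements are only a sketch, not the program that created the graphs in this article; the data set and variable names (CountyRates, Cases, Rate, Lower, Upper) are hypothetical, and the Lower and Upper variables are assumed to contain precomputed funnel limits.

/* Sketch only: a log-scale funnel plot. CountyRates, Cases, Rate, Lower, and Upper are hypothetical names. */
proc sgplot data=CountyRates noautolegend;
   band x=Cases lower=Lower upper=Upper / transparency=0.5;  /* funnel region */
   refline 0.045 / axis=y;                                   /* overall US case fatality rate */
   scatter x=Cases y=Rate / transparency=0.5;
   xaxis type=log logbase=10 grid label="Number of Confirmed Cases";
   yaxis grid label="Case Fatality Rate";
run;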

In this graph, you can clearly see the funnel-shaped region. Markers outside the region have higher- or lower-than-expected rates. If you create the graph in SAS, you can hover the pointer over a marker in any of these plots to reveal the name of the county and additional details. An example is shown for Franklin County, MA. You might notice that markers in the lower-left corner appear to fall along curves. These curves correspond to counties that have experienced zero deaths, one death, two deaths, and so forth.

Zoom in on a funnel plot

Another alternative is to zoom in on a region of interest. For example, the following plot shows only counties that have between 20 and 5,000 cases.

A few counties are labeled because their rates are much higher or much lower than expected:

  • King County (WA): This county was the original epicenter in the US. Many residents died at a nursing home in Kirkland, WA. To date, the county has experienced 314 deaths and about 3,700 confirmed cases.
  • Macomb County (MI): Along with Wayne County, Macomb County has been a hot spot as the Detroit area struggles to contain the spread of coronavirus.
  • Hartford County (CT): Connecticut is adjacent to New York and has experienced more than 1,000 COVID-19 deaths. Hartford County is one of three hard-hit counties in Connecticut.
  • Harris County (TX): The home of Houston reports half as many COVID-related deaths as other comparable cities. I do not know why this rate is so low, but it demonstrates that the funnel plot can reveal lower-than-expected rates.

Summary

Estimates that are based on small samples are highly variable. A funnel plot is a good way to visualize many estimates that are based on samples of different sizes. The funnel plot includes curves that indicate an acceptable range of variability for each sample size. If a sample rate is far outside the region, the sample can be examined more closely to understand why the rate is extreme.

The funnel plots in this article compare the case fatality rates due to COVID-19 for thousands of US counties on 15Apr2020. The funnel plots reveal counties whose rates are higher or lower than expected.

It is worth mentioning that you can use a funnel plot for any rate. Other coronavirus rates include the rate of hospitalization, the proportion of confirmed cases that are on ventilators, the rate of recovery, and so forth.

You can download the SAS program that creates the funnel plots in this article.


The post Visualize the case fatality rate for COVID-19 in US counties appeared first on The DO Loop.