data analysis

9月 102020
 

I previously wrote about the RAS algorithm, which is a simple algorithm that performs matrix balancing. Matrix balancing refers to adjusting the cells of a frequency table to match known values of the row and column sums. Ideally, the balanced matrix will reflect the structural relationships in the original matrix. An alternative method is the iterative proportional fitting (IPF) algorithm, which is implemented in the IPF subroutine in SAS/IML. The IPF method can balance n-way tables, n ≥ 2.

The IPF function is a statistical modeling method. It computes maximum likelihood estimates for a hierarchical log-linear model of the counts as a function of the categorical variables. It is more complicated to use than the RAS algorithm, but it is also more powerful.

This article shows how to use the IPF subroutine for the two-way frequency table from my previous article. The data are for six girl scouts who sold seven types of cookies. We know the row totals for the Type variable (the boxes sold for each cookie type) and the column totals for the Person variable (the boxes sold by each girl). But we do not know the values of the individual cells, which are the Type-by-Person totals. A matrix balancing algorithm takes any guess for the table and maps it into a table that has the observed row and column sums.

Construct a table that has the observed marginal totals

As mentioned in the previous article, there are an infinite number of matrices that have the observed row and column totals. Before we discuss the IPF algorithm, I want to point out that it is easy to compute one particular balanced matrix. The special matrix is the one that assumes no association (independence) between the Type and Person variables. That matrix is formed by the the outer product of the marginal totals:
     B[i,j] = (sum of row i)(sum of col j) / (sum of all counts)

It is simple to compute the "no-association" matrix in the SAS/IML language:

proc iml;
/* Known quantities: row totals and column totals */
u = {260, 214, 178, 148, 75, 67, 59}; /* total boxes sold, by type */
v = {272 180 152 163 134 100};        /* total boxes sold, by girl */
 
/* Compute the INDEPENDENT table, which is formed by the outer product
   of the marginal totals:
   B[i,j] = (sum of row i)(sum of col j) / (sum of all counts) */
B = u*v / sum(u);
 
Type   = 'Cookie1':'Cookie7';                  /* row labels    */
Person = 'Girl1':'Girl6';                      /* column labels */
print B[r=Type c=Person format=7.1 L='Independent Table'];

For this balanced matrix, there is no association between the row and column variables (by design). This matrix is the "expected count" matrix that is used in a chi-square test for independence between the Type and Person variables.

Because the cookie types are listed in order from most popular to least popular, the values in the balanced matrix (B) decrease down each column. If you order the girls according to how many boxes they sold (you need to exchange the third and fourth columns), then the values in the balanced matrix would decrease across each row.

Use the IPF routine to balance a matrix

The documentation for the IPF subroutine states that you need to specify the following input arguments. For clarity, I will only discuss two-way r x c tables.

  • dim: The vector of dimensions of the table: dim = {r c};
  • table: This r x c matrix is used to specify the row and column totals. You can specify any matrix; only the marginal totals of the matrix are used. For example, for a two-way table, you can use the "no association" matrix from the previous section.
  • config is an input matrix whose columns specify which marginal totals to use to fit the counts. Because the model is hierarchical, all subsets of specified marginals are included in the model. For a two-way table, the most common choices are the main effect models such as config={1} (model count by first variable only), config={2} (model count by second variable only), and config={1 2} (model count by using both variables). It is possible to fit the "fully saturated model" (config={1, 2}), but that model is overparameterized and results in a perfect fit, if the IPF algorithm converges.
  • initTable (optional): The elements of this matrix will be adjusted to match the observed marginal totals. In practice, this matrix is often based on a survey and you want to adjust the counts by using known marginal totals from a recent census of the population. Or it might be an educated guess, as in the Girl Scout cookie example. If you do not specify this matrix, a matrix of all 1s is used, which implies that you think all cell entries are approximately equal.

The following SAS/IML program defines the parameters for the IPF call.

/* Specify structure between rows and columns by asking each girl 
   to estimate how many boxes of each type she sold. This is the 
   matrix whose cells will be adjusted to match the marginal totals. */
/*   G1 G2 G3 G4 G5 G6          */
A = {75 45 40 40 40 30 ,  /* C1 */
     40 35 45 35 30 30 ,  /* C2 */
     40 25 30 40 30 20 ,  /* C3 */
     40 25 25 20 20 20 ,  /* C4 */
     30 25  0 10 10  0 ,  /* C5 */
     20 10 10 10 10  0 ,  /* C6 */
     20 10  0 10  0  0 }; /* C7 */
 
Dim = ncol(A) || nrow(A);  /* dimensions of r x c table */
Config = {1 2};            /* model main effects: counts depend on Person and Type */
call IPF(FitA, Status,     /* output values */
         Dim, B,           /* use the "no association" matrix to specify marginals */
         Config, A);       /* use best-guess matrix to specify structure */
print FitA[r=Type c=Person format=7.2],     /* the IPF call returns the FitA matrix and Status vector */
      Status[c={'rc' 'Max Diff' 'Iters'}];

The IPF call returns two quantities. The first output is the balanced matrix. For this example, the balanced matrix (FitA) is identical to the result from the RAS algorithm. It has a similar structure to the "best guess" matrix (A). For example, and notice that the cells in the original matrix that are zero are also zero in the balanced matrix.

The second output is the Status vector. The status vector provides information about the convergence of the IPF algorithm:

  • If the first element is zero, the IPF algorithm converged.
  • The second element gives the maximum difference between the marginal totals of the balanced matrix (FitA) and the observed marginal totals.
  • The third element gives the number of iterations of the IPF algorithm.

Other models

The IPF subroutine is more powerful than the RAS algorithm because you can fit n-way tables and because you can model the table in several natural ways. For example, the model in the previous section assumes that the counts depend on the salesgirl and on the cookie type. You can also model the situation where the counts depend only on one of the factors and not on the other:

/* counts depend only on Person, not on cookie Type */
Config = {1};   
call IPF(FitA1, Status, Dim, B, Config, A);
print FitA1[r=Type c=Person format=7.2], Status;
 
/* counts depend only on cookie Type, not on Person */
Config = {2};   
call IPF(FitA2, Status, Dim, B, Config, A);
print FitA2[r=Type c=Person format=7.2], Status;

Summary

In summary, the SAS/IML language provides a built-in function (CALL IPF) for matrix balancing of frequency tables. The routine incorporates statistical modeling and you can specify which variables to use to model the cell counts. The documentation several examples that demonstrate using the IPF algorthm to balance n-way tables, n ≥ 2.

The post Iterative proportional fitting in SAS appeared first on The DO Loop.

8月 102020
 

A common operation in statistical data analysis is to center and scale a numerical variable. This operation is conceptually easy: you subtract the mean of the variable and divide by the variable's standard deviation. Recently, I wanted to perform a slight variation of the usual standardization:

  • Perform a different standardization for each level of a grouping variable.
  • Instead of using the sample mean and sample standard deviation, I wanted to center and scale the data by using values that I specify.

Although you can write SAS DATA step code to center and scale the data, PROC STDIZE in SAS contains options that handle both these cases.

Simulate data from two groups

Let's start by creating some example data. The following DATA step simulates samples drawn from two groups. The distribution of the first group is N(mu1, sigma) and the distribution of the second group is N(mu2, sigma). The first sample size is n1=0; the second size is n2=30. Notice that the data are ordered by the Group variable.

%let mu1 = 15;
%let mu2 = 16;
%let sigma = 5;
/* Create two samples N(mu1, sigma) and N(mu2, sigma), with samples sizes n1 and n2, respectively */
data TwoSample;
call streaminit(54321);
n1 = 20; n2 = 30;
Group = 1;
do i = 1 to n1;
   x = round(rand('Normal', &mu1, &sigma), 0.01);  output;
end;
Group = 2;
do i = 1 to n2;
   x = round(rand('Normal', &mu2, &sigma), 0.01);  output;
end;
keep Group x;
run;
 
ods graphics / width=500px height=360px;
title "Normal Data";
title2 "N(&mu1, &sigma) and N(&mu2, &sigma)";
proc sgpanel data=TwoSample noautolegend;
   panelby Group / columns=1;
   histogram x;
   density x / type=normal;
   colaxis grid;
run;

The graph shows a comparative histogram of the distribution of each group and overlays the density of the normal curve that best fits the data. You can run PROC MEANS to calculate that the estimates for the first group are mean=14.9 and StdDev=5.64. For the second group, the estimates are mean=15.15 and StdDev=4.27.

The "standard" standardization

Typically, you standardize data by using the sample mean and the sample standard deviation. You can do this by using PROC STDIZE and specify the METHOD=STD method (which is the default method). You can use the BY statement to apply the standardization separately to each group. The following statements standardize the data in each group by using the sample statistics:

/* use PROC STDIZE to standardize by the sample means and sample std dev */
proc stdize data=TwoSample out=ScaleSample method=std;
   by Group;
   var x;
run;
 
proc means data=ScaleSample Mean StdDev Min Max ndec=2; 
   class Group;
   var x;
run;

As expected, each sample now has mean=0 and unit standard deviation. Many multivariate analyses center and scale data in order to adjust for the fact that different variables are measured on different scales. PROC STDIZE has many other options for standardizing data.

Center and scale by using the DATA step

A second way to standardize the data is to use the DATA step to center and scale each variable and each group. You can overwrite the contents of each column, or (as I've done below), you can create a new variable that contains the standardized values. You need to specify values for the location and scale parameters. For these simulated data, I know the population values of the location and scale parameters. The following DATA step uses the parameters to center and scale the data:

/* because we know the population parameters, we can use them to center and scale the data */
/* DATA step method */
data ScaleData;
array mu[2] (&mu1 &mu2);    /* put parameter values into an array */
set TwoSample;
y = (x-mu[Group]) / σ
run;
 
proc means data=ScaleData Mean StdDev Min Max ndec=2;
   class Group;
   var y;
run;

Because I use the population parameters instead of the sample estimates, the transformed variable does not have zero mean nor unit variance.

Wide form: Use the LOCATION and SCALE statements in PROC STDIZE

You can perform the same calculations by using PROC STDIZE. If you look up the documentation for the METHOD= option on the PROC STDIZE statement, you will see that the procedure supports the METHOD=IN(ds) method, which enables you to read parameters from a data set. I will focus on the LOCATION and SCALE parameters; see the documentation for other options.

You can specify the parameters in "wide form" or in "long form". In wide form, each column specifies a parameter and you can include multiple rows if you want to perform a BY-group analysis. You can use the LOCATION and SCALE statements to specify the name of the variables that contain the parameters. The following DATA step creates a data set that has three variables: Group, Mu, and Sigma. Each row specifies the location and scale parameter for centering and scaling data in the levels of the Group variable.

data ScaleWide;
Group=1; Mu=&mu1; Sigma=σ output;
Group=2; Mu=&mu2; Sigma=σ output;
run;
 
proc stdize data=TwoSample out=StdWide method=in(ScaleWide);
   by Group;
   var x;
   location mu;    /* column name METHOD=IN data set */
   scale    sigma; /* column name METHOD=IN data set */
run;
 
proc means data=StdWide Mean StdDev Min Max ndec=2; 
   class Group;
   var x;
run;

The output is identical to the output from the previous section and is not shown.

Long form: Use the _TYPE_ variable

You can also specify the location and scale parameters in long form. For a long-form specification, you need to create a data set that contains a variable named _TYPE_. The parameters for centering and scaling are specified in rows where _TYPE_="LOCATION" and _TYPE_="SCALE", respectively. You can also include one or more BY-group variables (if you are using the BY statement) and one or more columns whose names are the same as the variables you intend to transform.

For example, the example data in this article has a BY-group variable named Group and an analysis variable named X. The following DATA step specifies the values to use to center (_TYPE_='LOCATION') and scale (_TYPE_='SCALE') the X variable for each group:

data ScaleParam;
length _TYPE_ $14;
input Group _TYPE_ x;
datalines;
1 LOCATION 15
1 SCALE     5
2 LOCATION 16
2 SCALE     5
;
 
proc stdize data=TwoSample out=Scale2 method=in(ScaleParam);
   by Group;
   var x;
run;
 
proc means data=Scale2; 
   class Group;
   var x;
run;

Again, the result (not shown) is the same as when the DATA step is used to center and scale the data.

Summary

This article gives examples of using PROC STDIZE in SAS/STAT to center and scale data. Most analysts want to standardize data in groups by using the sample means and sample standard deviations of each group. This is the default behavior of PROC STDIZE. However, you can also center and scale data by using values that are different from the group statistics. For example, you might want to use a grand mean and a pooled standard deviation for the groups. Or, perhaps you are using simulated data and you know the population mean and variance of each group. Regardless, there are three ways to center and scale the data when you know the location and scale parameters:

  • Manually: You can use the DATA step or PROC IML or PROC SQL to write a formula that centers and scales the data.
  • Wide Form: PROC STDIZE supports the LOCATION and SCALE statements. If you use the METHOD=IN(ds) option on the PROC STDIZE statement, you can read the parameters from a data set.
  • Long Form: If the input data set contains a variable named _TYPE_, PROC STDIZE will use the parameter values where _TYPE_='LOCATION' to center the data and where _TYPE_='SCALE' to scale the data.

The post 4 ways to standardize data in SAS appeared first on The DO Loop.

7月 202020
 

I recently showed how to compute within-group multivariate statistics by using the SAS/IML language. However, a principal of good software design is to encapsulate functionality and write self-contained functions that compute and return the results. What is the best way to return multiple statistics from a SAS/IML module?

A convenient way to return several statistics is to use lists. Lists were introduced in SAS/IML 14.2 and extended in SAS/IML 14.3. I have previously written about how to use lists to pass arguments into a SAS/IML module. For parallel processing in the iml action in SAS Viya, lists are an essential data structure.

This article shows two things. First, it shows how to use the UNIQUE-LOC technique to compute multivariate statistics for each group in the data. It then shows two ways to return those statistics by using SAS/IML lists. Depending on your application, one way might be more convenient to use than the other.

Returning matrices versus lists

Suppose you want to write a SAS/IML function that computes several statistics for each group in the data. If you are analyzing one variable (univariate data), the statistics are scalar. Therefore, the module can return a matrix in which the rows of the matrix represent the groups in the data and the columns represent the statistics, such as mean, median, variance, skewness, and so forth. This form of the output matches the output from PROC MEANS when you use a CLASS statement to specify the group variable.

For multivariate data, the situation becomes more complex. For example, suppose that for each group you want to compute the number of observations, the mean vector, and the covariance matrix. Although it is possible to pack all this information into a matrix, it requires a lot of bookkeeping to keep track of which row contains which statistics for which group. A simpler method is to pack the multivariate statistics into a list and return the list.

Multivariate statistics for the iris data

Fisher's iris data is often used to demonstrate ideas in multivariate statistics. The following call to PROC SGPLOT creates a scatter plot for two variables (PetalWidth and SepalLength) and uses the GROUP= option to color the observations by the levels of the Species variable. The plot also overlays 95% prediction ellipses for each group.

title "Iris Data, Colored by Species";
title2 "95% Prediction Ellipse, assuming MVN";
proc sgplot data=Sashelp.Iris;
   ellipse x=PetalWidth y=SepalWidth / group=Species outline fill transparency=0.6;
   scatter x=PetalWidth y=SepalWidth / group=Species transparency=0.2
                       markerattrs=(symbol=CircleFilled size=10) jitter;
   xaxis grid; yaxis grid;
run;

Although PROC SGPLOT computes these ellipses automatically, you might be interested in the multivariate statistics behind the ellipses. The ellipses depend on the number of observations, the mean vector, and the covariance matrix for each group. The following SAS/IML module computes these quantities for each group. The module assembles the statistics into three matrices:

  • The ns vector contains the number of observations. The k_th row contains the number of observations in the k_th group.
  • The means matrix contains the group means. The k_th row contains the mean vector for the k_th group.
  • The covs matrix contains the group covariance matrices. The k_th row contains the covariance matrix for the k_th group. You can use the ROWVEC function to "flatten" a matrix into a row vector. When you want to reconstruct the matrix, you can use the SHAPE function, as discussed in the article "Create an array of matrices in SAS."

These three items are packed into a list and returned by the module.

proc iml;
/* GIVEN Y and a classification variable Group that indicate group membership,
   return a list that contains:
   1. Number of observations in each group
   2. ML estimates of within-group mean
   3. ML estimate of within-group covariance 
   The k_th row of each matrix contains the statistics for the k_th group.
*/
start MLEstMVN(Y, Group);
   d = ncol(Y);
   u = unique(Group);
   G = ncol(u);
   ns    = J(G,   1, 0);           /* k_th row is number of obs in k_th group */
   means = J(G,   d, 0);           /* k_th row is mean of k_th group */
   covs  = J(G, d*d, 0) ;          /* k_th row is cov of k_th group */
 
   do k = 1 to G;
      X = Y[loc(Group=u[k]), ];    /* obs in k_th group */
      n = nrow(X);                 /* number of obs in this group */
      C = (n-1)/n * cov(X);        /* ML estimate of COV does not use the Bessel correction */
      ns[k]      = n;             
      means[k, ] = mean(X);        /* ML estimate of mean */
      covs[k, ]  = rowvec( C );    /* store COV in k_th row */
   end;
   outList = [#'n'=ns, #'mean'=means, #'cov'=Covs, #'group'=u]; /* pack into list */
   return (outList);
finish;
 
/* Example: read the iris data  */
varNames = {'PetalWidth' 'SepalWidth'};
use Sashelp.Iris;
read all var varNames into X;
read all var {'Species'};
close;
 
L = MLEstMVN(X, Species);            /* call function; return named list of results */
ns = L$'n';                          /* extract obs numbers          */
means = L$'mean';                    /* extract mean vectors         */
Covs = L$'cov';                      /* extract covariance matrix    */
lbl = L$'group';
print ns[r=lbl c='n'], 
      means[r=lbl c={'m1' 'm2'}], 
      covs[r=lbl c={C11 C12 C21 C22}];

The multivariate statistics are shown. The k_th row of each matrix contains a statistic for the k_th group. The number of observations and the mean vectors are easy to understand and interpret. The covariance matrix for each group has been flattened into a row vector. You can use the SHAPE function to rearrange the row into a matrix. For example, the covariance matrix for the first group ('Setosa') can be constructed as Setosa_Cov = shape(covs[1,], 2).

An alternative: A list of lists

It can be convenient to return a list of lists that contains the statistics. Suppose you want to compute m statistics for each of G groups. The function in the previous section returned a list that contains m items and each item had G rows. (I also included an additional list item, the group levels, but that is optional.) Alternatively, you could return a list that has G sublists, where each sublist contains the m statistics for the group. This is shown in the following function, which returns the same information but organizes it in a different way.

/* Alternative: Return a list of G sublists. Each sublist has only 
   the statistics relevant to the corresponding group.    */
start MLEst2(Y, Group);
   u = unique(Group);
   G = ncol(u);
   outList = ListCreate(G);
 
   do k = 1 to G;
      X = Y[loc(Group=u[k]), ];    /* obs in k_th group */
      n = nrow(X);                 /* number of obs in k_th group */
      SL = [#'n'=n, 
            #'mean'=mean(X),        /* ML estimate of mean */
            #'cov'=(n-1)/n*cov(X)]; /* ML estimate of COV; no need to flatten */
      call ListSetItem(outList, k, SL);   /* set sublist as the k_th item */
   end;
   call ListSetName(outList, 1:G, u);     /* give a name to each sublist */
   return (outList);
finish;
 
L2 = MLEst2(X, Species);
package load ListUtil;  /* load a package to print lists */
call Struct(L2);        /* or: call ListPrint(L2);       */

The output from the STRUCT call gives a hierarchical overview of the L2 list. The leftmost column indicates the hierarchy of the lists. The list has G=3 items, which are themselves lists. Each list has a name (the group level and m=3 items: the number of observations ('n'), the mean vector ('mean'), and the within-group MLE covariance matrix ('cov'). In this output, notice that there is no need to flatten the covariance matrix into a row vector. For some applications, this list format might be easier to work with. If you want, for example, the mean vector for the 'Setosa' group, you can access it by using the notation m = L2$'Setosa'$'mean'. Or you could extract the 'Setosa' list and access the statistics as follows:

SetosaList = L2$'Setosa';    /* get the list of statistics for the 'Setosa' group */
m = SetosaList$'mean';       /* get the mean vector */
C = SetosaList$'cov';        /* get the MLE covariance matrix */
print m[c=varNames], C[c=varNames r=varNames];

Summary

This article shows how to do two things:

  • How to use the UNIQUE-LOC technique to compute multivariate statistics for each group in the data.
  • How to return those statistics by using SAS/IML lists. Two ways to organize the output are presented. The simplest method is to stack the statistics into matrices where the k_th row represents the k_th group. You can then return a list of the matrices. The second method is to return a list of sublists, where the k_th sublist contains the statistics for the k_th list. The first method requires that you "flatten" matrices into row vectors whereas the second method does not.

The post Compute within-group multivariate statistics and store them in a list appeared first on The DO Loop.

7月 012020
 

A previous article discusses the pooled variance for two or groups of univariate data. The pooled variance is often used during a t test of two independent samples. For multivariate data, the analogous concept is the pooled covariance matrix, which is an average of the sample covariance matrices of the groups. If you assume that the covariances within the groups are equal, the pooled covariance matrix is an estimate of the common covariance. This article shows how to compute and visualize a pooled covariance matrix in SAS. It explains how the pooled covariance relates to the within-group covariance matrices. It discusses a related topic, called the between-group covariance matrix.

The within-group matrix is sometimes called the within-class covariance matrix because a classification variable is used to identify the groups. Similarly, the between-group matrix is sometimes called the between-class covariance matrix.

Visualize within-group covariances

Suppose you want to analyze the covariance in the groups in Fisher's iris data (the Sashelp.Iris data set in SAS). The data set contains four numeric variables, which measure the length and width of two flower parts, the sepal and the petal. Each observation is for a flower from an iris species: Setosa, Versicolor, or Virginica. The Species variable in the data identifies observations that belong to each group, and each group has 50 observations. The following call to PROC SGPLOT creates two scatter plots and overlays prediction ellipses for two pairs of variables:

title "68% Prediction Ellipses for Iris Data";
proc sgplot data=Sashelp.Iris;
   scatter x=SepalLength y=SepalWidth / group=Species transparency=0.5;
   ellipse x=SepalLength y=SepalWidth / group=Species alpha=0.32 lineattrs=(thickness=2);
run;
proc sgplot data=Sashelp.Iris;
   scatter x=PetalLength y=PetalWidth / group=Species transparency=0.5;
   ellipse x=PetalLength y=PetalWidth / group=Species alpha=0.32 lineattrs=(thickness=2);
run;

The ellipses enable you to visually investigate whether the variance of the data within the three groups appears to be the same. For these data, the answer is no because the ellipses have different shapes and sizes. Some of the prediction ellipses have major axes that are oriented more steeply than others. Some of the ellipses are small, others are relatively large.

You might wonder why the graph shows a 68% prediction ellipse for each group. Recall that prediction ellipses are a multivariate generalization of "units of standard deviation." If you assume that measurements in each group are normally distributed, 68% of random observations are within one standard deviation from the mean. So for multivariate normal data, a 68% prediction ellipse is analogous to +/-1 standard deviation from the mean.

The pooled covariance is an average of within-group covariances

The pooled covariance is used in linear discriminant analysis and other multivariate analyses. It combines (or "pools") the covariance estimates within subgroups of data. The pooled covariance is one of the methods used by Friendly and Sigal (TAS, 2020) to visualize homogeneity tests for covariance matrices.

Suppose you collect multivariate data for \(k\) groups and \(S_i\) is the sample covariance matrix for the \(n_i\) observations within the \(i\)th group. If you believe that the groups have a common variance, you can estimate it by using the pooled covariance matrix, which is a weighted average of the within-group covariances:
\(S_p = \Sigma_{i=1}^k (n_i-1)S_i / \Sigma_{i=1}^k (n_i - 1)\)

If all groups have the same number of observations, then the formula simplifies to \(\Sigma_{i=1}^k S_i / k\), which is the simple average of the matrices. If the group sizes are different, then the pooled variance is a weighted average, where larger groups receive more weight than smaller groups.

Compute the pooled covariance in SAS

In SAS, you can often compute something in two ways. The fast-and-easy way is to find a procedure that does the computation. A second way is to use the SAS/IML language to compute the answer yourself. When I compute something myself (and get the same answer as the procedure!), I increase my understanding.

Suppose you want to compute the pooled covariance matrix for the iris data. The fast-and-easy way to compute a pooled covariance matrix is to use PROC DISCRIM. The procedure supports the OUTSTAT= option, which writes many multivariate statistics to a data set, including the within-group covariance matrices, the pooled covariance matrix, and something called the between-group covariance. (It also writes analogous quantities for centered sum-of-squares and crossproduct (CSSCP) matrices and for correlation matrices.)

proc discrim data=sashelp.iris method=normal pool=yes outstat=Cov noprint;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
 
proc print data=Cov noobs; 
   where _TYPE_ = "PCOV";
   format _numeric_ 6.2;
   var _TYPE_ _NAME_ Sepal: Petal:;
run;

The table shows the "average" covariance matrix, where the average is across the three species of flowers.

Within-group covariance matrices

The same output data set contains the within-group and the between-group covariance matrices. The within-group matrices are easy to understand. They are the covariance matrices for the observations in each group. Accordingly, there are three such matrices for these data: one for the observations where Species="Setosa", one for Species="Versicolor", and one for Species="Virginica". The following call to PROC PRINT displays the three matrices:

proc print data=Cov noobs; 
   where _TYPE_ = "COV" and Species^=" ";
   format _numeric_ 6.2;
run;

The output is not particularly interesting, so it is not shown. The matrices are the within-group covariances that were visualized earlier by using prediction ellipses.

Visual comparison of the pooled covariance and the within-group covariance

Friendly and Sigal (2020, Figure 1) overlay the prediction ellipses for the pooled covariance on the prediction ellipses for the within-group covariances. A recreation of Figure 1 in SAS is shown below. You can use the SAS/IML language to draw prediction ellipses from covariance matrices.

The shaded region is the prediction ellipse for these two variables in the pooled covariance matrix. It is centered at the weighted average of the group means. You can see that the pooled ellipse looks like an average of the other ellipses. This graph shows only one pair of variables, but see Figure 2 of Friendly and Sigal (2020) for a complete scatter plot matrix that compares the pooled covariance to the within-group covariance for each pair of variables.

Between-group covariance matrices

Another matrix in the PROC DISCRIM output is the so-called between-group covariance matrix. Intuitively, the between-group covariance matrix is related to the difference between the full covariance matrix of the data (where the subgroups are ignored) and the pooled covariance matrix (where the subgroups are averaged). The precise definition is given in the next section. For now, here is how to print the between-group covariance matrix from the output of PROC DISCRIM:

proc print data=Cov noobs; 
   where _TYPE_ = "BCOV";
   format _numeric_ 6.2;
run;

How to compute the pooled and between-group covariance

If I can compute a quantity "by hand," then I know that I truly understand it. Thus, I wrote a SAS/IML program that reproduces the computations made by PROC DISCRIM. The following steps are required to compute each of these matrices from first principles.

  1. For each group, compute the covariance matrix (S_i) of the observations in that group.
  2. Note that the quantity (n_i - 1)*S_i is the centered sum-of-squares and crossproducts (CSSCP) matrix for the group. Let M be the sum of the CSSCP matrices. The sum is the numerator for the pooled covariance.
  3. Form the pooled covariance matrix as S_p = M / (N-k).
  4. Let C be the CSSCP data for the full data (which is (N-1)*(Full Covariance)). The between-group covariance matrix is BCOV = (C - M) * k / (N*(k-1)).

You can use the UNIQUE-LOC trick to iterate over the data for each group. The following SAS/IML program implements these computations:

/* Compute a pooled covariance matrix when observations
   belong to k groups with sizes n1, n2, ..., nk, where n1+n2+...+nk = N
*/
proc iml;
varNames = {'SepalLength' 'SepalWidth' 'PetalLength' 'PetalWidth'};
use Sashelp.iris;
   read all var varNames into Z;
   read all var "Species" into Group;
close;
 
/* assume complete cases, otherwise remove rows with missing values */
N   = nrow(Z);
 
/* compute the within-group covariance, which is the covariance for the observations in each group */
u = unique(Group);
k = ncol(u);                /* number of groups */
p   = ncol(varNames);       /* number of variables */
M = j(p, p, 0);             /* sum of within-group CSCCP matrices */
do i = 1 to k;
   idx = loc(Group = u[i]); /* find rows for this group */
   X = Z[idx,];             /* extract obs for i_th group */
   n_i = nrow(X);           /* n_i = size of i_th group */
   S   = cov(X);            /* within-group cov */
   /* accumulate the weighted sum of within-group covariances */
   M = M + (n_i-1) * S;     /* (n_i-1)*S is centered X`*X */
end;
 
/* The pooled covariance is an average of the within-class covariance matrices. */
Sp = M / (N-k);
print Sp[L="Pooled Cov" c=varNames r=VarNames format=6.2];
 
/* The between-class CSSCP is the difference between total CSSCP and the sum of the 
   within-group CSSCPs. The SAS doc for PROC DISCRIM defines the between-class 
   covariance matrix as the between-class SSCP matrix divided by N*(k-1)/k, 
   where N is the number of observations and k is the number of classes.
*/
/* the total covariance matrix ignores the groups */
C = (N-1)*cov(Z);    
BCSSCP = C - M;               /* between = Full - Sum(Within) */
BCov = BCSSCP * k/( N*(k-1) );
print BCov[L="Between Cov" c=varNames r=VarNames format=6.2];

Success! The SAS/IML program shows the computations that are needed to reproduce the pooled and between-group covariance matrices. The results are the same as are produced by PROC DISCRIM.

Summary

In multivariate ANOVA, you might assume that the within-group covariance is constant across different groups in the data. The pooled covariance is an estimate of the common covariance. It is a weighted average of the sample covariances for each group, where the larger groups are weighted more heavily than smaller groups. I show how to visualize the pooled covariance by using prediction ellipses.

You can use PROC DISCRIM to compute the pooled covariance matrix and other matrices that represent within-group and between-group covariance. I also show how to compute the matrices from first principles by using the SAS/IML language. You can download the SAS program that performs the computations and creates the graphs in this article.

The post Pooled, within-group, and between-group covariance matrices appeared first on The DO Loop.

6月 292020
 

The first time I saw a formula for the pooled variance, I was quite confused. It looked like Frankenstein's monster, assembled from bits and pieces of other quantities and brought to life by a madman. However, the pooled variance does not have to be a confusing monstrosity. The verb "to pool" means to combine resources from several sources, as in, "if the siblings pool their money, they can buy a nice gift for their mother." Thus a "pooled variance" combines information from several subgroups of the data in order to obtain a better estimate. This article shows how to understand the pooled variance, which is an average of the sample variances for subgroups of the data.

The following graph visualizes the pooled variance and the variances within groups. In the graph, there are three groups. The sample variance within each group is plotted as a blue marker. The pooled variance is indicated by a horizontal line. The pooled variance appears to be an average of the three sample variances. The exact formulas and the data for this graph are explained in subsequent sections.

The pooled variance is an average of group variances

Most students first encounter the pooled variance in a statistics course when learning about two-sample t tests. In a two-sample t test, you have data in two groups and you want to test whether the means of the two groups are different. In order to run a two-sample t test, you need to decide whether you think the variances of the two groups are equal. If you think the group variances are equal, you compute the pooled variance, which estimates the common variance. You use the pooled variance estimate to compute the t statistic.

The pooled variance combines (or "pools") the variance estimates within the individual groups. The pooled variance is a better estimate of the (unknown) common group variance than either of the individual group variances. If each group has the same number of observations, then the pooled variance is a simple average. If the group sizes are different, then the pooled variance is a weighted average, where larger groups receive more weight than smaller groups.

For the two-sample case, let \(s_1^2\) and \(s_2^2\) be the sample variances within the groups A and B, respectively. Let \(n_1\) and \(n_2\), respectively, be the number of observations in the groups. If you assume that the data in each group have the same variance, how can you estimate that common variance? One way is to "pool" the variances of each group and compute a weighted average of the variances. For two groups, the pooled variance is
\(s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2- 2}\)
The numerator is the weighted sum of the group variances. Dividing by the sum of the weights means that the pooled variance is the weighted average of the two quantities.

Notice that if \(n_1 = n_2\), then the formula simplifies. When the group sizes are equal, the pooled variance reduces to \(s_p^2 = (s_1^2 + s_2^2)/2\), which is the average of the two variances.

The idea behind the pooled variance generalizes to multiple groups. Suppose you collect data for \(k\) groups and \(s_i^2\) is the sample variance within the \(i\)th group, which has \(n_i\) observations. If you believe that the groups have a common variance, how might you estimate it? One way is to compute the pooled variance as a weighted average of the group variances:
\(s_p^2 = \Sigma_{i=1}^k (n_i-1)s_i^2 / \Sigma_{i=1}^k (n_i - 1)\)
Again, if all groups have the same number of observations, then the formula simplifies to \(\Sigma_{i=1}^k s_i^2 / k\), which is the average of the \(k\) variances.

Visualize the pooled variance

Formulas are necessary for computation, but I like to think in pictures. You can draw a graph that shows how the pooled variance relates to the variances of the groups.

Let's use an example that has three groups: A, B, and C. The following data (adapted from an example in the SAS documentation) are the ozone measurements (Y, in parts per billion) at three different collection sites (A, B, and C):

Site  n   Ozone (ppb)
==== === =============================================
A    22   4 6 3 4 7 8 2 3 4 1 3 8 9 5 6 4 6 3 4 7 3 7
B    15   5 3 6 2 1 2 4 3 2 4 6 4 6 3 6
C    18   8 9 7 8 6 7 6 7 9 8 9 8 7 8 5 8 9 7

The following graph shows a box plot of the ozone levels at each site. The height of each box gives a rough estimate of the variance in each group. The mean and variance for sites A and B appear to be similar to each other. However, the mean at site C appears to be larger and the variance seems smaller, compared to the other sites.

You can use your favorite statistical software to estimate the variance of the data for each site. The following table is produced in SAS by using the UNIVARIATE procedure and the CIBASIC option:

The table shows an estimate for the variance of the data within each group. Although the smallest sample variance (Group C: 1.32) seems much smaller than the largest sample variance (Group A: 4.69), notice that the 95% confidence intervals overlap. A parameter value such as 2.8 or 2.9 would simultaneously be in all three confidence intervals. If we assume that the variance of the groups are equal, the pooled variance formula provides a way to estimate the common variance. Applying the formula, the estimate for the pooled variance for these data is
\(s_p^2 = (21*4.69 + 14*2.89 + 17*1.32) / (21+14+17) = 3.10\)

The graph at the top of this article visualizes the information in the table and uses a reference line to indicate the pooled variance. The blue markers indicate the sample variances of each group. The confidence intervals for the population variances are shown by the vertical lines. The pooled variance is indicated by the horizontal reference line. It is the weighted average of the sample variances. If you think that all groups have the same variance, the pooled variance estimates that common variance.

Summary

In two-sample t tests and ANOVA tests, you often assume that the variance is constant across different groups in the data. The pooled variance is an estimate of the common variance. It is a weighted average of the sample variances for each group, where the larger groups are weighted more heavily than smaller groups.

You can download a SAS program that computes the pooled variance and creates the graphs in this article. Although the pooled variance provides an estimate for a common variance among groups, it is not always clear when the assumption of a common variance is valid. The visualization in this article can help, or you can perform a formal "homogeneity of variance" test as part of an ANOVA analysis.

The post What is a pooled variance? appeared first on The DO Loop.

6月 082020
 

A SAS customer asked how to specify interaction effects between a classification variable and a spline effect in a SAS regression procedure. There are at least two ways to do this. If the SAS procedure supports the EFFECT statement, you can build the interaction term in the MODEL statement. For procedures that do not support the EFFECT statement, you can generate the interaction effects yourself. Because the customer's question was specifically about the GAMPL procedure, which fits generalized additive models, I also show a third technique, which enables you to model interactions in the GAMPL procedure.

To illustrate the problem, consider the following SAS DATA step, which simulates data from two different response functions:

data Have;
call streaminit(1);
Group="A"; 
do x = 1 to 10 by .1;
   y =    2*x + sin(x) + rand("Normal"); output;
end;
Group="B";
do x = 1 to 10 by .1;
   y = 10 - x - sin(x) + rand("Normal"); output;
end;
run;

The graph shows the data and overlays a regression model. The model predicts the response for each level of the Group variable. Notice that the two curves have different shapes, which means that there is an interaction between the Group variable and the x variable. The goal of this article is to show how to use a spline effect to predict the response for each group.

Use the grammar of the procedure to form interactions

Many SAS regression procedures support the EFFECT statement, the CLASS statement, and enable you to specify interactions on the MODEL statement. Examples include the GLMMIX, GLMSELECT, LOGISTIC, QUANTREG, and ROBUSTREG procedures. For example, the following call to PROC GLMSELECT specifies several model effects by using the "stars and bars" syntax:

  • The syntax Group | x includes the classification effect (Group), a linear effect (x), and an interaction effect (Group*x).
  • The syntax Group * spl includes an interaction effect between the classification variable and the spline.
title "GLMSELECT Model with Spline and Interaction Effects";
ods select none;
proc glmselect data=Have
        outdesign(addinputvars fullmodel)=SplineBasis; /* optional: write design matrix to data set */
   class Group;
   effect spl = spline(x);
   model y = Group|x  Group*spl / selection=none;
   output out=SplineOut predicted=p;            /* output predicted values */
quit;
ods select all;
 
%macro PlotInteraction(dsname);
   proc sgplot data=&dsname;
      scatter x=x y=y / group=Group;
      series x=x y=p / group=Group curvelabel;
      keylegend / location=inside position=E;
   run;
%mend;
%PlotInteraction(SplineOut);

The predicted curves are shown in the previous section. I wrapped the call to PROC SGPLOT in a macro so that I can create the same plot for other models.

Output the design matrix for the interactions

Some SAS procedures do not support the EFFECT statement. For these procedures, you cannot form effects like Group*spline(x) within the procedure. However, the GLMSELECT procedure supports the OUTDESIGN= option, which writes the design matrix to a SAS data set. I used the option in the previous section to create the SplineBasis data set. The data set includes variables for the spline basis and for the interaction between the Group variable and the spline basis. Therefore, you can use the splines and interactions with splines in other SAS procedures.

For example, the GLM procedure does not support the EFFECT statement, so you cannot generate a spline basis within the procedure. However, you can read the design matrix from the previous call to PROC GLMSELECT. The interaction effects are named spl_Group_1_A, spl_Group_1_B, spl_Group_2_A, ...., where the suffix "_i_j" indicates the interaction effect between the i_th spline basis and the j_th level of the Group variable. You can type the names of the design variables, or you can use the "colon notation" to match all variables that begin with the prefix "spl_Group". The following call to PROC GLM computes the same model as in the previous section:

proc glm data=SplineBasis;
   class Group;
   model y = Group|x  spl_Group: ;           /* spl_Group: => spl_Group_1_A,  spl_Group_1_B, ... */
   output out=GLMOut predicted=p;            /* output predicted values */
quit;
 
/* show that the predicted values are the same */
data Comp;
   merge SplineOut GLMOut(keep=p rename=(p=p2));
   diff = p - p2;
run;
proc means data=comp Mean Min Max;  var diff;  run;

The output of PROC MEANS shows that the two models are the same. There is no difference between the predicted values from PROC GLM (which reads the design matrix) and the values from PROC GLMSELECT (which reads the raw data).

The splines of the interactions versus the interactions of the splines

Some nonparametric regression procedures, such as the GAMPL procedure, have their own syntax to generate spline effects. In fact, PROC GAMPL uses thin-plate splines, which are different from the splines that are supported by the EFFECT statement. Recently, a SAS customer asked how he could model interaction terms in PROC GAMPL.

The GAMPL procedure supports semiparametric models. In the parametric portion of the model, it is easy to specify interactions by using the standard "stars and bars" notation in SAS. However, the SPLINE() transformation in PROC GAMPL does not support interactions between a classification variable and a continuous variable. To be specific, consider the following semiparametric model:

title "No Interaction Between Classification and Spline Effects";
proc gampl data=Have; *plots=components;
   class Group;
   model Y = param(Group | x)     /* parametric terms */
             spline(x);           /* nonprametric, but no interaction */
   output out=GamPLOut pred=p r=r;
   id Y X Group;
run;
 
%PlotInteraction(SplineOut);

The output shows that the model does not capture the nonlinear features of the data. This is because the spline term is computed for all values of x, not separately for each group. When the groups are combined, the spline effect is essentially linear, which is probably not the best model for these data.

Unfortunately, in SAS/STAT 15.1, PROC GAMPL does not support a syntax such as "spline(Group*x)", which would enable the shape of the nonparametric curve to vary between levels of the Group variable. In fact, the 15.1 documentation states, "only continuous variables (not classification variables) can be specified in spline-effects."

However, if your goal is to use a flexible model to fit the data, there is a model that might work. Instead of the interaction between the Group variable and spline(x), you could use the spline effects for the interaction between the Group variable and x. These are not the same models: the interaction with the splines is not equal to the spline of the interactions! However, let's see how to fit this model to these data.

The idea is to form the interaction effect Group*x in a separate step. You can use the design matrix (SplineBasis) that was created in the first section, but to make the process as transparent as possible, I am going to manually generate the dummy variables for the interaction effect Group*x:

data Interact; 
set Have; 
x_Group_A = x*(Group='A'); 
x_Group_B = x*(Group='B');
run;

You can use the spline effects for the variable x_Group_A and x_Group_B in regression models. The following model is not the same as the previous models, but is a flexible model that seems to fit these data:

title "Interaction Between Classification and Spline Effects";
proc gampl data=Interact plots=components;
   class Group;
   model Y = param(Group | x) 
             spline(x_Group_A) spline(x_Group_B);  /* splines of interactions */
   output out=GamPLOut pred=p;
   id Y X Group;
run;
 
title "GAMPL with Interaction Effects";
%PlotInteraction(GamPLOut);

The graph shows that (visually) the model seems to capture the undulations in the data as well as the earlier model. Although PROC GAMPL does not currently support interactions between classification variables and spline effects, this trick provides a way to model the data.

The post Interactions with spline effects in regression models appeared first on The DO Loop.

6月 032020
 

I recently read an article that describes ways to compute confidence intervals for the difference in a percentile between two groups. In Eaton, Moore, and MacKenzie (2019), the authors describe a problem in hydrology. The data are the sizes of pebbles (grains) in rivers at two different sites. The authors ask whether the p_th percentile of size is significantly different between the two sites. They also want a confidence interval for the difference. The median (50th percentile) and 84th percentile were the main focus of their paper.

For those of us who are not hydrologists, consider a sample from group 'A' and a sample from group 'B'. The problem is to test whether the p_th percentile is different in the two groups and to construct a confidence interval for the difference.

The authors show two methods: a binomial-based probability model (Section 2) and a bootstrap confidence interval (Section 3). However, there is another option: use quantile regression to estimate the difference in the quantiles and to construct a confidence interval. In this article, I show the computation for the 50th and 84th percentiles and compare the results to a bootstrap estimate. You can download the SAS program that creates the analyses and graphs in this article.

If you are not familiar with quantile regression, see my previous article about how to use quantile regression to estimate the difference between two medians.

A data distribution of particle sizes

For convenience, I simulated samples for the sizes of pebbles in two stream beds:

  • For Site 'A', the sizes are lognormally distributed with μ=0.64 and σ=1. For the LN(0.64, 1) distribution, the median pebble size is 1.9 mm and the size for the 84th percentile is 5.1 mm.
  • For Site 'B', the sizes are lognormally distributed with μ=0.53 and σ=1. For the LN(0.53, 1) distribution, the median pebble size is 1.7 mm and the size for the 84th percentile is 4.6 mm.

I assume that sizes are measured to the nearest 0.1 mm. I simulated 500 pebbles from Site 'A' and 300 pebbles from Site 'B'. You can use PROC UNIVARIATE in SAS to compute a comparative histogram and to compute the percentiles. The sample distributions are shown below:

For Site 'A', the estimates for the sample median and 84th percentile are 2.05 mm and 5.1 mm, respectively. For Site 'B', the estimates are 1.60 mm and 4.7 mm, respectively. These estimates are close to their respective parameter values.

Estimate difference in percentiles

The QUANTREG procedure in SAS makes it easy to estimate the difference in the 50th and 84th percentiles between the two groups. The syntax for the QUANTREG procedure is similar to other SAS regression procedures such as PROC GLM:

proc quantreg data=Grains;
   class Site;
   model Diameter = Site / quantile=0.5 0.84;
   estimate 'Diff in Pctl' Site 1 -1 / CL;
run;

The output from PROC QUANTREG shows estimates for the difference between the percentiles of the two groups. For the 50th percentile, an estimate for the difference is 0.4 with a 95% confidence of [0.09, 0.71]. Because 0 is not in the confidence interval, you can conclude that the median pebble size at Site 'A' is significantly different from the median pebble size at Site 'B'. For the 84th percentile, an estimate for the difference is 0.3 with a 95% confidence of [-0.57, 1.17]. Because 0 is in the interval, the difference between the 84th percentiles is not significantly different from 0.

Methods for estimating confidence intervals

The QUANTREG procedure supports several different methods for estimating a confidence interval: sparsity, rank, and resampling. The estimates in the previous section are by using the RANK method, which is the default for smaller data sets. You can use the CI= option on the PROC QUANTREG statement to use these methods and to specify options for the methods. The following graph summarizes the results for four combinations of methods and options. The results of the analysis do not change: The medians are significantly different but the 84th percentiles are not.

A comparison with bootstrap estimates

When you use a resampling-based estimate for the confidence interval, the interval depends on the random number seed, the algorithm used to generate random numbers, and the number of bootstrap iterations. This can make it hard to compare Monte Carlo results to the results from deterministic statistical methods. Nevertheless, I will present one result from a bootstrap analysis of the simulated data. The following table shows the bootstrap percentile estimates (used by Eaton, Moore, and MacKenzie (2019)) for the difference between percentiles.

The confidence intervals (the Lower and Upper columns of the table) are comparable to the intervals produced by PROC QUANTREG. The confidence interval for the difference between medians seems equivalent. The bootstrap interval for the 84th percentile is shifted to the right relative to the QUANTREG intervals. However, the inferences are the same: the medians are different but there is no significant difference between the 84th percentiles.

The following histogram shows the difference between the 84th percentiles for 5,000 bootstrap samples. The confidence interval is determined by the 2.5th and 97.5th percentiles of the bootstrap distribution. The computation requires only a short SAS/IML program, which runs very quickly.

Summary

Data analysts might struggle to find an easy way to compute the difference between percentiles for two (or more) groups. A recent paper by Eaton, Moore, and MacKenzie (2019) proposes one solution, which is to use resampling methods. An alternative is to use quantile regression to estimate the difference between percentiles, as shown in this article. I demonstrate the method by using simulated data of the form studied by Eaton, Moore, and MacKenzie.

You can download the SAS program that creates the analyses and graphs in this article.

The post How to estimate the difference between percentiles appeared first on The DO Loop.

5月 282020
 

The Kullback–Leibler divergence is a measure of dissimilarity between two probability distributions. An application in machine learning is to measure how distributions in a parametric family differ from a data distribution. This article shows that if you minimize the Kullback–Leibler divergence over a set of parameters, you can find a distribution that is similar to the data distribution. This article focuses on discrete distributions.

The Kullback–Leibler divergence between two discrete distributions

As explained in a previous article, the Kullback–Leibler (K-L) divergence between two discrete probability distributions is the sum
KL(f, g) = Σx f(x) log( f(x)/g(x) )
where the sum is over the set of x values for which f(x) > 0. (The set {x | f(x) > 0} is called the support of f.) For this sum to be well defined, the distribution g must be strictly positive on the support of f.

One application of the K-L divergence is to measure the similarity between a hypothetical model distribution defined by g and an empirical distribution defined by f.

Example data for the Kullback–Leibler divergence

As an example, suppose a call center averages about 10 calls per hour. An analyst wants to investigate whether the number of calls per hour can be modeled by using a Poisson(λ=10) distribution. To test the hypothesis, the analyst records the number of calls for each hour during 100 hours of operation. The following SAS DATA step reads the data. The call to PROC SGPLOT creates a histogram shows the distribution of the 100 counts:

data Calls;
input N @@;
label N = "Calls per hour";
datalines;
11 19 11 13 13 8 11 9 9 14 
10 13 8 15 7 9 6 12 7 13 
12 19 6 12 11 12 11 9 15 4 
7 12 12 10 10 16 18 13 13 8 
13 10 9 9 12 13 12 8 13 9 
7 9 10 9 4 10 12 5 4 12 
8 12 14 16 11 7 18 8 10 13 
12 5 11 12 16 9 11 8 11 7 
11 15 8 7 12 16 9 18 9 8 
10 7 11 12 13 15 6 10 10 7 
;
 
title "Number of Calls per Hour";
title2 "Data for 100 Hours";
proc sgplot data=Calls;
   histogram N / scale=proportion binwidth=1;
   xaxis values=(4 to 19) valueshint;
run;

The graph should really be a bar chart, but I used a histogram with BINWIDTH=1 so that the graph reveals that the value 17 does not appear in the data. Furthermore, the values 0, 1, 2, and 3 do not appear in the data. I used the SCALE=PROPORTION option to plot the data distribution on the density scale.

The call center wants to model these data by using a Poisson distribution. The traditional statistical approach is to use maximum likelihood estimation (MLE) to find the parameter, λ, in the Poisson family so that the Poisson(λ) distribution is the best fit to the data. However, let's see how using the Kullback–Leibler divergence leads to a similar result.

The Kullback–Leibler divergence between data and a Poisson distribution

Let's compute the K-L divergence between the empirical frequency distribution and a Poisson(10) distribution. The empirical distribution is the reference distribution; the Poisson(10) distribution is the model. The Poisson distribution has a nonzero probability for all x ≥ 0, but recall that the K-L divergence is computed by summing over the observed values of the empirical distribution, which is the set {4, 5, ..., 19}, excluding the value 17.

proc iml;
/* read the data, which is the reference distribution, f */
use Calls;  read all var "N" into Obs;  close;
call Tabulate(Levels, Freq, Obs);   /* find unique values and frequencies */
Proportion = Freq / nrow(Obs);      /* empirical density of frequency of calls (f) */
 
/* create the model distribution: Poisson(10) */
lambda = 10;   
poisPDF = pdf("Poisson", Levels, lambda); /* Poisson model on support(f) */
 
/* load K-L divergence module or include the definition from: 
 https://blogs.sas.com/content/iml/2020/05/26/kullback-leibler-divergence-discrete.html
*/
load module=KLDiv;
KL = KLDiv(Proportion, poisPDF); 
print KL[format=best5.];

Notice that although the Poisson distribution has infinite support, you only need to evaluate the Poisson density on the (finite) support of empirical density.

Minimize the Kullback–Leibler divergence between data and a Poisson distribution

The previous section shows how to compute the Kullback–Leibler divergence between an empirical density and a Poisson(10) distribution. You can repeat that computation for a whole range of λ values and plot the divergence versus the Poisson parameter. The following statements compute the K-L divergence for λ on [4, 16] and plots the result. The minimum value of the K-L divergence is achieved near λ = 10.7. At that value of λ, the K-L divergence between the data and the Poisson(10.7) distribution is 0.105.

/* Plot the K-L div versus lambda for a sequence of Poisson(lambda) models */
lambda = do(4, 16, 0.1);
KL = j(1, ncol(lambda), .);
do i = 1 to ncol(lambda);
   poisPDF = pdf("Poisson", Levels, lambda[i]);
   KL[i] = KLDiv(Proportion, poisPDF); 
end;
 
title "K-L Divergence from Poisson(lambda)";
call series(lambda, KL) grid={x y} xvalues=4:16 label={'x' 'K-L Divergence'};

The graph shows the K-L divergence for a sequence of Poisson(λ) models. The Poisson(10.7) model has the smallest divergence from the data distribution, therefore it is the most similar to the data among the Poisson(λ) distributions that were considered. You can use a numerical optimization technique in SAS/IML if you want to find a more accurate value that minimizes the K-L divergences.

The following graph overlays the PMF for the Poisson(10.7) distribution on the empirical distribution for the number of calls.

The minimal Kullback–Leibler divergence and the maximum likelihood estimate

You might wonder how minimizing the K-L divergence relates to the traditional MLE method for fitting a Poisson model to the data. The following call to PROC GENMOD shows that the MLE estimate is λ = 10.71:

proc genmod data=MyData;
   model Obs = / dist=poisson;
   output out=PoissonFit p=lambda;
run;
 
proc print data=PoissonFit(obs=1) noobs;
   var lambda;
run;

Is this a coincidence? No. It turns out that there a connection between the K-L divergence and the negative log-likelihood. Minimizing the K-L divergence is equivalent to minimizing the negative log-likelihood, which is equivalent to maximizing the likelihood between the Poisson model and the data.

Summary

This article shows how to compute the Kullback–Leibler divergence between an empirical distribution and a Poisson distribution. The empirical distribution was the observed number of calls per hour for 100 hours in a call center. You can compute the K-L divergence for many parameter values (or use numerical optimization) to find the parameter that minimizes the K-L divergence. This parameter value corresponds to the Poisson distribution that is most similar to the data. It turns out that minimizing the K-L divergence is equivalent to maximizing the likelihood function. Although the parameter estimates are the same, the traditional MLE estimate comes with additional tools for statistical inference, such as estimates for confidence intervals and standard errors.

The post Minimizing the Kullback–Leibler divergence appeared first on The DO Loop.

5月 062020
 

During this coronavirus pandemic, there are many COVID-related graphs and curves in the news and on social media. The public, politicians, and pundits scrutinize each day's graphs to determine which communities are winning the fight against coronavirus.

Interspersed among these many graphs is the oft-repeated mantra, "Flatten the curve!" As people debate whether to reopen businesses and schools, you might hear some people claim, "we have flattened the curve," whereas others argue, "we have not yet flattened the curve." But what is THE curve to which people are referring? And when does a flat curve indicate success against the coronavirus pandemic?

This article discusses "flattening the curve" in the context of three graphs:

  1. A "what if" epidemiological curve.
  2. The curve of cumulative cases versus time.
  3. A smoothing curve added to a graph of new cases versus time. Spoiler alert: This is the WRONG curve to flatten! You want this curve to be decreasing, not flat.

Before you read further, I strongly encourage you to watch the two-minute video "What does 'flattening the curve' actually mean?" by the Australian Academy of Science. It might be the best-spent two minutes of your day.

Flattening the "what if" epidemiological curve

In 2007, the US Centers for Disease Control and Prevention (CDC) published a report on pre-pandemic planning guidance, which discussed how "nonpharmaceutical interventions" (NPIs) can mitigate the spread of viral infections. Nonpharmaceutical interventions include isolating infected persons, social distancing, hand hygiene, and more. The report states (p. 28), "NPIs can delay and flatten the epidemic peak.... [Emphasis added.] Delay of the epidemic peak is critically important because it allows additional time for vaccine development and antiviral production."

This is the original meaning of "flatten the curve." It has to do with the graph of new cases versus time. A graph appeared on p. 18, but the following graph is Figure 1 in the updated 2017 guidelines:

FIGURE 1. Goals of community mitigation for pandemic influenza (CDC, 2007 and 2017)

This image represents a "what if" scenario. The purple curve on the left is a hypothetical curve that represents rapid spread. It shows what might happen if the public does NOT adopt public-health measures to slow the spread of the virus. This is "the curve" that we wish to flatten. The smaller curve on the right represents a slower spread. It shows what can happen if society adopts public-health interventions. The peak of the second curve has been delayed (moved to a later date) and reduced (fewer cases). This second curve is the "flattened curve."

Because the curve that we are trying to flatten (the rapid-spread curve) is unobservable, how can we measure success? One measure of success is if the number of daily cases remains less than the capacity of the healthcare system. Another is that the total number of affected persons is less than predicted under the no-intervention model.

If the public adopts measures that mitigate the spread, the observed new-cases-versus-time curve should look more like the slower-spread curve on the right and less like the tall rapid-spread curve on the left. Interestingly, I have heard people complain that the actual numbers of infections, hospitalizations, and deaths due to COVID-19 are lower than initially projected. They argue that these low numbers are evidence that the mathematical models were wrong. The correct conclusion is that the lower numbers are evidence that social distancing and other NPIs helped to slow the spread.

Flattening the curve of cumulative cases versus time

The previous section discusses the new-cases-by-time graph. Another common graph in the media is the cumulative number of confirmed cases. I have written about the relationship between the new-cases-by-time graph and the cumulative-cases-by-time graph. In brief, the slope of the cumulative-cases-by-time graph equals the height of the new-cases-by-time graph.

For the rapid-spread curve, the associated cumulative curve is very steep, then levels out at a large number (the total number of cases). For the slower-spread curve, the associated cumulative curve is not very steep. It climbs gradually and eventually then levels out at a smaller number of total cases.

When you read that some country (such as South Korea, Australia, or New Zealand) has flattened the curve, the curve that usually accompanies the article is the cumulative-cases-by-time graph. A cumulative curve that increases slowly means that the rate of new cases is small. A cumulative curve that has zero slope means that no new cases are being confirmed. This is definitely a good thing!

A hypothetical scenario is shown below. The upper graph shows the cumulative cases. The lower graph shows the new cases on the same time scale. After Day 60, there are very few new cases. Accordingly, the curve of cumulative cases is flat.

Thus, a cumulative curve that increases slowly and then levels out is analogous to the slower-the-spread epidemiological curve. When the cumulative curve becomes flat, it indicates that new cases are zero or nearly zero.

The wrong curve: Flattening new cases versus time

Unfortunately, not every curve that is flat indicates that conditions are improving. The primary example is the graph of new cases versus time. A constant rate of new cases is not good—although it is certainly better than an increasing rate. To defeat the pandemic, the number of new cases each day should be decreasing. For a new-cases-by-time curve, a downward-sloping curve is the goal, not a flat curve.

The following graph shows new cases by day in the US (downloaded from the New York Times data on 03May2020). I have added a seven-day rolling average to show a smoothed version of the data:

The seven-day average is no longer increasing. It has leveled out. However, this is NOT an example of a "flattened" curve. This graph shows that the average number of new cases was approximately constant (or perhaps slightly declining) for the last three weeks of April. The fact that the curve is approximately horizontal indicates that approximately 28,000 new cases of coronavirus are being confirmed every day.

The cumulative graph for the same US data is shown below. The slope of the cumulative curve in April 2020 is approximately 28,000 cases per day. This cumulative curve is not flattening, but healthcare workers and public-health officials are all working to flatten it.

Sometimes graphs do not have well-labeled axes, so be sure to identify which graph you are viewing. For a curve of cumulative cases, flatness is "good": there are very few new cases. For a curve of new cases, a downward-sloping curve is desirable.

Summary

In summary, this article discusses "flattening the curve," which is often used without specifying what "the curve" actually is. In the classic epidemiological model, "the curve" is the hypothetical number of new cases versus time under the assumption that society does not adopt public-health measures to slow the spread. A flattened curve refers to the effect of interventions that delay and reduce the peak number of cases.

You cannot observe a hypothetical curve, but you can use the curve of cumulative cases to assess the spread of the disease. A cumulative curve that increases slowly and flattens out indicates a community that has slowed the spread. So, conveniently, you can apply the phrase "flatten the curve" to the curve of cumulative cases.

Note that you do not want the graph of new cases versus time to be flat. You want that curve to decrease towards zero.

You can download the SAS program used to create graphs in this article.


LEARN MORE | See all Coronavirus dashboard blog posts

The post What does 'flatten the curve' mean? To which curve does it apply? appeared first on The DO Loop.

5月 062020
 

During this coronavirus pandemic, there are many COVID-related graphs and curves in the news and on social media. The public, politicians, and pundits scrutinize each day's graphs to determine which communities are winning the fight against coronavirus.

Interspersed among these many graphs is the oft-repeated mantra, "Flatten the curve!" As people debate whether to reopen businesses and schools, you might hear some people claim, "we have flattened the curve," whereas others argue, "we have not yet flattened the curve." But what is THE curve to which people are referring? And when does a flat curve indicate success against the coronavirus pandemic?

This article discusses "flattening the curve" in the context of three graphs:

  1. A "what if" epidemiological curve.
  2. The curve of cumulative cases versus time.
  3. A smoothing curve added to a graph of new cases versus time. Spoiler alert: This is the WRONG curve to flatten! You want this curve to be decreasing, not flat.

Before you read further, I strongly encourage you to watch the two-minute video "What does 'flattening the curve' actually mean?" by the Australian Academy of Science. It might be the best-spent two minutes of your day.

Flattening the "what if" epidemiological curve

In 2007, the US Centers for Disease Control and Prevention (CDC) published a report on pre-pandemic planning guidance, which discussed how "nonpharmaceutical interventions" (NPIs) can mitigate the spread of viral infections. Nonpharmaceutical interventions include isolating infected persons, social distancing, hand hygiene, and more. The report states (p. 28), "NPIs can delay and flatten the epidemic peak.... [Emphasis added.] Delay of the epidemic peak is critically important because it allows additional time for vaccine development and antiviral production."

This is the original meaning of "flatten the curve." It has to do with the graph of new cases versus time. A graph appeared on p. 18, but the following graph is Figure 1 in the updated 2017 guidelines:

FIGURE 1. Goals of community mitigation for pandemic influenza (CDC, 2007 and 2017)

This image represents a "what if" scenario. The purple curve on the left is a hypothetical curve that represents rapid spread. It shows what might happen if the public does NOT adopt public-health measures to slow the spread of the virus. This is "the curve" that we wish to flatten. The smaller curve on the right represents a slower spread. It shows what can happen if society adopts public-health interventions. The peak of the second curve has been delayed (moved to a later date) and reduced (fewer cases). This second curve is the "flattened curve."

Because the curve that we are trying to flatten (the rapid-spread curve) is unobservable, how can we measure success? One measure of success is if the number of daily cases remains less than the capacity of the healthcare system. Another is that the total number of affected persons is less than predicted under the no-intervention model.

If the public adopts measures that mitigate the spread, the observed new-cases-versus-time curve should look more like the slower-spread curve on the right and less like the tall rapid-spread curve on the left. Interestingly, I have heard people complain that the actual numbers of infections, hospitalizations, and deaths due to COVID-19 are lower than initially projected. They argue that these low numbers are evidence that the mathematical models were wrong. The correct conclusion is that the lower numbers are evidence that social distancing and other NPIs helped to slow the spread.

Flattening the curve of cumulative cases versus time

The previous section discusses the new-cases-by-time graph. Another common graph in the media is the cumulative number of confirmed cases. I have written about the relationship between the new-cases-by-time graph and the cumulative-cases-by-time graph. In brief, the slope of the cumulative-cases-by-time graph equals the height of the new-cases-by-time graph.

For the rapid-spread curve, the associated cumulative curve is very steep, then levels out at a large number (the total number of cases). For the slower-spread curve, the associated cumulative curve is not very steep. It climbs gradually and eventually then levels out at a smaller number of total cases.

When you read that some country (such as South Korea, Australia, or New Zealand) has flattened the curve, the curve that usually accompanies the article is the cumulative-cases-by-time graph. A cumulative curve that increases slowly means that the rate of new cases is small. A cumulative curve that has zero slope means that no new cases are being confirmed. This is definitely a good thing!

A hypothetical scenario is shown below. The upper graph shows the cumulative cases. The lower graph shows the new cases on the same time scale. After Day 60, there are very few new cases. Accordingly, the curve of cumulative cases is flat.

Thus, a cumulative curve that increases slowly and then levels out is analogous to the slower-the-spread epidemiological curve. When the cumulative curve becomes flat, it indicates that new cases are zero or nearly zero.

The wrong curve: Flattening new cases versus time

Unfortunately, not every curve that is flat indicates that conditions are improving. The primary example is the graph of new cases versus time. A constant rate of new cases is not good—although it is certainly better than an increasing rate. To defeat the pandemic, the number of new cases each day should be decreasing. For a new-cases-by-time curve, a downward-sloping curve is the goal, not a flat curve.

The following graph shows new cases by day in the US (downloaded from the New York Times data on 03May2020). I have added a seven-day rolling average to show a smoothed version of the data:

The seven-day average is no longer increasing. It has leveled out. However, this is NOT an example of a "flattened" curve. This graph shows that the average number of new cases was approximately constant (or perhaps slightly declining) for the last three weeks of April. The fact that the curve is approximately horizontal indicates that approximately 28,000 new cases of coronavirus are being confirmed every day.

The cumulative graph for the same US data is shown below. The slope of the cumulative curve in April 2020 is approximately 28,000 cases per day. This cumulative curve is not flattening, but healthcare workers and public-health officials are all working to flatten it.

Sometimes graphs do not have well-labeled axes, so be sure to identify which graph you are viewing. For a curve of cumulative cases, flatness is "good": there are very few new cases. For a curve of new cases, a downward-sloping curve is desirable.

Summary

In summary, this article discusses "flattening the curve," which is often used without specifying what "the curve" actually is. In the classic epidemiological model, "the curve" is the hypothetical number of new cases versus time under the assumption that society does not adopt public-health measures to slow the spread. A flattened curve refers to the effect of interventions that delay and reduce the peak number of cases.

You cannot observe a hypothetical curve, but you can use the curve of cumulative cases to assess the spread of the disease. A cumulative curve that increases slowly and flattens out indicates a community that has slowed the spread. So, conveniently, you can apply the phrase "flatten the curve" to the curve of cumulative cases.

Note that you do not want the graph of new cases versus time to be flat. You want that curve to decrease towards zero.

You can download the SAS program used to create graphs in this article.


LEARN MORE | See all Coronavirus dashboard blog posts

The post What does 'flatten the curve' mean? To which curve does it apply? appeared first on The DO Loop.