rick wicklin

7月 012020
 

A previous article discusses the pooled variance for two or groups of univariate data. The pooled variance is often used during a t test of two independent samples. For multivariate data, the analogous concept is the pooled covariance matrix, which is an average of the sample covariance matrices of the groups. If you assume that the covariances within the groups are equal, the pooled covariance matrix is an estimate of the common covariance. This article shows how to compute and visualize a pooled covariance matrix in SAS. It explains how the pooled covariance relates to the within-group covariance matrices. It discusses a related topic, called the between-group covariance matrix.

The within-group matrix is sometimes called the within-class covariance matrix because a classification variable is used to identify the groups. Similarly, the between-group matrix is sometimes called the between-class covariance matrix.

Visualize within-group covariances

Suppose you want to analyze the covariance in the groups in Fisher's iris data (the Sashelp.Iris data set in SAS). The data set contains four numeric variables, which measure the length and width of two flower parts, the sepal and the petal. Each observation is for a flower from an iris species: Setosa, Versicolor, or Virginica. The Species variable in the data identifies observations that belong to each group, and each group has 50 observations. The following call to PROC SGPLOT creates two scatter plots and overlays prediction ellipses for two pairs of variables:

title "68% Prediction Ellipses for Iris Data";
proc sgplot data=Sashelp.Iris;
   scatter x=SepalLength y=SepalWidth / group=Species transparency=0.5;
   ellipse x=SepalLength y=SepalWidth / group=Species alpha=0.32 lineattrs=(thickness=2);
run;
proc sgplot data=Sashelp.Iris;
   scatter x=PetalLength y=PetalWidth / group=Species transparency=0.5;
   ellipse x=PetalLength y=PetalWidth / group=Species alpha=0.32 lineattrs=(thickness=2);
run;

The ellipses enable you to visually investigate whether the variance of the data within the three groups appears to be the same. For these data, the answer is no because the ellipses have different shapes and sizes. Some of the prediction ellipses have major axes that are oriented more steeply than others. Some of the ellipses are small, others are relatively large.

You might wonder why the graph shows a 68% prediction ellipse for each group. Recall that prediction ellipses are a multivariate generalization of "units of standard deviation." If you assume that measurements in each group are normally distributed, 68% of random observations are within one standard deviation from the mean. So for multivariate normal data, a 68% prediction ellipse is analogous to +/-1 standard deviation from the mean.

The pooled covariance is an average of within-group covariances

The pooled covariance is used in linear discriminant analysis and other multivariate analyses. It combines (or "pools") the covariance estimates within subgroups of data. The pooled covariance is one of the methods used by Friendly and Sigal (TAS, 2020) to visualize homogeneity tests for covariance matrices.

Suppose you collect multivariate data for \(k\) groups and \(S_i\) is the sample covariance matrix for the \(n_i\) observations within the \(i\)th group. If you believe that the groups have a common variance, you can estimate it by using the pooled covariance matrix, which is a weighted average of the within-group covariances:
\(S_p = \Sigma_{i=1}^k (n_i-1)S_i / \Sigma_{i=1}^k (n_i - 1)\)

If all groups have the same number of observations, then the formula simplifies to \(\Sigma_{i=1}^k S_i / k\), which is the simple average of the matrices. If the group sizes are different, then the pooled variance is a weighted average, where larger groups receive more weight than smaller groups.

Compute the pooled covariance in SAS

In SAS, you can often compute something in two ways. The fast-and-easy way is to find a procedure that does the computation. A second way is to use the SAS/IML language to compute the answer yourself. When I compute something myself (and get the same answer as the procedure!), I increase my understanding.

Suppose you want to compute the pooled covariance matrix for the iris data. The fast-and-easy way to compute a pooled covariance matrix is to use PROC DISCRIM. The procedure supports the OUTSTAT= option, which writes many multivariate statistics to a data set, including the within-group covariance matrices, the pooled covariance matrix, and something called the between-group covariance. (It also writes analogous quantities for centered sum-of-squares and crossproduct (CSSCP) matrices and for correlation matrices.)

proc discrim data=sashelp.iris method=normal pool=yes outstat=Cov noprint;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
 
proc print data=Cov noobs; 
   where _TYPE_ = "PCOV";
   format _numeric_ 6.2;
   var _TYPE_ _NAME_ Sepal: Petal:;
run;

The table shows the "average" covariance matrix, where the average is across the three species of flowers.

Within-group covariance matrices

The same output data set contains the within-group and the between-group covariance matrices. The within-group matrices are easy to understand. They are the covariance matrices for the observations in each group. Accordingly, there are three such matrices for these data: one for the observations where Species="Setosa", one for Species="Versicolor", and one for Species="Virginica". The following call to PROC PRINT displays the three matrices:

proc print data=Cov noobs; 
   where _TYPE_ = "COV" and Species^=" ";
   format _numeric_ 6.2;
run;

The output is not particularly interesting, so it is not shown. The matrices are the within-group covariances that were visualized earlier by using prediction ellipses.

Visual comparison of the pooled covariance and the within-group covariance

Friendly and Sigal (2020, Figure 1) overlay the prediction ellipses for the pooled covariance on the prediction ellipses for the within-group covariances. A recreation of Figure 1 in SAS is shown below. You can use the SAS/IML language to draw prediction ellipses from covariance matrices.

The shaded region is the prediction ellipse for these two variables in the pooled covariance matrix. It is centered at the weighted average of the group means. You can see that the pooled ellipse looks like an average of the other ellipses. This graph shows only one pair of variables, but see Figure 2 of Friendly and Sigal (2020) for a complete scatter plot matrix that compares the pooled covariance to the within-group covariance for each pair of variables.

Between-group covariance matrices

Another matrix in the PROC DISCRIM output is the so-called between-group covariance matrix. Intuitively, the between-group covariance matrix is related to the difference between the full covariance matrix of the data (where the subgroups are ignored) and the pooled covariance matrix (where the subgroups are averaged). The precise definition is given in the next section. For now, here is how to print the between-group covariance matrix from the output of PROC DISCRIM:

proc print data=Cov noobs; 
   where _TYPE_ = "BCOV";
   format _numeric_ 6.2;
run;

How to compute the pooled and between-group covariance

If I can compute a quantity "by hand," then I know that I truly understand it. Thus, I wrote a SAS/IML program that reproduces the computations made by PROC DISCRIM. The following steps are required to compute each of these matrices from first principles.

  1. For each group, compute the covariance matrix (S_i) of the observations in that group.
  2. Note that the quantity (n_i - 1)*S_i is the centered sum-of-squares and crossproducts (CSSCP) matrix for the group. Let M be the sum of the CSSCP matrices. The sum is the numerator for the pooled covariance.
  3. Form the pooled covariance matrix as S_p = M / (N-k).
  4. Let C be the CSSCP data for the full data (which is (N-1)*(Full Covariance)). The between-group covariance matrix is BCOV = (C - M) * k / (N*(k-1)).

You can use the UNIQUE-LOC trick to iterate over the data for each group. The following SAS/IML program implements these computations:

/* Compute a pooled covariance matrix when observations
   belong to k groups with sizes n1, n2, ..., nk, where n1+n2+...+nk = N
*/
proc iml;
varNames = {'SepalLength' 'SepalWidth' 'PetalLength' 'PetalWidth'};
use Sashelp.iris;
   read all var varNames into Z;
   read all var "Species" into Group;
close;
 
/* assume complete cases, otherwise remove rows with missing values */
N   = nrow(Z);
 
/* compute the within-group covariance, which is the covariance for the observations in each group */
u = unique(Group);
k = ncol(u);                /* number of groups */
p   = ncol(varNames);       /* number of variables */
M = j(p, p, 0);             /* sum of within-group CSCCP matrices */
do i = 1 to k;
   idx = loc(Group = u[i]); /* find rows for this group */
   X = Z[idx,];             /* extract obs for i_th group */
   n_i = nrow(X);           /* n_i = size of i_th group */
   S   = cov(X);            /* within-group cov */
   /* accumulate the weighted sum of within-group covariances */
   M = M + (n_i-1) * S;     /* (n_i-1)*S is centered X`*X */
end;
 
/* The pooled covariance is an average of the within-class covariance matrices. */
Sp = M / (N-k);
print Sp[L="Pooled Cov" c=varNames r=VarNames format=6.2];
 
/* The between-class CSSCP is the difference between total CSSCP and the sum of the 
   within-group CSSCPs. The SAS doc for PROC DISCRIM defines the between-class 
   covariance matrix as the between-class SSCP matrix divided by N*(k-1)/k, 
   where N is the number of observations and k is the number of classes.
*/
/* the total covariance matrix ignores the groups */
C = (N-1)*cov(Z);    
BCSSCP = C - M;               /* between = Full - Sum(Within) */
BCov = BCSSCP * k/( N*(k-1) );
print BCov[L="Between Cov" c=varNames r=VarNames format=6.2];

Success! The SAS/IML program shows the computations that are needed to reproduce the pooled and between-group covariance matrices. The results are the same as are produced by PROC DISCRIM.

Summary

In multivariate ANOVA, you might assume that the within-group covariance is constant across different groups in the data. The pooled covariance is an estimate of the common covariance. It is a weighted average of the sample covariances for each group, where the larger groups are weighted more heavily than smaller groups. I show how to visualize the pooled covariance by using prediction ellipses.

You can use PROC DISCRIM to compute the pooled covariance matrix and other matrices that represent within-group and between-group covariance. I also show how to compute the matrices from first principles by using the SAS/IML language. You can download the SAS program that performs the computations and creates the graphs in this article.

The post Pooled, within-group, and between-group covariance matrices appeared first on The DO Loop.

6月 292020
 

The first time I saw a formula for the pooled variance, I was quite confused. It looked like Frankenstein's monster, assembled from bits and pieces of other quantities and brought to life by a madman. However, the pooled variance does not have to be a confusing monstrosity. The verb "to pool" means to combine resources from several sources, as in, "if the siblings pool their money, they can buy a nice gift for their mother." Thus a "pooled variance" combines information from several subgroups of the data in order to obtain a better estimate. This article shows how to understand the pooled variance, which is an average of the sample variances for subgroups of the data.

The following graph visualizes the pooled variance and the variances within groups. In the graph, there are three groups. The sample variance within each group is plotted as a blue marker. The pooled variance is indicated by a horizontal line. The pooled variance appears to be an average of the three sample variances. The exact formulas and the data for this graph are explained in subsequent sections.

The pooled variance is an average of group variances

Most students first encounter the pooled variance in a statistics course when learning about two-sample t tests. In a two-sample t test, you have data in two groups and you want to test whether the means of the two groups are different. In order to run a two-sample t test, you need to decide whether you think the variances of the two groups are equal. If you think the group variances are equal, you compute the pooled variance, which estimates the common variance. You use the pooled variance estimate to compute the t statistic.

The pooled variance combines (or "pools") the variance estimates within the individual groups. The pooled variance is a better estimate of the (unknown) common group variance than either of the individual group variances. If each group has the same number of observations, then the pooled variance is a simple average. If the group sizes are different, then the pooled variance is a weighted average, where larger groups receive more weight than smaller groups.

For the two-sample case, let \(s_1^2\) and \(s_2^2\) be the sample variances within the groups A and B, respectively. Let \(n_1\) and \(n_2\), respectively, be the number of observations in the groups. If you assume that the data in each group have the same variance, how can you estimate that common variance? One way is to "pool" the variances of each group and compute a weighted average of the variances. For two groups, the pooled variance is
\(s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2- 2}\)
The numerator is the weighted sum of the group variances. Dividing by the sum of the weights means that the pooled variance is the weighted average of the two quantities.

Notice that if \(n_1 = n_2\), then the formula simplifies. When the group sizes are equal, the pooled variance reduces to \(s_p^2 = (s_1^2 + s_2^2)/2\), which is the average of the two variances.

The idea behind the pooled variance generalizes to multiple groups. Suppose you collect data for \(k\) groups and \(s_i^2\) is the sample variance within the \(i\)th group, which has \(n_i\) observations. If you believe that the groups have a common variance, how might you estimate it? One way is to compute the pooled variance as a weighted average of the group variances:
\(s_p^2 = \Sigma_{i=1}^k (n_i-1)s_i^2 / \Sigma_{i=1}^k (n_i - 1)\)
Again, if all groups have the same number of observations, then the formula simplifies to \(\Sigma_{i=1}^k s_i^2 / k\), which is the average of the \(k\) variances.

Visualize the pooled variance

Formulas are necessary for computation, but I like to think in pictures. You can draw a graph that shows how the pooled variance relates to the variances of the groups.

Let's use an example that has three groups: A, B, and C. The following data (adapted from an example in the SAS documentation) are the ozone measurements (Y, in parts per billion) at three different collection sites (A, B, and C):

Site  n   Ozone (ppb)
==== === =============================================
A    22   4 6 3 4 7 8 2 3 4 1 3 8 9 5 6 4 6 3 4 7 3 7
B    15   5 3 6 2 1 2 4 3 2 4 6 4 6 3 6
C    18   8 9 7 8 6 7 6 7 9 8 9 8 7 8 5 8 9 7

The following graph shows a box plot of the ozone levels at each site. The height of each box gives a rough estimate of the variance in each group. The mean and variance for sites A and B appear to be similar to each other. However, the mean at site C appears to be larger and the variance seems smaller, compared to the other sites.

You can use your favorite statistical software to estimate the variance of the data for each site. The following table is produced in SAS by using the UNIVARIATE procedure and the CIBASIC option:

The table shows an estimate for the variance of the data within each group. Although the smallest sample variance (Group C: 1.32) seems much smaller than the largest sample variance (Group A: 4.69), notice that the 95% confidence intervals overlap. A parameter value such as 2.8 or 2.9 would simultaneously be in all three confidence intervals. If we assume that the variance of the groups are equal, the pooled variance formula provides a way to estimate the common variance. Applying the formula, the estimate for the pooled variance for these data is
\(s_p^2 = (21*4.69 + 14*2.89 + 17*1.32) / (21+14+17) = 3.10\)

The graph at the top of this article visualizes the information in the table and uses a reference line to indicate the pooled variance. The blue markers indicate the sample variances of each group. The confidence intervals for the population variances are shown by the vertical lines. The pooled variance is indicated by the horizontal reference line. It is the weighted average of the sample variances. If you think that all groups have the same variance, the pooled variance estimates that common variance.

Summary

In two-sample t tests and ANOVA tests, you often assume that the variance is constant across different groups in the data. The pooled variance is an estimate of the common variance. It is a weighted average of the sample variances for each group, where the larger groups are weighted more heavily than smaller groups.

You can download a SAS program that computes the pooled variance and creates the graphs in this article. Although the pooled variance provides an estimate for a common variance among groups, it is not always clear when the assumption of a common variance is valid. The visualization in this article can help, or you can perform a formal "homogeneity of variance" test as part of an ANOVA analysis.

The post What is a pooled variance? appeared first on The DO Loop.

6月 242020
 

If you have ever run a Kolmogorov-Smirnov test for normality, you have encountered the Kolmogorov D statistic. The Kolmogorov D statistic is used to assess whether a random sample was drawn from a specified distribution. Although it is frequently used to test for normality, the statistic is "distribution free" in the sense that it compares the empirical distribution to any specified distribution. I have previously written about how to interpret the Kolmogorov D statistic and how to compute it in SAS.

If you can compute a statistic, you can use simulation to approximate its sampling distribution and to approximate its critical values. Although I am a fan of simulation, in this article I show how to compute the exact distribution and critical values for Kolmogorov D statistic. The technique used in this article is from Facchinetti (2009), "A Procedure to Find Exact Critical Values of Kolmogorov-Smirnov Test."

An overview of the Facchinetti method

Suppose you have a sample of size n and you perform a Kolmogorov-Smirnov test that compares the empirical distribution to a reference distribution. You obtain the test statistic D. To know whether you should reject the null hypothesis (that the test came from the reference distribution), you need to be able to compute the probability CDF(D) = Pr(Dn ≤ D). Facchinetti shows that you can compute the probability for any D by solving a system of linear equations. If a random sample contains n observations, you can form a certain (2n+2) x (2n+2) matrix whose entries are composed of binomial probabilities. You can use a discrete heat map to display the structure of the matrix that is used in the Facchinetti method. One example is shown below. For n=20, the matrix is 42 x 42. As Facchinetti shows in Figure 6 of her paper, the matrix has a block structure, with triangular matrices in four blocks.

The right-hand side of the linear system is also composed of binomial probabilities. The system is singular, but you can use a generalized inverse to obtain a solution. From the solution, you can compute the probability. For the details, see Facchinetti (2009).

Facchinetti includes a MATLAB program in the appendix of her paper. I converted the MATLAB program into SAS/IML and improved the efficiency by vectorizing some of the computations. You can download the SAS/IML program that computes the CDF of the Kolmogorov D statistic and all the graphs in this article. The name of the SAS/IML function is KolmogorovCDF, which requires two arguments, n (sample size) and D (value of the test statistic).

Examples of using the CDF of the Kolmorov distribution

Suppose you have a sample of size 20. You can use a statistical table to look up the critical value of the Kolmogorov D statistic. For α = 0.05, the (two-sided) critical value for n=20 is D = 0.294. If you load and call the KolmogorovCDF function, you can confirm that the CDF for the Kolmogorov D distribution is very close to 0.95 when D=0.294:

proc iml;
load  module=KolmogorovCDF;      /* load the function that implements the Facchinetti method */
/* simple example of calling the function */
n = 20;
D = 0.29408;
cdf = KolmogorovCDF(n, D);
print n D cdf;

The interpretation of this result is that if you draw a random samples of size 20 from the reference distribution, there is a 5% chance that the Kolmogorov D value will exceed 0.294. A test statistic that exceeds 0.294 implies that you should reject the null hypothesis at the 0.05 significance level.

Visualize the Kolmogorov D distribution

If you can compute the CDF for one value of D, you can compute it for a sequence of D values. In this way, you can visualize the CDF curve. The following SAS/IML statements compute the CDF for the Kolmogorov D distribution when n=20 by running the computation on a sequence of D values in the open interval (0, 1):

D = do(0.01, 0.99, 0.01);             /* sequence of D values */
CDF = j(1, ncol(D), .);
do i = 1 to ncol(D);
   CDF[i] = KolmogorovCDF(n, D[i]);   /* CDF for each D */ 
end;
title "Kolmogorov CDF, n=20";
call series(D, CDF) grid={x y};       /* create a graph of the CDF */

The graph enables you to read probabilities and quantiles for the D20 distribution. The graph shows that almost all random samples of size 20 (drawn from the reference distribution) will have a D statistic of less than 0.4. Geometrically, the D statistic represents the largest gap between the empirical distribution and the reference distribution. A large gap indicates that the sample was probably not drawn from the reference distribution.

Visualize a family of Kolmogorov D distributions

The previous graph was for n=20, but you can repeat the computation for other values of n to obtain a family of CDF curves for the Kolmogorov D statistic. (Facchinetti writes Dn to emphasize that the statistic depends on the sample size.) The computation is available in the program that accompanies this article. The results are shown below:

This graph enables you to estimate probabilities and quantiles for several Dn distributions.

Visualize a family of density curves

Because the PDF is the derivative of the CDF, you can use a forward difference approximation of the derivative to also plot the density of the distribution. The following density curves are derived from the previous CDF curves:

A table of critical values

If you can compute the CDF, you can compute quantiles by using the inverse CDF. You can use a root-finding technique to obtain a quantile.

In the SAS/IML language, the FROOT function can find univariate roots. The critical values of the Dn statistic are usually tabulated by listing the sample size down columns and the significance level (α) or confidence level (1-α) across the columns. Facchinetti (p. 352) includes a table of critical values in her paper. The following SAS/IML statements show how to recreate the table. For simplicity, I show only a subset of the table for n=20, 25, 20, and 35, and for four common values of α.

/* A critical value is the root of the function
   f(D) = KolmogorovCDF(n, D) - (1-alpha)
   for any specified n and alpha value.
*/
start KolmogorovQuantile(D) global (n, alpha);
   return  KolmogorovCDF(n, D) - (1-alpha);
finish;
 
/* create a table of critical values for the following vector of n and alpha values */
nVals = do(20, 35, 5);
alphaVals = {0.001 0.01 0.05 0.10};
CritVal = j(ncol(nVals), ncol(alphaVals), .);
do i = 1 to ncol(nVals);
   n = nVals[i];
   do j = 1 to ncol(alphaVals);
      alpha = alphaVals[j];
      CritVal[i,j] = froot("KolmogorovQuantile", {0.01, 0.99});
   end;
end;
print CritVal[r=nVals c=alphaVals format=7.5 L="Critical Values (n,alpha)"];

For each sample size, the corresponding row of the table shows the critical values of D for which CDF(D) = 1-α. Of course, a printed table is unnecessary when you have a programmatic way to compute the quantiles.

Summary

Facchinetti (2009) shows how to compute the CDF for the Kolmogorov D distribution for any value of n and D. The Kolmogorov D statistic is often used for goodness-of-fit tests. Facchinetti's method consists of forming and solving a system of linear equations. To make her work available to the SAS statistical programmer, I translated her MATLAB program into a SAS/IML program that computes the CDF of the Kolmogorov D distribution. This article demonstrates how to use the KolmogorovCDF in SAS/IML to compute and visualize the CDF and density of the Kolmogorov D statistic. It also shows how to use the computation in conjunction with a root-finding method to obtain quantiles and exact critical values of the Kolmogorov D distribution.

The post The Kolmogorov D distribution and exact critical values appeared first on The DO Loop.

6月 222020
 

Sometimes in matrix computations, it is important to display the nonzero elements of a matrix. This can be useful for visualizing the structure of a sparse matrix (one that has many zeros) and it is also useful when describing a matrix algorithm (such as Gaussian elimination) that introduces zeros at each step of the algorithm.

Let A be the matrix you want to visualize. You can visualize the nonzero elements of the matrix A in a graph or in a table:

  • Use a custom SAS format that prints an asterisk (*) for nonzero values and a blank for zero values.
  • Create a discrete heat map of the binary matrix A^=0, which replaces the nonzero cells of A with the value 1.

A text display method

For small matrices, you can use a custom SAS format to print the matrix. The format is defined by using PROC FORMAT. This format displays a star (*) for nonzero values and displays a blank for zero values:

proc format;
value SparseFmt   
      0 = " "      /* display blank instead of 0 */
      other = "*"; /* display '*' instead of number */
run;
 
proc iml;
A = {1.0  0.2  0.0  0.0  0.0  0.0 1e-6,
     0.2  1.0  0.2  0.0  0.0  0.0 0.0,
     0.0  0.2  1.0  0.2  0.0  0.0 0.0,
     0.0  0.0  0.2  1.0  0.2  0.0 0.0,
     1e-6 0.0  0.0  0.2  1.0  0.2 0.0,
     2e-6 1e-6 0.0  0.0  0.2  1.0 0.2,
     3e-6 2e-6 1e-6 0.0  0.0  0.2 1.0 };
 
/* use a custom format to print * or blank in each cell */
print A[format=SparseFmt.];
Visualize a sparse matrix by using a custom format that print a blank or an asterisk

You can see the banded structure of the matrix and also see that elements near the corners are nonzero. This textual visualization is concise and is used in textbooks about numerical linear algebra.

A graphical method

For larger matrices, a graphical method is probably better. You can form the binary logical matrix B=(A^=0) and then plot B by using a two-color heat map. In the SAS/IML language, you can use the HEATMAPDISC call to create a discrete heat map of a matrix. I like to use white for the zeroes. The following SAS/IML statements define a 7 x 7 matrix and use a discrete heat map to display the nonzero values:

/* discrete heat map of binary matrix A^=0 */
ods graphics / width=400px height=400px;
call heatmapdisc(A^=0) colorramp={white steelblue}
         title="Discrete heat map of binary matrix";
Visualize a sparse matrix by using a heat map of a binary matrix where each cell is zero (0) or nonzero (1)

The information in the heat map is identical to the printed display. However, for large matrices, the heat map is easier to view. As shown, you can also use the ODS GRAPHICS statement to control the dimensions of the graph. So, for example, if you are visualizing a square matrix, you can make the graph square.

The post Visualize the structure of a sparse matrix appeared first on The DO Loop.

6月 172020
 

A previous article shows how to use the iml action to read a CAS data table into an IML matrix. This article shows how to write a CAS table from data in an IML matrix. You can read an overview of the iml action, which was introduced in SAS Viya 3.5.

Write a CAS data table from the iml action

Suppose you are running a program in the iml action and you want to write the result of a computation to a CAS data table. The simplest way is to use the the MatrixWriteToCAS subroutine. The syntax of the function is
call MatrixWriteToCAS(matrix, caslib, TableName, colnames);
where

  • matrix is the matrix that you want to save.
  • caslib is a caslib that specifies the location of the table. A blank string means "use the default caslib." Otherwise, specify the name of a caslib. For example, if the CAS table is in your personal caslib, specify 'CASUSER(userName)', where userName is your login name. For more information, see the CAS documentation about caslibs.
  • TableName is a string that specifies the name of the CAS table.
  • colnames is a character vector, whose elements specify the columns in the CAS table.

For simplicity, the following call to the iml action defines a matrix and then writes it to a table. In practice, the matrix would be the result of some computation:

proc cas;
loadactionset 'iml';   /* load action set (once per session) */
source WriteMat;
   corr = {1.0000  0.7874 -0.7095 -0.7173  0.8079,
           0.7874  1.0000 -0.6767 -0.6472  0.6308,
          -0.7095 -0.6767  1.0000  0.9410 -0.7380,
          -0.7173 -0.6472  0.9410  1.0000 -0.7910,
           0.8079  0.6308 -0.7380 -0.7910 1.0000 };
   varNames = {'EngineSize' 'Horsepower' 'MPG_City' 'MPG_Highway' 'Weight'};
   call MatrixWriteToCas(corr, ' ', 'MyCorr', varNames);  /* Write to CAS data table in default caslib */ 
endsource;
iml / code=WriteMat;      /* call the iml action to run the program */
run;

The program writes the matrix to a CAS table named MYCORR. You can call the columnInfo action (which is similar to PROC CONTENTS) to verify that the CAS table exists:

proc cas;
   columnInfo / table='MyCorr';
run;

The output of the columnInfo action shows that the MYCORR data table was created. Notice that in addition to the five specified variables, the CAS table contains a sixth variable, called _ROWID_. This variable is automatically added by the MatrixWriteToCAS subroutine.

How did that variable get there? As explained in the previous article, CAS data tables do not have an inherent row order, but matrices do. The iml action includes the _ROWID_ variable in the output data table in case you need to read the data back into an IML matrix at a later time. If you do, the rows of the new matrix will be in the same order as the rows of the CORR matrix that created the data table. For a correlation matrix, this is important because (by convention) the order of the rows corresponds to the order of the columns. Preserving the row order ensures that the matrix has 1s along the main diagonal.

In summary, you can use the MatrixWriteToCAS subroutine to write a matrix to a CAS table. Not only does the call create a CAS table, but it augments the table with information that can recreate the matrix in the same row order. By the way, if you want to save mixed-type data, you can use the TableWriteToCAS subroutine to write a SAS/IML table to a CAS table.

Further reading

The post Write a CAS data table by using the iml action appeared first on The DO Loop.

6月 152020
 

A previous article compares a SAS/IML program that runs in PROC IML to the same program that runs in the iml action. (You can read an overview of the iml action.) The example in the previous article was very simple and did not read or write data. This article compares a PROC IML program that reads a SAS data set to a similar program that runs in the iml action and reads CAS tables. The iml action was introduced in SAS Viya 3.5.

PROC IML and SAS data sets

An important task for statistical programmers is reading data from a SAS data set into a SAS/IML matrix. SAS/IML programs are often used to pre-process or post-process data that are created by using other SAS procedures, and SAS data sets enable the output from one procedure to become the input to another procedure.

To simplify matters, let's focus on reading numerical data into a SAS/IML matrix. In PROC IML, the USE and READ statements are used to read data into a matrix. The following program is typical. Several numerical variables from the SasHelp.Cars data are read into the matrix X. The call to the CORR function then computes the correlation matrix for those variables:

proc iml;
varNames = {'EngineSize' 'Horsepower' 'MPG_City' 'MPG_Highway' 'Weight'};
use Sashelp.Cars;              /* specify the libref and data set name */
read all var varNames into X;  /* read all specified variables into columns of X */
close;
 
corr = corr(X);
print corr[r=varNames c=varNames format=7.4];
quit;

The USE and READ statements read SAS data sets. However, there are no SAS data sets in CAS, so these statements are not supported in the iml action. Instead, the iml action supports reading a CAS table into a matrix. Whereas the READ statement in PROC IML reads data sequentially, the iml action reads data from a CAS table in parallel by using multiple threads.

The rows in a CAS table do not have a defined order

Before showing how to read a CAS table into a matrix, let's discuss a characteristic of CAS tables that might be unfamiliar to you. Namely, the order of observations in a CAS table is undefined.

Recall that one purpose of CAS is to process massive amounts of data. Large data tables might be stored in a distributed fashion across multiple nodes in a cluster. When rows of data are read, they are read by using multiple threads that execute in parallel. A consequence of this fact is that CAS data tables do not have an inherent order. If this sounds shocking, realize that the order of the observations does not matter for most data analysis. For example, order does not affect descriptive statistics such as the mean or percentiles. Furthermore, any analysis that uses the sum-of-squares crossproduct matrix (X`*X) is unaffected by reordering the observations. This includes correlation, regression, and a lot of multivariate statistics (for example, principal component analysis).

Of course, for some analyses (such as time series) and for some matrix computations, the order of rows is important. Therefore, the iml action has a way to ensure that the rows of a matrix are in a specific order. If the CAS table has a variable with the name _ROWID_ (or _ROWORDER_), then the data rows are sorted by that variable before the IML matrix is created. This happens automatically when you use the MatrixCreateFromCAS function, which is discussed in a subsequent section. (The _ROWID_ is used only to sort the rows; it does not become a column in the matrix.)

You can use the iml action to read any CAS table. If you read a table that does not contain a _ROWID_ variable, the order of rows in the data matrix might change from run to run.

Upload a data set from SAS to CAS

There are several ways to upload a SAS data set into a CAS table. This example uses the DATA step. By using the DATA step, you can also add a _ROWID_ variable to the data, in case you want the rows of the IML matrix to be in the same order as the data set.

I assume you have already established a CAS session. The following statements use the LIBNAME statement in SAS to create a libref to the active caslib in the current CAS session. You can use this libref to upload a data set into a CAS table:

libname mycaslib cas;    /* libref to the active caslib in the current CAS session */
data mycaslib.cars;      /* create CAS table named CARS in the active caslib */
   set Sashelp.Cars;
   _ROWID_ + 1;          /* add a sort variable */
run;
 
/* Optional: verify that the CAS table includes the _ROWID_ variable */
proc cas;
   columnInfo / table='cars';
run;

The output of the columnInfo action shows that the cars data table was created and includes the _ROWID_ variable.

How to read a CAS data table in the iml action

To read variables from a CAS data table into a matrix, you can use the CreateMatrixFromCAS function. The syntax of the function is
X = MatrixCreateFromCAS(caslib, TableName, options);
where

  • caslib is a caslib that specifies the location of the table. You can specify a blank string if you want to use the default caslib. Or you can specify a caslib as a string. For example, if the CAS table is in your personal caslib, specify 'CASUSER(userName)', where userName is your login name. For more information, see the CAS documentation about caslibs.
  • TableName is a string that specifies the name of the CAS table.
  • options is string that enables you to read only part of the data. The most common option is the KEEP statement, which you can use to specify all numeric values ('KEEP=_NUMERIC_', which is the default), all character variables ('KEEP=_CHARACTER_'), or to specify certain variables ('KEEP=X1 X1 Y Z').

Recall that matrices in the SAS/IML language are either numeric or character. If you want to read mixed-type data, you can use the TableCreateFromCAS function.

Read a CAS table in the iml action

In the previous sections, we created a CAS table ('cars') that contains the Sashelp.Cars data. You can use iml action and the MatrixCreateFromCAS function to read that data table into a matrix, as follows:

proc cas;
   loadactionset 'iml';    /* load the iml action set (only once per session) */
run;
 
proc cas;
source ReadData;
   KeepStmt = 'KEEP=EngineSize Horsepower MPG_City MPG_Highway Weight';
   X = MatrixCreateFromCas(' ', 'cars', KeepStmt);
 
   corr = corr(X);
   varNames = {'EngineSize' 'Horsepower' 'MPG_City' 'MPG_Highway' 'Weight'};
   print corr[r=varNames c=varNames format=7.4];
endsource;
iml / code=ReadData;
run;

The output is identical to the output from PROC IML and is not shown.

The KEEP= option specifies five numeric variables to read. The first argument to the MatrixCreateFromCAS function is a blank string, which means "use the default caslib." Alternatively, you could specify a caslib such as 'CASUSER(userName)'. The matrix X is the data matrix. Because the CARS table contains a _ROWID_ variable, the rows of X are in the same order as the rows of the Sashelp.Cars data set.

After you read the data, you can write standard SAS/IML programs to manipulate or analyze X. For example, I used the CORR function and the PRINT statement to reproduce the results of the previous PROC IML program.

In the next article, I will show how to use the iml action to write a CAS table that contains data in an IML matrix.

Further reading

The post Read a CAS data table by using the iml action appeared first on The DO Loop.

6月 122020
 

A previous article provides an introduction and overview of the iml action, which is available in SAS Viya 3.5. The article compares the iml action to PROC IML and states that most PROC IML programs can be modified to run in iml action. This article takes a closer look at what a SAS/IML program looks like in the iml action. If you have an existing PROC IML program, how can you modify it to run in the iml action?

PROC CAS and the SOURCE/ENDSOURCE block

A feature of SAS Viya is that the actions can be called from multiple languages: SAS, Python, R, Lua, etc. I am a SAS programmer, so in my blog, I will use SAS to call actions.

An action (sometimes called a "CAS action") runs by using the SAS Cloud Analytic Services (CAS). The CAS procedure provides a language (called "CASL," for "CAS Language") that enables you to call CAS actions. The documentation for CASL states that "CASL is a scripting language that you use to prepare arguments for execution of CAS actions, submit actions to the CAS server, and then process action results." In short, PROC CAS enables you to "stitch together" a sequence of action calls into a workflow.

PROC CAS supports a SOURCE/ENDSOURCE block, which defines a program that you can submit to an action that supports programming statements. I will use the SOURCE/ENDSOURCE block to define programs for the iml action. To call the iml action by using PROC CAS, you need to do three things:

  1. Load the iml action set. Any CAS action set that is not auto-loaded must be loaded before you can use it. It only needs to be loaded once per session.
  2. Use the SOURCE/ENDSOURCE block to define the IML program.
  3. Call the iml action to run the program. The syntax to call the action is iml / code=ProgramName

In terms of syntax, a typical program in the iml action has the following structure:

PROC CAS;
loadactionset 'iml';        /* load the iml action set */
source ProgramName;
    < put the IML program here >
endsource;
iml / code=ProgramName;     /* run the program in the iml action */
RUN;

Convert a program from PROC IML to the iml action

The following SAS/IML program defines two vectors and computes their inner product. It then computes the variance of the x data vector. Lastly, it centers and scales the data and computes the variance of the standardized data. This program runs in PROC IML:

PROC IML;
   c = {1, 2, 1, 3, 2, 0, 1};   /* weights */
   x = {0, 2, 3, 1, 0, 2, 2};   /* data */
   wtSum = c` * x;              /* inner product (weighted sum) */
   var1 = var(x);               /* variance of original data */
   stdX = (x-mean(x)) / std(x); /* standardize data */
   var2 = var(stdX);            /* variance of standardized data */
   print wtSum var1 var2;
QUIT;

This program does not read or write any data sets, which makes it easy to convert to the iml action. The following statements assume that you have a license for the SAS IML product in Viya and that you have already connected to a CAS server.

/* Example of using PROC CAS in SAS to call the iml action */
PROC CAS;
loadactionset 'iml';            /* load the action set (once) */
source pgm;
   c = {1, 2, 1, 3, 2, 0, 1};   /* weights */
   x = {0, 2, 3, 1, 0, 2, 2};   /* data */
   wtSum = c` * x;              /* inner product (weighted sum) */
   var1 = var(x);               /* variance of original data */
   stdX = (x-mean(x)) / std(x); /* standardize data */
   var2 = var(stdX);            /* variance of standardized data */
   print wtSum var1 var2;
endsource;
iml / code=pgm;                 /* call iml action to run the 'pgm' program */
RUN;

For this example, the SAS/IML statements are exactly the same in both programs. The difference is the way that the program is submitted for execution. For this example, the program is named 'pgm', but you could also have named the program 'MyProgram' or 'StdAnalysis' or 'Robert' or 'Jane'. Whatever identifier you use on the SOURCE statement, use that same identifier for the CODE= parameter when you call in the action.

Calling the iml action from Python

I use PROC CAS in SAS to call CAS actions. However, I know that Python is a popular programming language for some data scientists, so here is how to call the iml action from Python. You first need to install the SAS Scripting Wrapper for Analytics Transfer (SWAT) package. You can then use the following statements to connect to a CAS server and load the action:

# Example of using Python to call the iml action
import swat # load the swat package
s = swat.CAS('myhost', 12345)    # use server='myhost'; port=12345
s.loadactionset('iml')           # load the action set (once)

As mentioned earlier, you only need to load the action set one time per session. After the action set is loaded, you can call the iml action by using the following. In Python, triple quotes enable you to preserve the indention and comments in the IML program.

m = s.iml(code=
"""
   c = {1, 2, 1, 3, 2, 0, 1};   /* weights */
   x = {0, 2, 3, 1, 0, 2, 2};   /* data */
   wtSum = c` * x;              /* inner product (weighted sum) */
   var1 = var(x);               /* variance of original data */
   stdX = (x-mean(x)) / std(x); /* standardize data */
   var2 = var(stdX);            /* variance of standardized data */
   print wtSum var1 var2;
""")

More realistic examples

In this case, the program did not read or write data. However, a typical IML program reads in data, analyzes it, and optionally writes out the results. This is the first (and most common) difference between a program that runs in PROC IML and a similar program that runs in the iml action. A PROC IML program reads and writes SAS data sets, whereas actions read and write CAS data tables.

The next article shows an example of a PROC IML program that reads and writes SAS data sets. It compares the PROC IML program to an analogous program that runs in the iml action and reads and writes CAS tables.

Further reading

The post Getting started with the iml action in SAS Viya appeared first on The DO Loop.

6月 102020
 

This article introduces the iml action, which is available in SAS Viya 3.5. The iml action supports most of the same syntax and functionality as the SAS/IML matrix language, which is implemented in PROC IML. With minimal changes, most programs that run in PROC IML also run in the iml action. In addition, the iml action supports new programming features for parallel programming.

Most actions in SAS Viya perform a specific task, but the iml action is different. The iml action provides a set of general programming tools that you can use to implement a custom parallel algorithm. The programmer can control many aspects of the computation, including how the computation is distributed among nodes and threads on a cluster of machines (or threads on a single machine).

Future articles will address the parallel programming capabilities of the iml action. This article provides an overview of the iml action. What is it? How do you get access to it? How is it similar to and different from PROC IML?

What is the iml action?

Recall that the SAS/IML language is a matrix-vector programming language that supports a rich library of functions in statistics, data analysis, matrix computations, numerical analysis, simulation, and optimization. In SAS 9 ("traditional SAS"), you can access the SAS/IML language by licensing the SAS/IML product and calling the IML procedure. PROC IML is also available in the SAS University Edition.

In SAS Viya 3.5. you get access to the SAS/IML language by licensing the SAS IML product. (Notice that there is no “slash” in the product name.) The SAS IML product gives you access to the iml action and to the IML procedure. Thus, in Viya, you can run all existing PROC IML programs, and you can also write new programs that run in the iml action and use SAS Cloud Analytic Services (CAS).

The iml action belongs to the iml action set. In addition to supporting most of the statements and functions in the SAS/IML language, the iml action supports new functionality that enables you to take advantage of the distributed computational resources in SAS Viya. In particular, you can use the iml action to implement custom parallel algorithms that use multiple nodes and threads on a cluster of machines. Even on one machine, you can run custom parallel programs on a multicore processor.

How is the iml action similar to PROC IML?

The iml action and the IML procedure share a common syntax. The mathematical and statistical function library is essentially the same in the action and in the procedure. Both environments support arithmetic and linear algebraic operations on matrices, operations to subset and query matrices, and programming features such as writing loops and using IF-THEN/ELSE logic.

Of the 300 functions and statements in the SAS/IML run-time library, only a handful of statements are not supported in the iml action. Most differences are related to the difference between the SAS 9 and SAS Viya environments. PROC IML interacts with traditional SAS constructs (such as data sets and catalogs) and supports calling SAS procedures and interacting with files on your local computer. The iml action interacts with analogous constructs in the Viya environment. It can read and write CAS tables, write analytic stores (astores), and can call other Viya actions.

Why use the iml action?

The iml action runs on a CAS server. Why might you choose to use the iml action instead of PROC IML? Or convert an existing PROC IML program into the iml action? There are two main reasons:

  • You want to use the SAS/IML language as part of a sequence of actions that analyze data that are in CAS tables. By using the iml action, you can read and write CAS tables directly. If you use PROC IML, you need to pull the data from CAS into a SAS data set, run the analysis in PROC IML, and then push the results to a CAS table.
  • You want to take advantage of the capabilities of the CAS server to perform parallel processing. You can use the iml action to create custom parallel computations.

How is the iml action different from PROC IML?

As mentioned previously, the iml action does not support every function and statement that PROC IML supports. The unsupported functions and statements are primarily in four areas:

  • Base SAS functions that are not supported in the CAS DATA step. For example, the old random number generator functions RANUNI and RANNOR are not supported in CAS because they cannot generate independent streams of random number in parallel.
  • Statements that read or write SAS data sets or text files.
  • Functions that create graphics. CAS is for computations. You can download the results of the computation to whatever language you are using to call the CAS actions. For example, I download the results to SAS and create SAS graphs, but you could also use Python or R.
  • SAS/IML functions that were deprecated in earlier releases of SAS.

See the documentation for the iml action for a complete list of the PROC IML functions and statements that are not supported in the iml action.

Will my program run faster in the iml action than PROC IML?

If you take an existing PROC IML program and run it in the iml action, it will take about the same amount of time to run. Sure, it might run a little faster if the machines in your CAS cluster are newer and more powerful than the SAS server or your PC, but that speedup is due to hardware. An existing program does not automatically run faster in the iml action because it runs serially until it encounters a programming statement that can be executed in parallel. There are two main sets of statements that run in parallel:

  • Reading and writing data from CAS tables.
  • Functions that distribute computations. You, the programmer, need to call these functions to write parallel programs.

So, yes, you can get certain programs to run faster in the iml action, but it doesn't happen automatically. You have to add new input/output statements or call functions that execute tasks in parallel.

Should you convert from PROC IML to the iml action?

What does this mean for the SAS/IML programmer whose company is changing from SAS 9 to Viya? Do you need to convert hundreds of existing PROC IML programs to run in the iml action? No, absolutely not. As mentioned previously, when you license SAS IML on Viya, you get both PROC IML and the iml action. The existing programs that you wrote in SAS 9 will continue to run in PROC IML in Viya.

The Viya platform provides an opportunity to use the iml action but does not require it. Under what circumstances might you want to convert a program from PROC IML to the iml action? Or write a new program in the iml action instead of using PROC IML? In my opinion, it comes down to two issues: workflow and performance.

  • Workflow: If your company is using CAS actions and CAS-enabled procedures, and if your data are stored in CAS tables, then it makes sense to use the iml action instead of the IML procedure. The SAS/IML language is often used to pre- or post-process data for other procedures or actions. The iml action can read and write CAS tables that are created by other CAS actions or that will be consumed by other CAS actions.
  • Performance: Suppose that you have a computation that takes a long time to process in PROC IML, but the computation is "embarrassingly parallel." An embarrassingly parallel problem, is one that consists of many identical independent subtasks. The iml action supports several functions for distributing a computation to multiple threads. Examples of embarrassingly parallel computations in statistics and machine learning include Monte Carlo simulation, resampling methods such as the bootstrap, ensemble models, and many "brute force" computations.

Further reading

I continue this exploration of the iml action in subsequent articles. Related articles include:

  • A Getting Started example that shows how to call the iml action and discusses how the action is similar to and different from PROC IML.
  • An example that shows how to read and write CAS tables from the iml action.
  • The MAPREDUCE function, which enables you to distribute a computation across threads and nodes.
  • The PARTASKS function, which enables you to distribute multiple independent computations across threads and nodes.
  • The SCORE function, which enables you to evaluate a function in parallel on every row of a CAS table.

For more information and examples, see Wicklin and Banadaki (2020), "Write Custom Parallel Programs by Using the iml Action," which is the basis for these blog posts. Another source is the SAS IML Programming Guide, which includes documentation and examples for the iml action.

The post An introduction to the iml action in SAS Viya appeared first on The DO Loop.

6月 082020
 

A SAS customer asked how to specify interaction effects between a classification variable and a spline effect in a SAS regression procedure. There are at least two ways to do this. If the SAS procedure supports the EFFECT statement, you can build the interaction term in the MODEL statement. For procedures that do not support the EFFECT statement, you can generate the interaction effects yourself. Because the customer's question was specifically about the GAMPL procedure, which fits generalized additive models, I also show a third technique, which enables you to model interactions in the GAMPL procedure.

To illustrate the problem, consider the following SAS DATA step, which simulates data from two different response functions:

data Have;
call streaminit(1);
Group="A"; 
do x = 1 to 10 by .1;
   y =    2*x + sin(x) + rand("Normal"); output;
end;
Group="B";
do x = 1 to 10 by .1;
   y = 10 - x - sin(x) + rand("Normal"); output;
end;
run;

The graph shows the data and overlays a regression model. The model predicts the response for each level of the Group variable. Notice that the two curves have different shapes, which means that there is an interaction between the Group variable and the x variable. The goal of this article is to show how to use a spline effect to predict the response for each group.

Use the grammar of the procedure to form interactions

Many SAS regression procedures support the EFFECT statement, the CLASS statement, and enable you to specify interactions on the MODEL statement. Examples include the GLMMIX, GLMSELECT, LOGISTIC, QUANTREG, and ROBUSTREG procedures. For example, the following call to PROC GLMSELECT specifies several model effects by using the "stars and bars" syntax:

  • The syntax Group | x includes the classification effect (Group), a linear effect (x), and an interaction effect (Group*x).
  • The syntax Group * spl includes an interaction effect between the classification variable and the spline.
title "GLMSELECT Model with Spline and Interaction Effects";
ods select none;
proc glmselect data=Have
        outdesign(addinputvars fullmodel)=SplineBasis; /* optional: write design matrix to data set */
   class Group;
   effect spl = spline(x);
   model y = Group|x  Group*spl / selection=none;
   output out=SplineOut predicted=p;            /* output predicted values */
quit;
ods select all;
 
%macro PlotInteraction(dsname);
   proc sgplot data=&dsname;
      scatter x=x y=y / group=Group;
      series x=x y=p / group=Group curvelabel;
      keylegend / location=inside position=E;
   run;
%mend;
%PlotInteraction(SplineOut);

The predicted curves are shown in the previous section. I wrapped the call to PROC SGPLOT in a macro so that I can create the same plot for other models.

Output the design matrix for the interactions

Some SAS procedures do not support the EFFECT statement. For these procedures, you cannot form effects like Group*spline(x) within the procedure. However, the GLMSELECT procedure supports the OUTDESIGN= option, which writes the design matrix to a SAS data set. I used the option in the previous section to create the SplineBasis data set. The data set includes variables for the spline basis and for the interaction between the Group variable and the spline basis. Therefore, you can use the splines and interactions with splines in other SAS procedures.

For example, the GLM procedure does not support the EFFECT statement, so you cannot generate a spline basis within the procedure. However, you can read the design matrix from the previous call to PROC GLMSELECT. The interaction effects are named spl_Group_1_A, spl_Group_1_B, spl_Group_2_A, ...., where the suffix "_i_j" indicates the interaction effect between the i_th spline basis and the j_th level of the Group variable. You can type the names of the design variables, or you can use the "colon notation" to match all variables that begin with the prefix "spl_Group". The following call to PROC GLM computes the same model as in the previous section:

proc glm data=SplineBasis;
   class Group;
   model y = Group|x  spl_Group: ;           /* spl_Group: => spl_Group_1_A,  spl_Group_1_B, ... */
   output out=GLMOut predicted=p;            /* output predicted values */
quit;
 
/* show that the predicted values are the same */
data Comp;
   merge SplineOut GLMOut(keep=p rename=(p=p2));
   diff = p - p2;
run;
proc means data=comp Mean Min Max;  var diff;  run;

The output of PROC MEANS shows that the two models are the same. There is no difference between the predicted values from PROC GLM (which reads the design matrix) and the values from PROC GLMSELECT (which reads the raw data).

The splines of the interactions versus the interactions of the splines

Some nonparametric regression procedures, such as the GAMPL procedure, have their own syntax to generate spline effects. In fact, PROC GAMPL uses thin-plate splines, which are different from the splines that are supported by the EFFECT statement. Recently, a SAS customer asked how he could model interaction terms in PROC GAMPL.

The GAMPL procedure supports semiparametric models. In the parametric portion of the model, it is easy to specify interactions by using the standard "stars and bars" notation in SAS. However, the SPLINE() transformation in PROC GAMPL does not support interactions between a classification variable and a continuous variable. To be specific, consider the following semiparametric model:

title "No Interaction Between Classification and Spline Effects";
proc gampl data=Have; *plots=components;
   class Group;
   model Y = param(Group | x)     /* parametric terms */
             spline(x);           /* nonprametric, but no interaction */
   output out=GamPLOut pred=p r=r;
   id Y X Group;
run;
 
%PlotInteraction(SplineOut);

The output shows that the model does not capture the nonlinear features of the data. This is because the spline term is computed for all values of x, not separately for each group. When the groups are combined, the spline effect is essentially linear, which is probably not the best model for these data.

Unfortunately, in SAS/STAT 15.1, PROC GAMPL does not support a syntax such as "spline(Group*x)", which would enable the shape of the nonparametric curve to vary between levels of the Group variable. In fact, the 15.1 documentation states, "only continuous variables (not classification variables) can be specified in spline-effects."

However, if your goal is to use a flexible model to fit the data, there is a model that might work. Instead of the interaction between the Group variable and spline(x), you could use the spline effects for the interaction between the Group variable and x. These are not the same models: the interaction with the splines is not equal to the spline of the interactions! However, let's see how to fit this model to these data.

The idea is to form the interaction effect Group*x in a separate step. You can use the design matrix (SplineBasis) that was created in the first section, but to make the process as transparent as possible, I am going to manually generate the dummy variables for the interaction effect Group*x:

data Interact; 
set Have; 
x_Group_A = x*(Group='A'); 
x_Group_B = x*(Group='B');
run;

You can use the spline effects for the variable x_Group_A and x_Group_B in regression models. The following model is not the same as the previous models, but is a flexible model that seems to fit these data:

title "Interaction Between Classification and Spline Effects";
proc gampl data=Interact plots=components;
   class Group;
   model Y = param(Group | x) 
             spline(x_Group_A) spline(x_Group_B);  /* splines of interactions */
   output out=GamPLOut pred=p;
   id Y X Group;
run;
 
title "GAMPL with Interaction Effects";
%PlotInteraction(GamPLOut);

The graph shows that (visually) the model seems to capture the undulations in the data as well as the earlier model. Although PROC GAMPL does not currently support interactions between classification variables and spline effects, this trick provides a way to model the data.

The post Interactions with spline effects in regression models appeared first on The DO Loop.

6月 032020
 

I recently read an article that describes ways to compute confidence intervals for the difference in a percentile between two groups. In Eaton, Moore, and MacKenzie (2019), the authors describe a problem in hydrology. The data are the sizes of pebbles (grains) in rivers at two different sites. The authors ask whether the p_th percentile of size is significantly different between the two sites. They also want a confidence interval for the difference. The median (50th percentile) and 84th percentile were the main focus of their paper.

For those of us who are not hydrologists, consider a sample from group 'A' and a sample from group 'B'. The problem is to test whether the p_th percentile is different in the two groups and to construct a confidence interval for the difference.

The authors show two methods: a binomial-based probability model (Section 2) and a bootstrap confidence interval (Section 3). However, there is another option: use quantile regression to estimate the difference in the quantiles and to construct a confidence interval. In this article, I show the computation for the 50th and 84th percentiles and compare the results to a bootstrap estimate. You can download the SAS program that creates the analyses and graphs in this article.

If you are not familiar with quantile regression, see my previous article about how to use quantile regression to estimate the difference between two medians.

A data distribution of particle sizes

For convenience, I simulated samples for the sizes of pebbles in two stream beds:

  • For Site 'A', the sizes are lognormally distributed with μ=0.64 and σ=1. For the LN(0.64, 1) distribution, the median pebble size is 1.9 mm and the size for the 84th percentile is 5.1 mm.
  • For Site 'B', the sizes are lognormally distributed with μ=0.53 and σ=1. For the LN(0.53, 1) distribution, the median pebble size is 1.7 mm and the size for the 84th percentile is 4.6 mm.

I assume that sizes are measured to the nearest 0.1 mm. I simulated 500 pebbles from Site 'A' and 300 pebbles from Site 'B'. You can use PROC UNIVARIATE in SAS to compute a comparative histogram and to compute the percentiles. The sample distributions are shown below:

For Site 'A', the estimates for the sample median and 84th percentile are 2.05 mm and 5.1 mm, respectively. For Site 'B', the estimates are 1.60 mm and 4.7 mm, respectively. These estimates are close to their respective parameter values.

Estimate difference in percentiles

The QUANTREG procedure in SAS makes it easy to estimate the difference in the 50th and 84th percentiles between the two groups. The syntax for the QUANTREG procedure is similar to other SAS regression procedures such as PROC GLM:

proc quantreg data=Grains;
   class Site;
   model Diameter = Site / quantile=0.5 0.84;
   estimate 'Diff in Pctl' Site 1 -1 / CL;
run;

The output from PROC QUANTREG shows estimates for the difference between the percentiles of the two groups. For the 50th percentile, an estimate for the difference is 0.4 with a 95% confidence of [0.09, 0.71]. Because 0 is not in the confidence interval, you can conclude that the median pebble size at Site 'A' is significantly different from the median pebble size at Site 'B'. For the 84th percentile, an estimate for the difference is 0.3 with a 95% confidence of [-0.57, 1.17]. Because 0 is in the interval, the difference between the 84th percentiles is not significantly different from 0.

Methods for estimating confidence intervals

The QUANTREG procedure supports several different methods for estimating a confidence interval: sparsity, rank, and resampling. The estimates in the previous section are by using the RANK method, which is the default for smaller data sets. You can use the CI= option on the PROC QUANTREG statement to use these methods and to specify options for the methods. The following graph summarizes the results for four combinations of methods and options. The results of the analysis do not change: The medians are significantly different but the 84th percentiles are not.

A comparison with bootstrap estimates

When you use a resampling-based estimate for the confidence interval, the interval depends on the random number seed, the algorithm used to generate random numbers, and the number of bootstrap iterations. This can make it hard to compare Monte Carlo results to the results from deterministic statistical methods. Nevertheless, I will present one result from a bootstrap analysis of the simulated data. The following table shows the bootstrap percentile estimates (used by Eaton, Moore, and MacKenzie (2019)) for the difference between percentiles.

The confidence intervals (the Lower and Upper columns of the table) are comparable to the intervals produced by PROC QUANTREG. The confidence interval for the difference between medians seems equivalent. The bootstrap interval for the 84th percentile is shifted to the right relative to the QUANTREG intervals. However, the inferences are the same: the medians are different but there is no significant difference between the 84th percentiles.

The following histogram shows the difference between the 84th percentiles for 5,000 bootstrap samples. The confidence interval is determined by the 2.5th and 97.5th percentiles of the bootstrap distribution. The computation requires only a short SAS/IML program, which runs very quickly.

Summary

Data analysts might struggle to find an easy way to compute the difference between percentiles for two (or more) groups. A recent paper by Eaton, Moore, and MacKenzie (2019) proposes one solution, which is to use resampling methods. An alternative is to use quantile regression to estimate the difference between percentiles, as shown in this article. I demonstrate the method by using simulated data of the form studied by Eaton, Moore, and MacKenzie.

You can download the SAS program that creates the analyses and graphs in this article.

The post How to estimate the difference between percentiles appeared first on The DO Loop.